Llama.cpp guide – Running LLMs locally on any hardware, from scratch
The guide on SteelPh0enix's blog details running large language models locally using llama.cpp, highlighting hardware options, quantization benefits, setup instructions, and encouraging non-commercial self-hosting experimentation.
The guide on SteelPh0enix's blog provides a comprehensive overview of running large language models (LLMs) locally using llama.cpp, software that allows users to self-host LLMs on various hardware configurations. The author shares their journey from skepticism about AI to successfully running LLMs on a personal GPU setup. They clarify that while a high-end GPU can enhance performance, it is not strictly necessary; modern CPUs can also run LLMs effectively, albeit with varying performance levels. The guide emphasizes the importance of quantization, which enables LLMs to run on less powerful hardware, including devices like Raspberry Pi. It also discusses prerequisites for running llama.cpp, including hardware specifications and software dependencies, and provides step-by-step instructions for building and setting up the software on both Windows and Linux. The author encourages users to explore self-hosting LLMs for non-commercial purposes while noting that commercial use may require different considerations. Overall, the guide serves as a valuable resource for those interested in experimenting with LLMs locally.
- Users can run LLMs on various hardware, including CPUs and GPUs.
- Quantization allows LLMs to operate on less powerful devices.
- The guide includes detailed instructions for building and setting up llama.cpp.
- Self-hosting LLMs is encouraged for non-commercial use.
- Performance varies based on hardware and model size.
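For orientation, a minimal build-download-run flow along the lines the guide walks through might look like the sketch below. This is not quoted from the guide: the CMake invocation reflects current llama.cpp defaults (CPU-only unless a backend flag is added), and the model file is the same example GGUF that appears later in the comments.
# clone and build llama.cpp (CPU-only; add a backend flag such as -DGGML_CUDA=ON for GPU offload)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
# download a quantized GGUF model from Hugging Face
wget https://huggingface.co/TheBloke/dolphin-2.2.1-mistral-7B-GGUF/resolve/main/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf
# chat with it interactively; binaries land under build/bin in recent versions
build/bin/llama-cli -m dolphin-2.2.1-mistral-7b.Q5_K_M.gguf -c 4096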
Assuming you want to do this interactively (at least for the first time), you should only need to run:
ccmake .
And toggle the parameters your hardware supports or that you want (e.g. CUDA if you're using Nvidia, Metal if you're using Apple, etc.), press 'c' (configure), then 'g' (generate), then:
cmake --build . -j $(expr $(nproc) / 2)
Done. If you want to move the binaries into your PATH, you could then optionally run cmake --install .
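If you'd rather skip the interactive step, the same build can be done non-interactively by passing the backend flags directly. A sketch assuming an Nvidia GPU; the GGML_CUDA / GGML_METAL option names are the ones used by current llama.cpp (older checkouts used LLAMA_CUBLAS / LLAMA_METAL):
# configure with the CUDA backend enabled (use -DGGML_METAL=ON on Apple Silicon instead)
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
# build with half of the available cores
cmake --build build -j $(expr $(nproc) / 2)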
Impressively, it worked. It was slow to spit out tokens, at a rate of around a word every 1 to 5 seconds, and it was able to correctly answer "What was the biggest planet in the solar system?", but it quickly hallucinated, talking about moons that it called "Jupterians", while I expected it to talk about the Galilean moons.
Nevertheless, LLMs really impressed me, and as soon as I get my hands on better hardware I'll try to run other, bigger models locally in the hope that I'll finally have a personal "oracle" able to quickly answer most questions I throw at it and help me write code and other fun things. Of course, I'll have to check its answers before using them, but the current state seems impressive enough for me, especially QwQ.
Is anyone running smaller experiments and can talk about your results? Is it already possible to have something like an open-source co-pilot running locally?
#!/bin/sh
export OLLAMA_MODELS="/mnt/ai-models/ollama/"
printf 'Starting the server now.\n'
ollama serve >/dev/null 2>&1 &
serverPid="$!"
printf 'Starting the client (might take a moment (~3min) after a fresh boot).\n'
ollama run llama3.2 2>/dev/null
printf 'Stopping the server now.\n'
kill "$serverPid"
And it just works :-)
I find PyTorch easier to get up and running. For quantization, AWQ models work, and it's just a "pip install" away.
sudo apt -y install git wget hipcc libhipblas-dev librocblas-dev cmake build-essential
# add yourself to the video and render groups
sudo usermod -aG video,render $USER
# reboot to apply the group changes
# download a model
wget --continue -O dolphin-2.2.1-mistral-7b.Q5_K_M.gguf \
https://huggingface.co/TheBloke/dolphin-2.2.1-mistral-7B-GGUF/resolve/main/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf?download=true
# build llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git checkout b3267
HIPCXX=clang++-17 cmake -S. -Bbuild \
-DGGML_HIPBLAS=ON \
-DCMAKE_HIP_ARCHITECTURES="gfx803;gfx900;gfx906;gfx908;gfx90a;gfx1010;gfx1030;gfx1100;gfx1101;gfx1102" \
-DCMAKE_BUILD_TYPE=Release
make -j8 -C build
# run llama.cpp
build/bin/llama-cli -ngl 32 --color -c 2048 \
--temp 0.7 --repeat_penalty 1.1 -n -1 \
-m ../dolphin-2.2.1-mistral-7b.Q5_K_M.gguf \
--prompt "Once upon a time"
I think this will also work on Rembrandt, Renoir, and Cezanne integrated GPUs with Linux 6.10 or newer, so you might be able to install the HWE kernel to get it working on that hardware. With that said, users with CDNA 2 or RDNA 3 GPUs should probably use the official AMD ROCm packages instead of the built-in Ubuntu packages, as there are performance improvements for those architectures in newer versions of rocBLAS.
Spoiler: Vulkan with MSYS2 was indeed the easiest to get up and running.
I actually tried w64devkit first and it worked properly for llama-server, but there were inexplicable plug-in problems with llama-bench.
Edit: I tried w64devkit before I read this write-up and I was left wondering what to try next, so the timing was perfect.
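For reference, a Vulkan build under MSYS2 (UCRT64 shell) roughly follows the pattern below. Treat it as a sketch: the package names and the GGML_VULKAN flag are taken from current llama.cpp build docs and may differ for older checkouts or other MSYS2 environments.
# install toolchain, CMake, Vulkan headers/loader, and shader compiler
pacman -S --needed git mingw-w64-ucrt-x86_64-gcc mingw-w64-ucrt-x86_64-cmake mingw-w64-ucrt-x86_64-vulkan-devel mingw-w64-ucrt-x86_64-shaderc
# configure and build with the Vulkan backend
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j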
With llama.cpp running on a machine, how do you connect your LLM clients to it and request that a model be loaded with a given set of parameters and templates?
... you can't, because llama.cpp is the inference engine, and its bundled llama-server binary only provides relatively basic server functionality; it's really more of a demo/example or MVP.
Llama.cpp is configured entirely at the time you run the binary: you manually provide command-line args for the one specific model and configuration you start it with.
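Concretely, the workflow is to start the server for a single model with all options fixed up front, then point a client at it. A minimal sketch, assuming a recent llama.cpp build where llama-server exposes an OpenAI-compatible endpoint; the model path and context size are placeholders:
# start the bundled server for one pre-configured model
build/bin/llama-server -m ./some-model.Q4_K_M.gguf -c 4096 -ngl 99 --host 127.0.0.1 --port 8080
# query it from any OpenAI-compatible client, e.g. curl
curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Hello!"}]}'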
Ollama provides a server and client for interfacing with and packaging models, with features such as:
- Hot loading models (e.g. when you request a model from your client, Ollama will load it on demand).
- Automatic model parallelisation.
- Automatic model concurrency.
- Automatic memory calculations for layer and GPU/CPU placement.
- Layered model configuration (basically docker images for models).
- Templating and distribution of model parameters and templates in a container image.
- Near feature-complete OpenAI-compatible API, as well as its native API that supports more advanced features such as model hot loading, context management, etc. (sketched below).
- Native libraries for common languages.
- Official container images for hosting.
- Provides a client/server model for running remote or local inference servers with either Ollama or OpenAI-compatible clients.
- Support for both official and self-hosted model and template repositories.
- Support for multi-modal / Vision LLMs - something that llama.cpp is not focusing on providing currently.
- Support for serving safetensors models, as well as running and creating models directly from their Huggingface model ID.
In addition to the llama.cpp engine, Ollama is working on adding additional model backends (e.g. things like exl2, awq, etc.). Ollama is not "better" or "worse" than llama.cpp, because it's an entirely different tool.
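For example, hot loading in practice just means sending a request that names the model, and the Ollama server loads it on demand. A short sketch against Ollama's native API on its default port; the model name is only an example:
# fetch the model once, then the server loads it whenever a request names it
ollama pull llama3.2
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": false}'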
ollama didn’t have the issue, but it’s less configurable.
One thing I’m unsure of is how to pick a model. I downloaded the 7B one from Huggingface, but how is anyone supposed to know what these models are for, or if they’re any good?
What do you use Llama.cpp for?
I get you can ask it a question in natural language and it will spit out sort of an answer, but what would you do with it, what do you ask it?
Related
Run the strongest open-source LLM model: Llama3 70B with just a single 4GB GPU
The article discusses the release of open-source Llama3 70B model, highlighting its performance compared to GPT-4 and Claude3 Opus. It emphasizes training enhancements, data quality, and the competition between open and closed-source models.
How to run an LLM on your PC, not in the cloud, in less than 10 minutes
You can easily set up and run large language models (LLMs) on your PC using tools like Ollama, LM Suite, and Llama.cpp. Ollama supports AMD GPUs and AVX2-compatible CPUs, with straightforward installation across different systems. It offers commands for managing models and now supports select AMD Radeon cards.
How to Run Llama 3 405B on Home Devices? Build AI Cluster
The article explains how to run the Llama 3.1 405B model on home devices using the Distributed Llama project, detailing setup, resource requirements, and methods for efficient execution across multiple devices.
Llama.cpp Now Part of the Nvidia RTX AI Toolkit
NVIDIA's RTX AI platform supports llama.cpp, a lightweight framework for LLM inference, optimized for RTX systems, enhancing performance with CUDA Graphs and facilitating over 50 application integrations.
Everything I've learned so far about running local LLMs
Local Large Language Models (LLMs) now run on modest hardware, enhancing accessibility. The llama.cpp software simplifies usage, while Hugging Face offers various models. Understanding specifications is vital for optimization.