November 10th, 2024

Everything I've learned so far about running local LLMs

Local Large Language Models (LLMs) now run on modest hardware, making them far more accessible. The llama.cpp software simplifies running them, while Hugging Face hosts a wide range of models. Understanding model specifications and quantization options is key to getting good performance.


The exploration of local Large Language Models (LLMs) reveals significant advancements in accessibility and performance. Users can now run LLMs on modest hardware, such as a Raspberry Pi or a standard desktop, for a private, offline, registration-free experience. The article emphasizes the rapid evolution of LLM technology, making it essential to stay updated with the latest developments. The author shares practical insights on running LLMs, focusing on the llama.cpp software, which enables simple CPU inference without the complexities of Python. For GPU inference, the amount of video RAM (VRAM) is crucial, and models are recommended based on available resources. The author highlights Hugging Face as a key source for downloading models, particularly in the GGUF format used by llama.cpp. Various models are discussed, including Mistral-Nemo-2407, the Qwen models, and Google's Gemma, each with its own strengths and weaknesses. The author notes the importance of understanding model specifications and quantization options to optimize performance. Overall, the article serves as a guide for those interested in exploring local LLMs, covering software, models, and practical usage.
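
A concrete sketch of the GGUF workflow the article describes (the repository, file names, and paths below are placeholders, not recommendations from the article):

  # Fetch a quantized GGUF file straight from Hugging Face (placeholder repo/file)
  mkdir -p models
  curl -L -o models/some-model.Q4_K_M.gguf \
    "https://huggingface.co/some-org/Some-Model-GGUF/resolve/main/some-model.Q4_K_M.gguf"
  # One-off CPU inference with llama.cpp
  ./llama-cli -m models/some-model.Q4_K_M.gguf -p "Summarize this document in one sentence."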

- Local LLMs can now be run on modest hardware, enhancing accessibility.

- The llama.cpp software simplifies the process of running LLMs without complex dependencies.

- Hugging Face is a primary source for downloading various LLM models in GGUF format.

- Different models have unique strengths, making it essential to choose based on specific use cases.

- Staying updated with the rapid advancements in LLM technology is crucial for effective usage.

15 comments
By @TZubiri - 13 days
"I’ve exclusively used the astounding llama.cpp. Other options exist, but for basic CPU inference — that is, generating tokens using a CPU rather than a GPU — llama.cpp requires nothing beyond a C++ toolchain. In particular, no Python fiddling that plagues much of the ecosystem. On Windows it will be a 5MB llama-server.exe with no runtime dependencies"

Will definitely give llama.cpp a go, great selling point.

I've tried running both Meta Llama and GPT-2, and they both relied on some complex virtualization toolchain of either Docker or a thing called conda, and the dependency list was looong; any issue at any point caused a blockage. I tried on 3 machines, and in a whole day, as a somewhat senior dev, I couldn't get it running.
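
For reference, the llama.cpp route appears to be roughly this (paths and the model file are placeholders; I haven't verified the exact build options):

  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp
  cmake -B build && cmake --build build --config Release
  # Serve any GGUF model over a local HTTP endpoint
  ./build/bin/llama-server -m models/your-model.gguf --port 8080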

By @accrual - 13 days
As a retro PC hobbyist I loved this line:

> Just for fun, I ported llama.cpp to Windows XP and ran a 360M model on a 2008-era laptop. It was magical to load that old laptop with technology that, at the time it was new, would have been worth billions of dollars.

I wonder how difficult it would be to compile modern C++ on XP. I may give it a shot and reach out to the author if needed! :)

By @jijji - 13 days
Although I like the article, the author doesn't acknowledge the way that (I think) most people are using llama.cpp at this point. ollama.com has simplified that work into two lines:

  curl -fsSL https://ollama.com/install.sh | sh

  ollama run llama3.2
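
Ollama also exposes a local HTTP API (on port 11434 by default), so other tools can call the model; a quick smoke test looks roughly like:

  curl http://localhost:11434/api/generate -d '{
    "model": "llama3.2",
    "prompt": "Why is the sky blue?",
    "stream": false
  }'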

By @kkfx - 13 days
I have a much more condensed "sysadmin" experience, which I'd summarize as:

- I have found some personal use cases, but none of the LLMs I've tried really work for them;

- those who publish LLM-style software (also true for SD and the like for images) have no real interest in FLOSS; they simply push code with no structure, monsters with unmanaged dependencies and next to zero documentation. It feels more like an enterprise OSS trend than FLOSS.

Long story short: my personal use case is finding hard-to-find notes (org-mode, maaaaany headings, lots of annotated news in various languages), where "hard to find" means "if I recall correctly I've noted ~$something but still fail to find it, both by searching headings (i.e. titles) and by ripgrepping brutally", plus spotting trends ("I've noted various natural phenomena over the last few years; what's the trend in noted floods, wildfires, ...?"). In all cases I've managed quicker and better with a simple org-roam-node-find (i.e. looking at titles), eventually with embark on the results, or rg (+ embark).

That's it. They might be useful, like Alphabet's NotebookLM, for quickly getting a clue about a PDF someone sent me, but so far I've found nothing interesting that doesn't demand more time packaging and keeping the project and its deps updated on my desktop than simply skimming the papers myself...

By @aperrien - 13 days
I'm designing a new PC and I'd like to be able to run local models. It's not clear to me from posts online what the specs should be. Do I need 128 GB of RAM? Or would a 16 GB RTX 4060 be better? Or should I get a 4070 Ti? If anyone could point me toward some good guidelines, I'd greatly appreciate it.
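
A rough rule of thumb (a heuristic, not something from the article): weight memory is roughly parameters × bits-per-weight / 8, plus headroom for the KV cache and runtime, so the quantization level matters as much as the raw parameter count.

  # e.g. a 7B model at 4-bit quantization:
  awk 'BEGIN { printf "%.1f GB for weights\n", 7e9 * 4 / 8 / 1e9 }'
  # ~3.5 GB, which fits easily in 16 GB of VRAM; a 70B model at 4-bit (~35 GB) would not.
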
By @ekianjo - 13 days
There's a lot more than the few applications described at the end of the article. Even smaller models can achieve many useful tasks: editing text, summarizing (not-too-long) documents, writing reasonable emails, expanding on existing text, adding details to a document, changing the turn of phrase, imitating someone's writing style... and more!

RAG is a very difficult topic. A basic RAG will just be crap and fail to answer questions properly most of the time. Once you accumulate techniques to improve beyond that baseline, however, it can become something very close to a proficient assistant for a specific domain (assuming you've indexed the files of interest), and it doubles as a local search engine.

LLMs have many limitations, but once you understand their constraints, they can still do a LOT.
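
As a hedged sketch of what even a "basic RAG" loop can look like against a local llama-server (the endpoint, port, and note paths are assumptions; a real setup would add proper chunking and embeddings):

  # Grab candidate passages with plain text search, then stuff them into the prompt
  CONTEXT=$(rg -n "flood" ~/notes | head -n 20)
  curl -s http://localhost:8080/completion \
    -d "$(jq -n --arg ctx "$CONTEXT" \
          '{prompt: ("Answer using only these notes:\n" + $ctx + "\n\nQuestion: what flood events did I record?"), n_predict: 256}')"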

By @suprjami - 13 days
Llama.cpp has Vulkan text generation, so you can use the GPU without any special drivers.

It's at least 10x faster than CPU generation and turns small models (up to 7B parameters) into an experience as fast as any of the commercial services.
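
For reference, the Vulkan backend is enabled at build time, roughly like this (flag names as I understand them, so check the llama.cpp docs; the model path is a placeholder):

  cmake -B build -DGGML_VULKAN=ON
  cmake --build build --config Release
  # -ngl offloads that many layers to the GPU; a large number offloads everything
  ./build/bin/llama-cli -m models/your-model.gguf -ngl 99 -p "Hello"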

By @ww520 - 13 days
Thanks for the wonderful article.

I've tried running models locally. I found that colocating the models on my computer/laptop took up so many resources that it impacted my work. My solution is to run the models on my home servers, since they can be served via HTTP, and then run a VPN to my home network to access them when I'm on the road. That actually works well, and it's scalable.
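
A minimal sketch of that setup with llama-server (the host, port, paths, and VPN address are placeholders):

  # On the home server: bind to all interfaces so VPN clients can reach it
  ./llama-server -m models/your-model.gguf --host 0.0.0.0 --port 8080
  # From the laptop, over the VPN:
  curl http://10.0.0.2:8080/completion -d '{"prompt": "Hello", "n_predict": 64}'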

By @2-3-7-43-1807 - 13 days

> Inference starts at a comfortable 30 t/s

Is this including the context? With a 1,000-token context and a 20-token instruction, does it take 1020/30 s or 20/30 s?

> Second, LLMs have goldfish-sized working memory. ... In practice, an LLM can hold several book chapters worth of comprehension “in its head” at a time. For code it’s 2k or 3k lines (code is token-dense).

That's not exactly goldfish-sized, and in fact very useful already.

> Third, LLMs are poor programmers. At best they write code at maybe an undergraduate student level who’s read a lot of documentation.

Exactly what I want for local code generation.

I think he's anti-hyping a little by pretending LLMs are in fact _not_ super-intelligent and whatnot. Sure, some people believe that, but come on... we're not at a McKinsey workshop here.

---

Any good German language models out there?

By @sowbug - 13 days
> There are tools like retrieval-augmented generation and fine-tuning to mitigate it… slightly.

On one hand, the imminent arrival of J.A.R.V.I.S. makes me wish I'd digitized more of my personal life. Keeping a daily journal for the past couple decades would have been an amazing corpus for training an intelligent personal LLM.

On the other hand, I often feel like I dodged a bullet by being born just before the era of social-media oversharing, meaning that not all evidence of my life is already online. I've assumed since Her came out that such a product would require giving up all your privacy to Big Tech, Inc.

Articles like this give me hope that there will someday be competent digital second brains that we can run entirely locally, and that it might be time to start that journal... but only in Notepad.

By @neves - 13 days
Nice article!

I'm always in doubt: on a Windows computer without a powerful GPU, is it better to run local models in WSL2 or directly on Windows? Does the fact that it's an ARM machine make any difference?

By @jamietanna - 13 days
To the author:

> which smashes the Turing test and can be .

Looks like an incomplete sentence?

By @trash_cat - 13 days
Good write-up.

Anyone want to give a definition of GGUF without using IBM's definition (also appears first in my search results)?

By @informal007 - 10 days
Flash attention is an additional, later technique for accelerating the attention computation; that's why it's not the default option in llama.cpp.
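
If I remember correctly it's opt-in at run time, something like the following (flag name from memory, so verify against --help):

  ./llama-server -m models/your-model.gguf --flash-attn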