July 28th, 2024

How to Run Llama 3 405B on Home Devices? Build AI Cluster

The article explains how to run the Llama 3.1 405B model on home devices using the Distributed Llama project, detailing setup, resource requirements, and methods for efficient execution across multiple devices.

Read original article

The article discusses how to run the Llama 3.1 405B model on home devices by building an AI cluster with the Distributed Llama project. Open LLM models like Llama can be run locally, which removes reliance on external providers, but large models demand significant memory and compute. Tensor parallelism is highlighted as a way to split matrix multiplications across multiple devices, although synchronization can become a bottleneck over slower home network connections.

Distributed Llama runs an LLM across multiple devices, using a root node for coordination and worker nodes for execution. To set up, users clone the Distributed Llama repository, connect the devices to a local network, and run the commands that configure each node. The root node also needs the Llama 3.1 model weights, which must be downloaded and converted to a compatible format, a step that requires significant disk space and time.

Once everything is configured, users run inference commands on the root node, specifying the worker nodes and other parameters. The article also covers running an API service and reducing RAM usage by keeping the KV cache on disk. Overall, the guide provides a comprehensive overview of the steps needed to run Llama 3.1 on a home AI cluster.
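For intuition, here is a minimal sketch of the column-split form of tensor parallelism in NumPy. It illustrates the general technique the article relies on, not Distributed Llama's actual implementation, and the shapes are arbitrary placeholders:

    import numpy as np

    def tensor_parallel_matmul(x, W, n_workers):
        # Split the weight matrix column-wise, one shard per worker.
        shards = np.array_split(W, n_workers, axis=1)
        # Each worker multiplies the same input by its own shard; the
        # products are independent, which is what makes the work parallel.
        partials = [x @ shard for shard in shards]
        # Gathering the partial outputs stands in for the network sync step.
        return np.concatenate(partials, axis=-1)

    x = np.random.rand(1, 4096)        # one token's hidden state
    W = np.random.rand(4096, 14336)    # a feed-forward projection
    assert np.allclose(tensor_parallel_matmul(x, W, 4), x @ W)

In a real cluster each shard lives on a different device, so the gather step means sending partial results over the network after every sharded layer, which is why a slow home Ethernet or Wi-Fi link can dominate inference time.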

Related

Run the strongest open-source LLM model: Llama3 70B with just a single 4GB GPU


The article discusses the release of the open-source Llama3 70B model, highlighting its performance compared to GPT-4 and Claude3 Opus. It emphasizes training enhancements, data quality, and the competition between open and closed-source models.

How to run an LLM on your PC, not in the cloud, in less than 10 minutes


You can easily set up and run large language models (LLMs) on your PC using tools like Ollama, LM Studio, and Llama.cpp. Ollama supports AMD GPUs and AVX2-compatible CPUs, with straightforward installation across different systems. It offers commands for managing models and now supports select AMD Radeon cards.

Llama 3.1 Official Launch


Meta introduces Llama 3.1, an open-source AI model available in 8B, 70B, and 405B versions. The 405B model is highlighted for its versatility in supporting various use cases, including multi-lingual agents and analyzing large documents. Users can leverage coding assistants, real-time or batch inference, and fine-tuning capabilities. Meta emphasizes open-source AI and offers subscribers updates via a newsletter.

Llama 3.1: Our most capable models to date


Meta has launched Llama 3.1 405B, an advanced open-source AI model supporting diverse languages and extended context length. It introduces new features like Llama Guard 3 and aims to enhance AI applications with improved models and partnerships.

Why Llama 3.1 is Important


Meta's new Llama 3.1 405B model prioritizes data sovereignty, open-source accessibility, cost savings, independence, and advanced customization. It aims to boost AI innovation by empowering companies with control and flexibility.

5 comments
By @FloatArtifact - 4 months
An interesting idea, but the article doesn't discuss benchmarks, CPU requirements, or any other hardware requirements beyond 230 GB of RAM for the cluster.

It seems impractical that a home would have 4 machines with 64 GB of RAM each dedicated to a distributed system. Max core count, 16 cores from consumer AMD CPUs? From a cost perspective, why not build a single system with 256 GB of RAM and an AMD Epyc CPU?

The only thing I can think of that pushes a need for a distributed system is multi-GPU across multiple systems.

By @atemerev - 4 months
But this is not a problem: if you have the money to buy 1TB of memory, you can easily get a motherboard that supports it for around $800-$1000. That will be cheaper than attempting to build a cluster with 4x256GB of RAM.

GPU inference is another matter, as high-VRAM GPUs are priced artificially high so that only corporations can afford them. However, if you attempt to build a cluster with, say, 10 4090s to obtain some 240GB of VRAM, you won't have enough electricity to run it at home.

I am currently building a 4x4090 rig, but that's probably the maximum I could have at home given my budget and available power restrictions. And that's only 96GB of VRAM, slightly more than a single A100.

By @gumboshoes - 4 months
A prediction: distributed LAN AI will soon be a normal network device, as simple as a media server to install and run. Querying your home AI instead of Alexa or Siri. Desktop integration. Hints of Tony Stark's Jarvis.

By @cryptoboid - 4 months
I love this. How does it compare to something like https://petals.dev/?