July 16th, 2024

Exo: Run your own AI cluster at home with everyday devices

The "exo" project on GitHub guides users in creating a home AI cluster with features like LLaMA support, dynamic model splitting, ChatGPT API, and MLX inference. Installation involves cloning the repository and installing requirements. iOS implementation may lag.


The GitHub repository for the "exo" project offers details on setting up an AI cluster at home using everyday devices. Key features include support for popular models like LLaMA, dynamic model splitting based on network and device resources, automatic device discovery, and a ChatGPT-compatible API. Installation is recommended from the source by cloning the repository and installing requirements via pip. Documentation includes examples for multi-device usage, a ChatGPT-like web interface on each device, and an API endpoint for model interaction. Supported inference engines are MLX, tinygrad, and llama.cpp, with networking support for GRPC modules. Notably, the iOS implementation is rapidly evolving but may lag behind the Python version. For more information, the GitHub repository provides comprehensive details.

Related

Run the strongest open-source LLM model: Llama3 70B with just a single 4GB GPU

The article discusses the release of the open-source Llama3 70B model, highlighting its performance compared to GPT-4 and Claude 3 Opus. It emphasizes training enhancements, data quality, and the competition between open- and closed-source models.

Gemma 2 on AWS Lambda with Llamafile

Google released Gemma 2 9B, a compact language model rivaling GPT-3.5. Mozilla's llamafile simplifies deploying models like LLaVA 1.5 and Mistral 7B Instruct, enhancing accessibility to powerful AI models across various systems.

MobileLLM: Optimizing Sub-Billion Parameter Language Models for On-Device Use

The GitHub repository contains MobileLLM code optimized for sub-billion parameter language models for on-device applications. It includes design considerations, code guidelines, results on common-sense reasoning tasks, acknowledgements, and licensing details. Contact the repository maintainers for support.

Karpathy: Let's reproduce GPT-2 (1.6B): one 8XH100 node 24h $672 in llm.c

The GitHub repository focuses on the "llm.c" project by Andrej Karpathy, aiming to implement Large Language Models in C/CUDA without extensive libraries. It emphasizes pretraining GPT-2 and GPT-3 models.

Distributed LLama3 Inference

The GitHub repository for `Cake` hosts a Rust implementation of LLama3 distributed inference, aiming to utilize consumer hardware for running large models across various devices. Instructions and details are available for setup and optimizations.

AI: What people are saying
The GitHub "exo" project for creating a home AI cluster generates mixed reactions.
  • Concerns about network bottlenecks and performance issues when running models over a home network.
  • Questions about the feasibility and practicality of using the project on non-Apple devices and the lack of benchmarks.
  • Discussions on the potential benefits of local AI compute for privacy and utilizing unused CPU resources.
  • Interest in the project's potential for crowdsourcing and collaborative model training.
  • Security and licensing concerns are raised by some users.
28 comments
By @ajnin - 6 months
It requires MLX, but that is an Apple-silicon-only library as far as I can tell. How is it supposed to run on (I quote) "iPhone, iPad, Android, Mac, Linux, pretty much any device"? Has it been tested on anything other than the author's MacBook?
By @dcreater - 6 months
This is a great idea and user friendly as well. It has the potential to turn multiple old devices from useless to useful overnight. However, I wish they had provided some results on tok/s and latency with a few example setups.
By @mg - 6 months

    This enables you to run larger models
    than you would be able to on any single
    device.
No further explanation on how this is supposed to work?

If some layers of the neural network are on deviceA and some layers are on deviceB, wouldn't that mean that for every token generated, all output data from the last layer on deviceA have to be transferred to deviceB?
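
For a rough sense of scale, here is a back-of-envelope sketch of the per-token traffic such a split implies: only the activation vector at the cut point has to cross the network for each generated token, not the weights. The hidden size and dtype are illustrative assumptions (roughly a Llama-3-70B-sized model in fp16), not figures from exo itself.

    # Back-of-envelope: data crossing the network per generated token when a
    # model is split at a layer boundary between deviceA and deviceB.
    # All numbers are illustrative assumptions.
    hidden_size = 8192       # hidden dimension of a 70B-class model
    bytes_per_value = 2      # fp16 activations
    splits = 1               # one cut between two devices

    bytes_per_token = hidden_size * bytes_per_value * splits
    print(f"{bytes_per_token / 1024:.0f} KiB per token")  # ~16 KiB

Under these assumptions the per-token payload is kilobytes rather than gigabytes; the heavier cost is the extra network hop added to every token's latency.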

By @pyinstallwoes - 6 months
Swarm compute should be the norm for all compute - so much unused CPU across all the devices we collectively own.
By @hagope - 6 months
I used to be excited about running models locally (LLMs, Stable Diffusion, etc.) on my Mac, PC, etc. But now I have resigned myself to the fact that most useful AI compute will mostly be in the cloud. Sure, I can run some slow Llama3 models on my home network, but why bother when it is so cheap or free to run them on a cloud service? I know Apple is pushing local AI models; however, I have serious reservations about the impact on battery performance.
By @matyaskzs - 6 months
Cloud cannot be beaten on compute / price, but moving to local could solve privacy issues and the world needs a second amendment for compute anyway.
By @cess11 - 6 months
I look forward to something similar being developed on top of Bumblebee and Axon, which I expect is just around the corner. Because, for me, Python does not spark joy.
By @Jayakumark - 6 months
Just got https://github.com/distantmagic/paddler working across 2 machines on Windows for load balancing. This will be next level and useful for running Llama 400B across multiple machines. But it looks like Windows support is not there yet.
By @fudged71 - 6 months
Since this is best over a local network, I wonder how easy you could make the crowdsourcing aspect of this. How could you make it simple enough for everyone who's physically in your office to join a network to train overnight? Or get everyone at a conference to scan a QR code to contribute to a domain-specific model?
By @makmanalp - 6 months
Question - if large clusters are reporting that they're seeing gains from using RDMA networks because communication overhead is a bottleneck, how is it possible that this thing is not massively bottlenecked running over a home network?
By @pierrefermat1 - 6 months
Would be great if we could get some benchmarks on commonly available hardware setups.
By @gnicholas - 6 months
This is great! I really wish Apple allowed your device to query a model you host instead of skipping to their cloud (or OpenAI). I'd love to have a Studio Pro running at home, and have my iPhone, iPad, Mac, and HomePod be able to access it instead of going to the cloud. That way I could have even more assured privacy, and I could choose what model I want to run.
By @christkv - 6 months
Is Apple silicon with a lot of memory (32 GB and up) still considered a cheapish way to run models, or are there other options now?
By @whoami730 - 6 months
Is it possible to use this for image recognition and the like? Not sure what the usage of this can be apart from as a chatbot.
By @Aerbil313 - 6 months
I can't wait to see malware that downloads and runs LLMs on command from a remote C&C server.
By @tarasglek - 6 months
This is the first time I've seen the tinygrad backend in the wild. Amusing that it's supposedly more stable than llama.cpp for this project.
By @throwaway2562 - 6 months
How long before the accursed crypto kids try to tokenise token generation with Exo clusters?
By @rbanffy - 6 months
The all important question:

When there’s only one device left on the network, will it sing Daisy Bell?

By @thom - 6 months
Bexowulf.
By @ulrischa - 6 months
Does somebody know if it runs on a Raspberry Pi?
By @pkeasjsjd - 6 months
It bothers me that they don't talk about security here, I don't like it at all.
By @iJohnDoe - 6 months
Anyone run this? Works?
By @Obertr - 6 months
Okay, I'll say it. It will not work because of network bottlenecks. You need to be sending gigabytes of data.

So by definition you need (1) good internet (20mb/s+) and (2) good devices.

This thing will not go any further than a cool demo on Twitter. Please prove me wrong.

By @throwawaymaths - 6 months
Is this sensible? Transformers are memory bandwidth bound. Schlepping activations around your home network (which is liable to be lossy) seems like it would result in atrocious TPS.
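
To put a hedged number on that worry, here is a crude sketch of the ceiling the network alone would impose on tokens per second, assuming one activation hand-off per generated token over a typical home link. The round-trip time, payload size, and bandwidth are assumptions for illustration; serialization, gRPC overhead, and Wi-Fi jitter would push the real figure lower.

    # Crude network-only ceiling on tokens/s for a two-device split.
    # All numbers are illustrative assumptions.
    rtt_s = 0.002                # ~2 ms round trip on a home LAN
    bytes_per_token = 16 * 1024  # one fp16 activation vector for a 70B-class model
    bandwidth_Bps = 100e6 / 8    # 100 Mbit/s link, in bytes per second

    per_token_network_s = rtt_s + bytes_per_token / bandwidth_Bps
    print(f"network ceiling ~ {1 / per_token_network_s:.0f} tokens/s")

Under these assumptions the network-imposed ceiling sits in the hundreds of tokens per second, so for single-stream inference the per-hop latency matters more than raw bandwidth.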
By @yjftsjthsd-h - 6 months
Unfortunately I don't see any licensing info, without which I'm not touching it. Which is too bad since the idea is really cool.