Serving 70B-Scale LLMs Efficiently on Low-Resource Edge Devices [pdf]
The paper presents TPI-LLM, a system for efficiently running 70B-scale LLMs on low-resource edge devices, reducing memory requirements by 90% and improving latency through tensor parallelism and local data handling.
The paper titled "TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices" addresses the challenges of running large language models (LLMs) on edge devices, which often have limited computing resources. The authors propose a new system called TPI-LLM that utilizes tensor parallelism, which they argue is more effective than pipeline parallelism for single-user scenarios. TPI-LLM keeps sensitive data local to the user's device and employs a sliding window memory scheduler to manage layer weights during inference, overlapping disk I/O latency with computation and communication. This approach allows 70B-scale models to run on devices with constrained memory. The study identifies link latency, rather than bandwidth, as the main communication bottleneck and introduces a star-based allreduce algorithm to mitigate it. Experimental results show that TPI-LLM significantly reduces time-to-first-token and token latency compared to existing frameworks, while decreasing the peak memory requirement for the Llama 2-70B model by 90%, to only 3.1 GB. The paper is currently under review.
- TPI-LLM is designed for efficient inference of 70B-scale LLMs on low-resource edge devices.
- The system utilizes tensor parallelism and a sliding window memory scheduler to optimize performance.
- It keeps sensitive data local to the user's device, addressing privacy concerns.
- TPI-LLM reduces time-to-first-token and token latency significantly compared to other frameworks.
- The peak memory requirement for Llama 2-70B is reduced by 90%, requiring only 3.1 GB of memory.
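The sliding window scheduler described above is what keeps peak memory low: only a small window of layer weights lives in RAM, and upcoming layers are read from disk while the current one computes. The snippet below is not the authors' code, just a minimal Python sketch of that overlap, with simulated loader and compute functions standing in for the real kernels.

```python
import threading
import time

# Hypothetical stand-ins for the real weight loader and layer kernel; here they
# only simulate disk latency and per-layer compute time.
def load_layer_weights(layer_idx):
    time.sleep(0.05)                            # pretend to read weights from disk
    return {"layer": layer_idx}

def run_layer(layer_idx, weights, hidden_states):
    time.sleep(0.03)                            # pretend to do the matmuls
    return hidden_states

def sliding_window_inference(hidden_states, num_layers, window=4):
    """Keep roughly `window` layers' weights in memory, prefetching upcoming
    layers in background threads while the current layer is computing."""
    cache, ready = {}, {}

    def prefetch(idx):
        cache[idx] = load_layer_weights(idx)    # disk I/O off the compute path
        ready[idx].set()

    def schedule(idx):
        if idx < num_layers and idx not in ready:
            ready[idx] = threading.Event()
            threading.Thread(target=prefetch, args=(idx,), daemon=True).start()

    for i in range(window):                     # warm up the first window
        schedule(i)
    for i in range(num_layers):
        schedule(i + window)                    # overlap: start loading layer i+window
        ready[i].wait()                         # blocks only if the disk falls behind
        hidden_states = run_layer(i, cache[i], hidden_states)
        del cache[i]                            # slide the window: drop used weights
    return hidden_states

if __name__ == "__main__":
    start = time.time()
    sliding_window_inference(hidden_states="x", num_layers=80, window=4)
    print(f"80 layers in {time.time() - start:.2f}s")   # ~2.5s overlapped vs ~6.4s serialized
```

With a disk (or simulated loader) that can keep up, the compute loop almost never blocks on `ready[i].wait()`, so end-to-end time approaches pure compute time even though only a handful of layers are resident at once.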
Related
Run the strongest open-source LLM model: Llama3 70B with just a single 4GB GPU
The article discusses the release of the open-source Llama3 70B model, highlighting its performance compared to GPT-4 and Claude3 Opus. It emphasizes training enhancements, data quality, and the competition between open- and closed-source models.
Benchmarking LLM Inference Back Ends: VLLM, LMDeploy, MLC-LLM, TensorRT-LLM, TGI
Selecting the right inference backend for large language models is crucial for user experience and cost efficiency. A benchmark study by BentoML compared various backends, highlighting LMDeploy's decoding performance, vLLM's low TTFT, and considerations beyond performance. BentoML and BentoCloud are recommended tools for efficient AI model deployment.
How to Run Llama 3 405B on Home Devices? Build AI Cluster
The article explains how to run the Llama 3.1 405B model on home devices using the Distributed Llama project, detailing setup, resource requirements, and methods for efficient execution across multiple devices.
How to evaluate performance of LLM inference frameworks
LLM inference frameworks face a "memory wall" limiting performance. Developers should choose frameworks wisely, apply optimizations cautiously, and structure applications for server or offline scenarios to enhance efficiency.
LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs
The paper presents an FPGA-based accelerator for large language models, achieving a 14.3-15.8x speedup and a 6.1x gain in power efficiency, enhancing deployment in resource-constrained environments.
They argue that right now computation and link latency (not bandwidth) dominate the cost of multi-node inference, and they pick a network topology (a star) that is savvy to that.
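For context on the star choice, leaving the paper's algorithmic details aside, the communication pattern is: each worker sends its partial tensor to a single hub, the hub sums the contributions, and the total goes back out, so every tensor crosses two hops no matter how many devices join. A minimal single-process NumPy sketch of that pattern:

```python
import numpy as np

def star_allreduce(partials):
    """Star-based allreduce pattern: every worker ships its partial tensor to a
    central hub, the hub sums the contributions, then sends the total back to
    all workers. Two hops per tensor, regardless of the number of devices."""
    # Hop 1: the hub accumulates each worker's partial result as it arrives.
    total = np.zeros_like(partials[0])
    for p in partials:
        total += p
    # Hop 2: the hub broadcasts the reduced tensor back to every worker.
    return [total.copy() for _ in partials]

# Example: 8 devices each holding a partial activation from a tensor-parallel matmul
# (the 4096-dim vector is an illustrative size, not the paper's setup).
partials = [np.random.randn(4096).astype(np.float32) for _ in range(8)]
reduced = star_allreduce(partials)
assert all(np.allclose(r, sum(partials)) for r in reduced)
```

A ring allreduce needs on the order of 2(N-1) sequential transfers per tensor, so its cost is dominated by per-hop latency as the device count grows; the star trades some hub bandwidth for a constant hop count, which is the right trade when link latency, not bandwidth, is the bottleneck.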
That said, it's 26-29 seconds per token for Llama 2-70B with their 8 edge devices, each using 4 GB of RAM. It's amazing that they can run it at all, but this isn't going to be viable at the edge with current hardware.
I think the paper makes the case that you could probably recruit, say, your 30 graphics workstations to do much faster inference without just nailing your LAN bandwidth, though.
Upshot: an interesting paper with smart ideas. Large frontier models still need very exotic hardware and high-bandwidth interconnects, but this may point a way forward on the interconnect part of the story.
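A rough back-of-envelope supports the latency-over-bandwidth point. The numbers below are assumptions, not from the paper: Megatron-style tensor parallelism with two allreduces per transformer layer, fp16 activations, and Llama 2-70B's 80 layers and 8192 hidden size, decoding one token at a time.

```python
# Back-of-envelope: per-token communication for tensor-parallel Llama 2-70B decode.
# Assumptions (not from the paper): Megatron-style TP with 2 allreduces per layer,
# fp16 activations, batch size 1.
layers = 80                 # Llama 2-70B transformer layers
hidden = 8192               # Llama 2-70B hidden size
bytes_per_elem = 2          # fp16
allreduces_per_layer = 2    # one after attention, one after the MLP

syncs_per_token = layers * allreduces_per_layer
payload_per_sync = hidden * bytes_per_elem                 # ~16 KB
mb_per_token = syncs_per_token * payload_per_sync / 1e6    # ~2.6 MB

print(f"{syncs_per_token} allreduces/token, ~{payload_per_sync / 1024:.0f} KB each, "
      f"~{mb_per_token:.1f} MB total per token")
# Even at 5 ms of round-trip latency per allreduce, that's ~0.8 s/token spent
# just waiting on the network -- bandwidth is nowhere near the limit on a LAN.
```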
Most of the rewards will be reaped by consumers rather than providers.
We're also in an age where RAM capacities in consumer devices were chosen almost entirely for workloads that predate LLMs. I find it highly likely vendors will prioritize higher RAM capacity over other things in future hardware.
How long until a 256GB RAM laptop (shared with GPU) is reasonably cheap/available? I give it a few years at most.
It's possible that models grow orders of magnitude larger, but I find it more likely that model sizes will track falling training costs and hardware improvements. There will be a sweet spot where it's economical to train larger models, and private companies won't push much beyond that.
LocalLLAMA and the open-weights and open-dataset scene have really helped show that this can be done if you have enough resources and motivation.