October 3rd, 2024

Serving 70B-Scale LLMs Efficiently on Low-Resource Edge Devices [pdf]

The paper presents TPI-LLM, a system for efficiently running 70B-scale LLMs on low-resource edge devices, cutting peak memory requirements by 90% and improving latency through tensor parallelism, a sliding window memory scheduler, and local data handling.

Read original article

The paper titled "TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices" addresses the challenges of running large language models (LLMs) on edge devices, which often have limited computing resources. The authors propose a new system called TPI-LLM that utilizes tensor parallelism, which they argue is more effective than pipeline parallelism for single-user scenarios. TPI-LLM is designed to keep sensitive data local to the user's device and employs a sliding window memory scheduler to manage layer weights during inference, thereby overlapping disk I/O latency with computation and communication. This approach allows for the efficient operation of 70B-scale models on devices with constrained memory. The study identifies link latency as a significant communication bottleneck and introduces a star-based allreduce algorithm to mitigate this issue. Experimental results show that TPI-LLM significantly reduces time-to-first-token and token latency compared to existing frameworks, while also decreasing the peak memory requirement for the Llama 2-70B model by 90%, needing only 3.1 GB of memory. The paper is currently under review.
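
The sliding window memory scheduler is what makes the 3.1 GB footprint possible: only a small window of transformer blocks is resident in memory at any time, and background threads prefetch upcoming blocks' weights from disk while the current block computes, so disk I/O is hidden behind computation. Below is a minimal sketch of that idea; the class, the one-file-per-block weight layout, and the window size are illustrative assumptions, not the paper's actual implementation.

```python
import threading

import torch


class SlidingWindowScheduler:
    """Keep at most `window` transformer blocks' weights in RAM at once.

    Illustrative layout: each block's weights live in their own file on disk
    (layer_paths), and each block is consumed exactly once per forward pass.
    """

    def __init__(self, layer_paths, window=4):
        self.layer_paths = layer_paths
        self.window = window
        self.cache = {}                      # block index -> state_dict
        self.ready = threading.Condition()

    def _load(self, idx):
        weights = torch.load(self.layer_paths[idx], map_location="cpu")
        with self.ready:
            self.cache[idx] = weights
            self.ready.notify_all()

    def prefetch(self, idx):
        # Disk I/O runs in a background thread, overlapping with compute.
        threading.Thread(target=self._load, args=(idx,), daemon=True).start()

    def take(self, idx):
        # Block only if the prefetch hasn't finished yet; pop the entry so the
        # weights are freed as soon as the block has been used.
        with self.ready:
            while idx not in self.cache:
                self.ready.wait()
            return self.cache.pop(idx)


def run_layers(hidden, blocks, sched):
    """Layer-by-layer inference with a prefetch window ahead of the compute pointer."""
    for i in range(min(sched.window, len(blocks))):
        sched.prefetch(i)
    for i, block in enumerate(blocks):
        block.load_state_dict(sched.take(i))  # blocks are pre-constructed nn.Modules
        hidden = block(hidden)
        if i + sched.window < len(blocks):
            sched.prefetch(i + sched.window)
    return hidden
```

With a window of only a few blocks, peak weight memory is roughly window/num_layers of the full model, which is how a 70B-parameter model can run in a few gigabytes per device as long as disk reads keep pace with compute.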

- TPI-LLM is designed for efficient inference of 70B-scale LLMs on low-resource edge devices.

- The system utilizes tensor parallelism and a sliding window memory scheduler to optimize performance.

- It keeps sensitive data local to the user's device, addressing privacy concerns.

- TPI-LLM reduces time-to-first-token and token latency significantly compared to other frameworks.

- The peak memory requirement for Llama 2-70B is reduced by 90%, requiring only 3.1 GB of memory.
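
On the communication side, the paper identifies link latency, rather than bandwidth, as the bottleneck on typical home and office networks, which is why a ring allreduce (many sequential hops) loses out to a star pattern (two hops: workers to a hub, hub back to workers). The sketch below illustrates that pattern with torch.distributed collectives; the gather/broadcast pair and the backend choice are assumptions for illustration, not the paper's exact protocol.

```python
import torch
import torch.distributed as dist


def star_allreduce(partial: torch.Tensor, hub: int = 0) -> torch.Tensor:
    """Sum tensor-parallel partial results in two latency-bound steps.

    Every worker sends its partial result to the hub, the hub sums them, and
    the total is broadcast back. A ring allreduce would instead take
    O(world_size) sequential hops, each paying the full link latency.
    """
    rank = dist.get_rank()
    world = dist.get_world_size()

    if rank == hub:
        parts = [torch.empty_like(partial) for _ in range(world)]
        dist.gather(partial, gather_list=parts, dst=hub)   # hop 1: all -> hub
        total = torch.stack(parts).sum(dim=0)
    else:
        dist.gather(partial, dst=hub)                      # hop 1: send to hub
        total = torch.empty_like(partial)

    dist.broadcast(total, src=hub)                         # hop 2: hub -> all
    return total
```

In practice each rank would first call dist.init_process_group("gloo", ...) with the hub reachable over TCP, and `partial` would be that device's slice of an attention or MLP output in a tensor-parallel layer.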

Related

Run the strongest open-source LLM model: Llama3 70B with just a single 4GB GPU

The article discusses the release of the open-source Llama3 70B model, highlighting its performance compared to GPT-4 and Claude 3 Opus. It emphasizes training enhancements, data quality, and the competition between open- and closed-source models.

Benchmarking LLM Inference Back Ends: VLLM, LMDeploy, MLC-LLM, TensorRT-LLM, TGI

Selecting the right inference backend for large language models is crucial for user experience and cost efficiency. A benchmark study by BentoML compared various backends, highlighting LMDeploy's decoding performance, vLLM's low TTFT, and considerations beyond performance. BentoML and BentoCloud are recommended tools for efficient AI model deployment.

How to Run Llama 3 405B on Home Devices? Build AI Cluster

The article explains how to run the Llama 3.1 405B model on home devices using the Distributed Llama project, detailing setup, resource requirements, and methods for efficient execution across multiple devices.

How to evaluate performance of LLM inference frameworks

LLM inference frameworks face a "memory wall" limiting performance. Developers should choose frameworks wisely, apply optimizations cautiously, and structure applications for server or offline scenarios to enhance efficiency.

LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs

The paper presents an FPGA-based accelerator for large language models, achieving 14.3-15.8 times speedup and 6.1 times power efficiency, enhancing deployment in resource-constrained environments.

7 comments
By @vessenes - about 2 months
This is not a memory-reduction technique that's somehow magical, though it does manage memory with some clever scheduling. The core of the idea is that you can schedule inference across edge nodes in a memory- and bandwidth-optimized way that's a bit different from just splitting layers.

They propose that right now computation and latency dominate the costs for multi-node inference, and pick a network topology (star) that is savvy to that.

That said, it's 26-29 seconds per token for llama2-70b with their 8 edge devices, each using 4 gigs of RAM. It's amazing that they can run it at all, but this isn't going to be viable at the edge with current hardware.

I think the paper makes the case that you could probably recruit, say, 30 graphics workstations to do much faster inference without just nailing your LAN bandwidth, though.

Upshot: interesting paper with smart ideas. Large frontier models still need very exotic hardware and high-bandwidth interconnects, but this may point a way forward on the bandwidth-interconnect part of the story.

By @adam_arthur - about 2 months
While I do think there's going to be a huge market for cloud-based LLM serving, the fact that consumer hardware can run close-to-SOTA models fairly easily (e.g. a high-RAM MBP config) suggests to me that the provider market won't be as big as investors are betting on.

Most of the rewards will be reaped by consumers rather than providers.

We're also in an age where the current levels of RAM in consumer devices were optimized for workloads that predate LLMs. I find it highly likely vendors will prioritize higher RAM capacity over other features in future hardware.

How long until a 256GB RAM laptop (shared with GPU) is reasonably cheap/available? I give it a few years at most.

It's possible that models grow orders of magnitude larger, but I find it more likely that model sizes will grow along the curve of falling training costs and hardware improvements. There will be a sweet spot where it's economical to train larger models, and private companies won't push much beyond that.

By @loufe - about 2 months
It would be nice for the inference time to be paired with a measure of output quality. I'm not well versed in how the architecture works, but I have a hard time believing a 90% reduction in peak memory footprint comes cost-free.

By @Zetaphor - about 2 months
Is this different from (or related to) the work being done by the exo project?

https://github.com/exo-explore/exo

By @tonetegeatinst - about 2 months
While training seems to be out of reach for the average tech user unless they have a data center for a homelab or a very large income, SOTA models can easily be run on edge devices, whether a phone or a dedicated computer/server.

LocalLLaMA, open weights, and open datasets have really helped show that this can be done if you have enough resources and motivation.

By @dvh - about 2 months
So when will I be able to "sudo apt-get install llm"?

By @tgtweak - about 2 months
Is there a CUDA implementation of this... asking for a friend