October 15th, 2024

Running LLMs with 3.3M Context Tokens on a Single GPU

DuoAttention speeds up long-context LLM inference by splitting attention heads into full-cache retrieval heads and constant-cache streaming heads, cutting KV-cache memory and latency with minimal accuracy loss; implementation code is available for research.


DuoAttention is a framework designed to make long-context large language model (LLM) inference more efficient by reducing both memory usage and latency. Caching full Key and Value (KV) states for every attention head is a major source of memory consumption and slowdown in long-context inference. DuoAttention addresses this by distinguishing two types of attention heads: Retrieval Heads, which need full attention over the entire context, and Streaming Heads, which mainly attend to recent tokens and can operate with a lightweight, constant-length KV cache. This split reduces memory usage by up to 2.55x for Multi-Head Attention (MHA) models and 1.67x for Grouped-Query Attention (GQA) models, while accelerating decoding by up to 2.18x and pre-filling by up to 1.73x, with minimal accuracy loss. Combined with quantization, the approach enables Llama-3-8B to handle a context length of 3.3 million tokens on a single A100 GPU. The authors provide code for the implementation, making it accessible for further research and application.
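To make the head split concrete, here is a minimal sketch of the per-head KV-cache policy described above: retrieval heads keep the full cache, while streaming heads keep only a few initial "attention sink" tokens plus a recent window. The function name, tensor layout, and the `num_sink`/`window` defaults are illustrative assumptions, not the repo's actual API.

```python
import torch

def trim_kv_cache(keys, values, is_retrieval_head, num_sink=4, window=256):
    """Hypothetical per-head KV trimming in the spirit of DuoAttention.

    keys, values: tensors of shape [num_heads, seq_len, head_dim]
    is_retrieval_head: bool tensor of shape [num_heads]
    Returns per-head lists, since trimmed lengths differ across heads.
    """
    kept_k, kept_v = [], []
    for h in range(keys.shape[0]):
        if is_retrieval_head[h] or keys.shape[1] <= num_sink + window:
            # Retrieval head: keep the full cache, O(seq_len) memory.
            kept_k.append(keys[h])
            kept_v.append(values[h])
        else:
            # Streaming head: constant-length cache of the first
            # `num_sink` (attention sink) tokens plus the last `window`.
            kept_k.append(torch.cat([keys[h, :num_sink], keys[h, -window:]], dim=0))
            kept_v.append(torch.cat([values[h, :num_sink], values[h, -window:]], dim=0))
    return kept_k, kept_v
```

In the paper, the retrieval/streaming classification is learned offline with an optimization-based procedure; the sketch above only illustrates what happens at inference time once that classification exists.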

- DuoAttention optimizes long-context LLM inference by differentiating between Retrieval and Streaming Heads.

- The framework cuts KV-cache memory by up to 2.55x for MHA models and 1.67x for GQA models.

- Decoding and pre-filling processes are accelerated by up to 2.18x and 1.73x, respectively.

- Minimal accuracy loss is observed compared to traditional full attention methods.

- The method supports multi-million-token context lengths on a single GPU; a rough memory estimate follows below.
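As a back-of-envelope illustration of why the head split matters at 3.3M tokens, the estimate below uses Llama-3-8B's GQA geometry (32 layers, 8 KV heads, head dim 128). The 25% retrieval-head fraction, the 4-bit KV quantization, and the sink/window sizes are assumptions for illustration, not figures reported in the paper.

```python
# Rough KV-cache memory estimate for Llama-3-8B (GQA geometry).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
CTX = 3_300_000  # target context length in tokens

def kv_bytes(tokens, bytes_per_elem, head_frac=1.0):
    # Factor of 2 accounts for storing both keys and values.
    return 2 * LAYERS * KV_HEADS * head_frac * HEAD_DIM * tokens * bytes_per_elem

full_fp16 = kv_bytes(CTX, 2)                        # all heads, fp16
duo_retrieval = kv_bytes(CTX, 0.5, head_frac=0.25)  # assumed 25% retrieval heads, 4-bit
duo_streaming = kv_bytes(4 + 256, 0.5, head_frac=0.75)  # constant sink + window cache

print(f"full fp16 cache:        {full_fp16 / 1e9:,.0f} GB")
print(f"DuoAttention + 4-bit:   {(duo_retrieval + duo_streaming) / 1e9:,.0f} GB")
```

Under these assumptions, a full fp16 cache alone would need roughly 430 GB, far beyond a single 80 GB A100, while trimming most heads to a constant-length cache and quantizing the rest brings the total to around 27 GB, which is consistent with the paper's single-GPU claim.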

3 comments
By @charlie_xxx - 6 months
Their demo looks really cool: https://github.com/mit-han-lab/duo-attention