Efficient Execution of Structured Language Model Programs
SGLang is a new system for executing complex language model programs, featuring a frontend language and runtime optimizations. It delivers up to 6.4 times higher throughput than state-of-the-art inference systems and is publicly available for further exploration.
SGLang is a newly introduced system for the efficient execution of complex language model programs, addressing the growing need to program and run applications built on large language models (LLMs). The system combines a frontend language, which simplifies programming with primitives for generation and parallelism control, with a runtime that accelerates execution through novel optimizations. Notable features include RadixAttention, which enables key-value cache reuse across calls, and compressed finite state machines, which speed up structured output decoding such as constrained JSON generation.

Experimental results indicate that SGLang achieves up to 6.4 times higher throughput than state-of-the-art inference systems across a range of tasks, including agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat.

The authors, including Lianmin Zheng and ten others, have released the code publicly, encouraging further exploration and application of SGLang at the intersection of artificial intelligence and programming languages. The paper was submitted on December 12, 2023, and revised on June 6, 2024, reflecting ongoing development of the system.
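To make the frontend concrete, here is a minimal sketch of an SGLang program in the style of the project's public examples: sgl.gen marks generation calls, fork runs branches in parallel, and the shared prompt prefix is where the runtime's cache reuse pays off. The function names and the local endpoint URL follow the open-source release but are assumptions here and may differ across versions.

```python
import sglang as sgl

@sgl.function
def tip_expander(s):
    # A shared prompt prefix; its KV cache can be reused by both branches below.
    s += "Here are two tips for staying healthy: 1. Balanced Diet. 2. Regular Exercise.\n\n"
    # Parallelism primitive: fork the state into two branches that decode concurrently.
    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += f"Now, expand tip {i + 1} into a paragraph:\n"
        f += sgl.gen("detail", max_tokens=128, stop="\n\n")  # generation primitive
    # Join the branches and generate a final summary.
    s += "Tip 1: " + forks[0]["detail"] + "\nTip 2: " + forks[1]["detail"] + "\n"
    s += "In summary, " + sgl.gen("summary", max_tokens=64)

# Assumes a local SGLang runtime is already serving a model on this port.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = tip_expander.run()
print(state["summary"])
```

Because both branches extend the same prefix, the runtime only needs to prefill the shared prompt once before the parallel generations proceed.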
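The paper itself does not contain the following snippet; it is only a toy illustration of the idea behind RadixAttention's key-value cache reuse. It uses a plain trie rather than the paper's compressed radix tree, and it omits eviction and GPU memory management entirely.

```python
from typing import Dict, List


class PrefixNode:
    """One token of a cached prompt; kv_slot stands in for that token's KV-cache entry."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixNode"] = {}
        self.kv_slot: int = -1


class PrefixCache:
    """Toy prefix cache: shared prompt prefixes are prefilled once and reused."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def insert(self, tokens: List[int], kv_slots: List[int]) -> None:
        """Record the KV-cache slots of a fully processed token sequence."""
        node = self.root
        for token, slot in zip(tokens, kv_slots):
            node = node.children.setdefault(token, PrefixNode())
            node.kv_slot = slot

    def match_prefix(self, tokens: List[int]) -> List[int]:
        """Return KV slots for the longest cached prefix of a new request."""
        node, slots = self.root, []
        for token in tokens:
            child = node.children.get(token)
            if child is None:
                break
            slots.append(child.kv_slot)
            node = child
        return slots


# A second request that shares a 3-token prefix reuses those KV entries
# and only needs to prefill its remaining tokens.
cache = PrefixCache()
cache.insert([1, 2, 3, 4], kv_slots=[100, 101, 102, 103])
print(cache.match_prefix([1, 2, 3, 9, 9]))  # [100, 101, 102]
```

The paper's RadixAttention goes well beyond this sketch, maintaining a radix tree with eviction and scheduling policies; the toy only shows why matching prompt prefixes avoids redundant prefill work.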