70% Size, 100% Accuracy: Lossless LLM Compression via Dynamic-Length Float
The paper presents DFloat11, a lossless compression framework that reduces large language model size by about 30%, produces bit-for-bit identical outputs, improves inference throughput, and enables models as large as Llama-3.1-405B to run on a single multi-GPU node.
The paper titled "70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float" introduces a compression framework called Dynamic-Length Float (DFloat11) that reduces the size of large language models (LLMs) by about 30% while preserving bit-for-bit identical outputs. The authors, Tianyi Zhang and colleagues, observe that the BFloat16 weight representation used by LLMs is information-inefficient, and DFloat11 exploits this through entropy coding: weights are assigned dynamic-length encodings based on their frequency, achieving near information-optimal compression without any loss of precision. To support efficient inference, the authors develop a custom GPU kernel for fast online decompression, with strategies such as compacting memory-intensive lookup tables and decompressing at the granularity of transformer blocks to minimize latency. Experiments show that DFloat11 not only shrinks model size but also delivers 1.9-38.8 times higher token-generation throughput than offloading parts of an uncompressed model to the CPU. It also allows longer context lengths within a fixed GPU memory budget and enables lossless inference of very large models, such as Llama-3.1-405B, on a single node with multiple GPUs. The authors provide access to their code and models for further research and application.
- DFloat11 reduces LLM size by 30% while preserving accuracy.
- The framework utilizes dynamic-length encodings based on weight frequency.
- It significantly improves token-generation throughput compared to offloading parts of an uncompressed model to the CPU.
- The method allows for longer context lengths within the same GPU memory constraints.
- Lossless inference of large models is achievable on a single node with multiple GPUs.
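To make the entropy-coding idea above concrete, here is a minimal sketch (not the authors' implementation or storage format; it assumes PyTorch and a Gaussian-like weight tensor) that Huffman-codes the 8-bit exponent field of each BFloat16 weight, keeps the sign and mantissa bits verbatim, and reports the resulting size ratio:

```python
# Minimal sketch of frequency-based entropy coding of BFloat16 exponents.
# Illustrative only -- not the authors' DFloat11 kernel or storage format.
import heapq
from collections import Counter

import torch

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    heap = [(f, i, (s,)) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                    # degenerate single-symbol case
        return {next(iter(freqs)): 1}
    lengths = {s: 0 for s in freqs}
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1               # each merge adds one bit to its members
        heapq.heappush(heap, (f1 + f2, len(freqs) + len(heap), syms1 + syms2))
    return lengths

def compressed_size_ratio(weight: torch.Tensor) -> float:
    """Compressed/original size when only the 8-bit exponent is entropy coded."""
    bits = weight.to(torch.bfloat16).view(torch.int16).to(torch.int32) & 0xFFFF
    exponents = ((bits >> 7) & 0xFF).flatten().tolist()   # BF16: 1 sign, 8 exp, 7 mantissa
    lengths = huffman_code_lengths(Counter(exponents))
    coded_exponent_bits = sum(lengths[e] for e in exponents)
    kept_bits = 8 * len(exponents)        # sign + mantissa bits stored verbatim
    return (kept_bits + coded_exponent_bits) / (16 * len(exponents))

# Gaussian-like weights give a ratio near 0.7.
w = torch.randn(1024, 1024)
print(f"size ratio ≈ {compressed_size_ratio(w):.2f}")
```

Because the exponent field carries only roughly 2-3 bits of entropy while the sign and mantissa bits are close to incompressible, the ratio lands near 0.7, which is where the title's "70% size" comes from.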
Related
LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs
The paper presents an FPGA-based accelerator for large language models, achieving 14.3-15.8 times speedup and 6.1 times power efficiency, enhancing deployment in resource-constrained environments.
Serving 70B-Scale LLMs Efficiently on Low-Resource Edge Devices [pdf]
The paper presents TPI-LLM, a system for efficiently running 70B-scale LLMs on low-resource edge devices, reducing memory requirements by 90% and improving latency through tensor parallelism and local data handling.
Fast LLM Inference From Scratch (using CUDA)
The article outlines the development of a C++ and CUDA-based LLM inference engine, emphasizing optimizations for single-GPU throughput, memory bandwidth importance, and benchmarking against existing engines for improved performance.
SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators
Apple's SeedLM is a post-training compression method for large language models that reduces runtime costs, optimizes compute cycles, and maintains performance at high compression levels without requiring calibration data.
Pushing the Limits of LLM Quantization via the Linearity Theorem
The paper presents advancements in large language model quantization, introducing a linearity theorem, a new data-free method called HIGGS, and improved non-uniform quantization, enhancing accuracy-compression trade-offs on various models.
- Many users highlight the significant practical benefits of enabling lossless inference for large models, particularly for research labs and startups.
- There is a discussion about the efficiency of DFloat11 compared to existing methods, with some users questioning its performance at smaller batch sizes.
- Several comments express excitement about the rapid advancements in machine learning and the potential for further optimizations in hardware.
- Concerns are raised about the specific applicability of DFloat11, particularly regarding its effectiveness with non-BFloat16 models.
- Some users seek clarification on the term "lossless" in the context of DFloat11, questioning its conventional meaning in compression.
Typical entropy of bfloat16 values seen in weights (and activations) is about 10-12 bits (only 65-75% or so of the value range is used in practice). Sign and mantissa bits tend to be incompressible noise.
This has been exploited several times before in the context of both classical HPC and AI, with lossless compression work from Martin Burtscher's lab (https://userweb.cs.txstate.edu/~burtscher/), fpzip from LLNL (https://computing.llnl.gov/projects/fpzip), and my library dietgpu from 2021 (https://github.com/facebookresearch/dietgpu). We used dietgpu to speed up training on a large GPU cluster by about 10% wall-clock time overall by losslessly compressing all data prior to send and decompressing upon receive (e.g., gradients, weights from backup, etc.); since the compression is lossless, the cluster still computes exactly the same thing as before.
Also, rANS is more efficient and easier to implement in SIMD-like instruction sets than Huffman coding. It would also reduce DFloat11's latency/throughput penalties (since we have to decompress before we do the arithmetic).
* I work with xmad.ai
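For readers unfamiliar with rANS, below is a minimal arbitrary-precision sketch in Python of the encode/decode recurrences the comment above refers to. The function names are ours, real codecs use a renormalized, table-driven variant, and this is not the DFloat11 decoder (which is Huffman-based):

```python
# Minimal arbitrary-precision rANS sketch (illustrative only, not a production codec).
from collections import Counter

def build_tables(symbols):
    """Integer frequency table, cumulative offsets, and total count."""
    freq = Counter(symbols)
    cum, total = {}, 0
    for s in sorted(freq):
        cum[s] = total
        total += freq[s]
    return freq, cum, total

def rans_encode(symbols, freq, cum, total):
    """Fold the whole message into one big integer state."""
    x = 1
    for s in reversed(symbols):          # encode in reverse so decoding runs forward
        f, c = freq[s], cum[s]
        x = (x // f) * total + c + (x % f)
    return x

def rans_decode(x, n, freq, cum, total):
    """Recover exactly n symbols from the integer state."""
    slot_to_sym = {}
    for s, c in cum.items():             # slot -> symbol lookup table
        for slot in range(c, c + freq[s]):
            slot_to_sym[slot] = s
    out = []
    for _ in range(n):
        slot = x % total
        s = slot_to_sym[slot]
        out.append(s)
        x = freq[s] * (x // total) + slot - cum[s]
    return out

# Round trip over some example 8-bit exponent values.
msg = [126, 127, 127, 125, 128, 127, 126, 127]
freq, cum, total = build_tables(msg)
state = rans_encode(msg, freq, cum, total)
assert rans_decode(state, len(msg), freq, cum, total) == msg
print(f"{len(msg) * 8} raw bits -> ~{state.bit_length()} bits of rANS state")
```

The decoder's inner loop is a modulo, a table lookup, and a state update, with no bit-by-bit code traversal, which is the commenter's point about rANS mapping well onto SIMD- and GPU-style execution.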
The context length alone probably makes it worthwhile even if your models fit in memory, but I'm curious whether it improves tokens/sec even when everything stays on the GPU, since in my very amateur understanding LLMs tend to be constrained by memory bandwidth?
I see it mentioned but can’t understand if it’s based on it or different/better…
/s I'll show myself out
>achieving near information-optimal compression without any loss of precision
So perhaps more lossless as in didn't lose perplexity/benchmarks?
In my mind lossless is precisely zero bits lost along the way.
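That is also the sense in which the paper uses the word: the decompressed weights are bit-for-bit identical to the original BFloat16 weights, so the model computes exactly the same outputs. A quick sketch of the corresponding check (none of this is DFloat11 code; the 8-bit round trip is included only as a lossy contrast):

```python
import torch

def bitwise_equal(a: torch.Tensor, b: torch.Tensor) -> bool:
    """True iff two BFloat16 tensors have identical bit patterns."""
    return torch.equal(a.view(torch.int16), b.view(torch.int16))

w = torch.randn(1024, 1024).to(torch.bfloat16)

# Lossless means any compress/decompress round trip must reproduce every bit;
# a plain copy stands in for such a round trip here.
print(bitwise_equal(w, w.clone()))            # True

# A lossy round trip, e.g. naive 8-bit quantization, does not.
scale = w.abs().max().float() / 127
lossy = ((w.float() / scale).round().clamp(-127, 127) * scale).to(torch.bfloat16)
print(bitwise_equal(w, lossy))                # False: bits (and precision) were lost
```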