SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators
Apple's SeedLM is a post-training compression method for large language models that reduces runtime costs, optimizes compute cycles, and maintains performance at high compression levels without requiring calibration data.
SeedLM is a new post-training compression method from Apple Machine Learning Research aimed at reducing the high runtime costs of deploying large language models (LLMs). The technique encodes and compresses model weights into seeds of a pseudo-random generator. During inference, a Linear Feedback Shift Register (LFSR) expands each seed into a pseudo-random matrix, which is combined with a small set of compressed coefficients to reconstruct the corresponding weight block. Regenerating weights this way reduces memory access in exchange for extra compute, speeding up memory-bound tasks. Unlike existing methods that require calibration data, SeedLM is data-free and generalizes well across tasks. Experiments on the Llama 3 70B model show that SeedLM retains zero-shot accuracy at 4- and 3-bit compression levels, performing comparably to or better than leading methods while matching FP16 baselines. Tests on FPGA platforms further show that 4-bit SeedLM approaches a 4x speed-up over FP16 Llama 2/3 models as model size increases.
- SeedLM compresses LLM weights using pseudo-random generator seeds.
- It reduces memory access and speeds up inference by optimizing compute cycles.
- The method is data-free and generalizes well across diverse tasks.
- Experiments show strong performance retention at high compression levels.
- FPGA tests indicate significant speed improvements over traditional FP16 models.
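To make the reconstruction step concrete, here is a minimal illustrative sketch in Python: an LFSR expands a stored seed into a pseudo-random matrix, and a small vector of coefficients mixes its columns to rebuild a weight block. The LFSR width and tap positions, block size, latent dimension, exhaustive seed search, and unquantized coefficients below are simplifying assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def lfsr_bits(seed, n_bits, width=16, taps=(16, 14, 13, 11)):
    """Generate a pseudo-random bit stream from a Fibonacci LFSR.

    The register width and tap positions are illustrative choices,
    not necessarily the configuration used in the paper.
    """
    state = seed & ((1 << width) - 1)
    assert state != 0, "LFSR state must be non-zero"
    out = np.empty(n_bits, dtype=np.float64)
    for i in range(n_bits):
        out[i] = state & 1
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1  # XOR the tapped bits
        state = (state >> 1) | (fb << (width - 1))
    return out

def lfsr_matrix(seed, rows, cols):
    """Expand a seed into a {-1, +1} pseudo-random matrix U(seed)."""
    return (2.0 * lfsr_bits(seed, rows * cols) - 1.0).reshape(rows, cols)

def compress_block(w, latent_dim=3, seed_bits=8):
    """Search candidate seeds; for each, fit coefficients t by least squares
    so that U(seed) @ t approximates the weight block w, and keep the best fit.

    A real implementation would search a larger seed space and quantize t
    to a few bits; both are simplified here for clarity.
    """
    best = None
    for seed in range(1, 1 << seed_bits):
        U = lfsr_matrix(seed, w.size, latent_dim)
        t, *_ = np.linalg.lstsq(U, w, rcond=None)
        err = float(np.linalg.norm(U @ t - w))
        if best is None or err < best[0]:
            best = (err, seed, t)
    return best[1], best[2]

def decompress_block(seed, t, block_size):
    """Inference-time reconstruction: regenerate U(seed) on the fly and
    compute w_hat = U(seed) @ t instead of fetching full-precision weights."""
    return lfsr_matrix(seed, block_size, t.size) @ t

# Toy round trip on a single block of 8 weights.
block = np.random.randn(8)
seed, t = compress_block(block)
print("seed:", seed)
print("original:     ", np.round(block, 3))
print("reconstructed:", np.round(decompress_block(seed, t, block.size), 3))
```

The appeal for memory-bound decoding follows from the storage accounting: at roughly 4 bits per weight versus 16 bits for FP16, memory traffic drops about 4x, which lines up with the near-4x FPGA speed-up reported for large models.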
Related
LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs
The paper presents an FPGA-based accelerator for large language models, achieving 14.3-15.8 times speedup and 6.1 times power efficiency, enhancing deployment in resource-constrained environments.
Serving 70B-Scale LLMs Efficiently on Low-Resource Edge Devices [pdf]
The paper presents TPI-LLM, a system for efficiently running 70B-scale LLMs on low-resource edge devices, reducing memory requirements by 90% and improving latency through tensor parallelism and local data handling.
SmolLM2
SmolLM2 is a new family of lightweight language models from Hugging Face, available in three sizes, trained on 11 trillion tokens, and designed for on-device operation with accessible model weights.
What happens if we remove 50 percent of Llama?
Neural Magic launched Sparse Llama 3.1, a sparse model from Meta's Llama 3.1, achieving 98% accuracy recovery with 50% fewer parameters, optimized for NVIDIA GPUs, enhancing throughput and latency significantly.
SepLLM: Accelerate LLMs by Compressing One Segment into One Separator
SepLLM is a framework that enhances Large Language Models by compressing text into special tokens, reducing computational demands, improving inference speed, and achieving over 50% reduction in KV cache usage.
- Some commenters appreciate the innovative approach but question the effectiveness of the compression, noting limitations in quantization and tile size.
- There are suggestions for alternative methods, such as combining random matrices with low-rank matrices for better results.
- Concerns are raised about the feasibility of using pseudo-random number generators for meaningful data compression.
- Several users draw parallels between this compression technique and human knowledge transfer, emphasizing the search for compact representations.
- Some commenters express frustration with Apple's broader AI development timeline, contrasting it with the advancements in compression technology.
Congrats to Apple and Meta; it makes sense that they did the research, since this will go toward efficient serving of LLMs on phones. And it's very easy to implement.
Does he mean they did pretraining but not fine tuning?
In general, compression using PRNGs is not a thing. There might be a special exception for this case, but I somewhat doubt it. =)
For technical documentation, I'm experimenting with a similar concept: instead of exhaustively documenting every implementation detail, defining a minimal set of principles and architectural decisions that allow "regenerating" the complete understanding.
Current LLMs excel at expanding compressed concepts, but we're still far from finding the optimal balance between explicit knowledge (detailed documentation) and implicit knowledge (patterns and principles). Is anyone working on systems applying similar ideas to technical knowledge management?