SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators
Apple's SeedLM is a post-training compression method for large language models that reduces runtime costs, optimizes compute cycles, and maintains performance at high compression levels without requiring calibration data.
SeedLM is a new post-training compression method from Apple Machine Learning Research aimed at reducing the high runtime costs of deploying large language models (LLMs). The technique encodes and compresses model weights into seeds of a pseudo-random generator. During inference, a Linear Feedback Shift Register (LFSR) expands each seed into a pseudo-random matrix, which is combined with a small set of compressed coefficients to reconstruct the corresponding weight block. Regenerating weights this way reduces memory access in exchange for extra compute, speeding up memory-bound tasks. Unlike existing methods that require calibration data, SeedLM is data-free and generalizes well across tasks. Experiments on the Llama 3 70B model show that SeedLM retains zero-shot accuracy at 4- and 3-bit compression levels, performing comparably to or better than leading methods while matching FP16 baselines. Tests on FPGA platforms further show that 4-bit SeedLM approaches a 4x speed-up over FP16 Llama 2/3 models as model size increases.
- SeedLM compresses LLM weights using pseudo-random generator seeds.
- It reduces memory access and speeds up inference by optimizing compute cycles.
- The method is data-free and generalizes well across diverse tasks.
- Experiments show strong performance retention at high compression levels.
- FPGA tests indicate significant speed improvements over traditional FP16 models.
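To make the reconstruction step concrete, here is a minimal illustrative sketch in Python: an LFSR expands a stored seed into a pseudo-random matrix, and a small vector of coefficients mixes its columns to rebuild a weight block. The LFSR width and tap positions, block size, latent dimension, exhaustive seed search, and unquantized coefficients below are simplifying assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def lfsr_bits(seed, n_bits, width=16, taps=(16, 14, 13, 11)):
    """Generate a pseudo-random bit stream from a Fibonacci LFSR.

    The register width and tap positions are illustrative choices,
    not necessarily the configuration used in the paper.
    """
    state = seed & ((1 << width) - 1)
    assert state != 0, "LFSR state must be non-zero"
    out = np.empty(n_bits, dtype=np.float64)
    for i in range(n_bits):
        out[i] = state & 1
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1  # XOR the tapped bits
        state = (state >> 1) | (fb << (width - 1))
    return out

def lfsr_matrix(seed, rows, cols):
    """Expand a seed into a {-1, +1} pseudo-random matrix U(seed)."""
    return (2.0 * lfsr_bits(seed, rows * cols) - 1.0).reshape(rows, cols)

def compress_block(w, latent_dim=3, seed_bits=8):
    """Search candidate seeds; for each, fit coefficients t by least squares
    so that U(seed) @ t approximates the weight block w, and keep the best fit.

    A real implementation would search a larger seed space and quantize t
    to a few bits; both are simplified here for clarity.
    """
    best = None
    for seed in range(1, 1 << seed_bits):
        U = lfsr_matrix(seed, w.size, latent_dim)
        t, *_ = np.linalg.lstsq(U, w, rcond=None)
        err = float(np.linalg.norm(U @ t - w))
        if best is None or err < best[0]:
            best = (err, seed, t)
    return best[1], best[2]

def decompress_block(seed, t, block_size):
    """Inference-time reconstruction: regenerate U(seed) on the fly and
    compute w_hat = U(seed) @ t instead of fetching full-precision weights."""
    return lfsr_matrix(seed, block_size, t.size) @ t

# Toy round trip on a single block of 8 weights.
block = np.random.randn(8)
seed, t = compress_block(block)
print("seed:", seed)
print("original:     ", np.round(block, 3))
print("reconstructed:", np.round(decompress_block(seed, t, block.size), 3))
```

The appeal for memory-bound decoding follows from the storage accounting: at roughly 4 bits per weight versus 16 bits for FP16, memory traffic drops about 4x, which lines up with the near-4x FPGA speed-up reported for large models.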
Related
LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs
The paper presents an FPGA-based accelerator for large language models, achieving 14.3-15.8 times speedup and 6.1 times power efficiency, enhancing deployment in resource-constrained environments.
Serving 70B-Scale LLMs Efficiently on Low-Resource Edge Devices [pdf]
The paper presents TPI-LLM, a system for efficiently running 70B-scale LLMs on low-resource edge devices, reducing memory requirements by 90% and improving latency through tensor parallelism and local data handling.
SmolLM2
SmolLM2 is a new family of lightweight language models from Hugging Face, available in three sizes, trained on 11 trillion tokens, and designed for on-device operation with accessible model weights.
What happens if we remove 50 percent of Llama?
Neural Magic launched Sparse Llama 3.1, a sparse model from Meta's Llama 3.1, achieving 98% accuracy recovery with 50% fewer parameters, optimized for NVIDIA GPUs, enhancing throughput and latency significantly.
SepLLM: Accelerate LLMs by Compressing One Segment into One Separator
SepLLM is a framework that enhances Large Language Models by compressing text into special tokens, reducing computational demands, improving inference speed, and achieving over 50% reduction in KV cache usage.
- Some commenters appreciate the innovative approach but question the effectiveness of the compression, noting limitations in quantization and tile size.
- There are suggestions for alternative methods, such as combining random matrices with low-rank matrices for better results.
- Concerns are raised about the feasibility of using pseudo-random number generators for meaningful data compression.
- Several users draw parallels between this compression technique and human knowledge transfer, emphasizing the search for compact representations.
- Some commenters express frustration with Apple's broader AI development timeline, contrasting it with the advancements in compression technology.
Congrats to Apple and Meta; it makes sense that they did the research, since this will go toward efficient serving of LLMs on phones. And it's very easy to implement.
Does he mean they did pretraining but not fine tuning?
In general, compression using PRNGs is not a thing. There might be a special exception for this case, but I somewhat doubt it. =)
For technical documentation, I'm experimenting with a similar concept: instead of exhaustively documenting every implementation detail, defining a minimal set of principles and architectural decisions that allow "regenerating" the complete understanding.
Current LLMs excel at expanding compressed concepts, but we're still far from finding the optimal balance between explicit knowledge (detailed documentation) and implicit knowledge (patterns and principles). Is anyone working on systems applying similar ideas to technical knowledge management?