April 6th, 2025

SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators

Apple's SeedLM is a post-training compression method for large language models that reduces runtime costs, optimizes compute cycles, and maintains performance at high compression levels without requiring calibration data.


SeedLM is a new post-training compression method developed by Apple Machine Learning Research aimed at addressing the high runtime costs associated with deploying large language models (LLMs). The technique encodes and compresses model weights using seeds of a pseudo-random generator. During inference, a Linear Feedback Shift Register (LFSR) expands each seed into a random matrix, which is combined with a few stored coefficients to reconstruct the corresponding weight block. This approach reduces memory access and trades it for cheap compute, speeding up memory-bound tasks. Unlike existing methods that require calibration data, SeedLM operates in a data-free manner and demonstrates strong generalization across various tasks. Experiments on the Llama 3 70B model indicate that SeedLM retains zero-shot accuracy at 4- and 3-bit compression levels, performing comparably to or better than leading methods while approaching the performance of FP16 baselines. Additionally, tests on FPGA platforms show that 4-bit SeedLM achieves close to a 4x speed-up over FP16 Llama 2/3 models as model size increases.
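The decode path described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact scheme: the 16-bit Fibonacci LFSR taps, the 4-bit entry width, and the block/coefficient sizes here are illustrative assumptions.

```python
import numpy as np

def lfsr_bits(seed, n_bits, taps=(16, 14, 13, 11)):
    """Generate n_bits from a 16-bit Fibonacci LFSR (tap choice is illustrative)."""
    state = seed & 0xFFFF
    out = []
    for _ in range(n_bits):
        fb = 0
        for t in taps:          # XOR the tap positions to form the feedback bit
            fb ^= (state >> (t - 1)) & 1
        out.append(state & 1)
        state = (state >> 1) | (fb << 15)
    return out

def random_matrix(seed, rows, cols, bits_per_entry=4):
    """Expand a seed into a rows x cols matrix of small signed integers."""
    bits = lfsr_bits(seed, rows * cols * bits_per_entry)
    vals = []
    for i in range(rows * cols):
        chunk = bits[i * bits_per_entry:(i + 1) * bits_per_entry]
        v = sum(b << j for j, b in enumerate(chunk))
        vals.append(v - (1 << (bits_per_entry - 1)))  # center around zero
    return np.array(vals, dtype=np.float32).reshape(rows, cols)

def reconstruct_block(seed, coeffs, block_len):
    """Weight block ~= U(seed) @ coeffs; only the seed and coeffs are stored."""
    U = random_matrix(seed, block_len, len(coeffs))
    return U @ np.asarray(coeffs, dtype=np.float32)
```

The point is that the basis matrix `U` is never stored: it is regenerated on the fly from a 16-bit seed, so memory traffic is replaced by a handful of shift/XOR operations.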

- SeedLM compresses LLM weights using pseudo-random generator seeds.

- It reduces memory access and speeds up inference by optimizing compute cycles.

- The method is data-free and generalizes well across diverse tasks.

- Experiments show strong performance retention at high compression levels.

- FPGA tests indicate significant speed improvements over traditional FP16 models.

AI: What people are saying
The comments on Apple's SeedLM reveal a mix of skepticism and curiosity about the compression method for large language models.
  • Some commenters appreciate the innovative approach but question the effectiveness of the compression, noting limitations in quantization and tile size.
  • There are suggestions for alternative methods, such as combining random matrices with low-rank matrices for better results.
  • Concerns are raised about the feasibility of using pseudo-random number generators for meaningful data compression.
  • Several users draw parallels between this compression technique and human knowledge transfer, emphasizing the search for compact representations.
  • Some commenters express frustration with Apple's broader AI development timeline, contrasting it with the advancements in compression technology.
13 comments
By @visarga - 20 days
Very interesting trick: a dictionary of basis vectors that are quickly computed from a seed without being stored. But the result is the same 3- or 4-bit quantization, with only a slight improvement. Their tiles are small, just 8 or 12 weights, which is why compression doesn't go further. It would have been great if this trick got quantization below 1 bit/weight, but that would require longer tiles. I wonder what the limits are if we use a larger reservoir of cheap entropy as part of the neural net architecture, even in training.

Congrats to Apple and Meta, makes sense they did the research, this will go towards efficient serving of LLMs on phones. And it's very easy to implement.

By @gblargg - 20 days
It sounds like they basically find the part of a pseudo-random sequence that is closest to the desired data, then store the random seed and corrections (which are small, so they take less space).
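That intuition can be sketched as a brute-force search over seeds, with the "corrections" being a few least-squares coefficients. This is a toy illustration, assuming NumPy's generator as a stand-in for the paper's LFSR; the function names and the candidate-seed range are made up here.

```python
import numpy as np

def prng_matrix(seed, rows, cols):
    # Stand-in PRNG; the paper uses an LFSR, NumPy's generator is used here for brevity.
    return np.random.default_rng(seed).standard_normal((rows, cols)).astype(np.float32)

def encode_block(w, candidate_seeds, n_coeffs=3):
    """Pick the seed whose pseudo-random basis best reconstructs w,
    fitting the small 'correction' coefficients by least squares."""
    best = None
    for seed in candidate_seeds:
        U = prng_matrix(seed, len(w), n_coeffs)
        t, *_ = np.linalg.lstsq(U, w.astype(np.float32), rcond=None)
        err = float(np.linalg.norm(U @ t - w))
        if best is None or err < best[0]:
            best = (err, seed, t)
    _, seed, t = best
    return seed, t  # all that is stored: one seed plus a few coefficients

def decode_block(seed, t, block_len):
    return prng_matrix(seed, block_len, len(t)) @ t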
By @elashri - 20 days
I think it would be better to just link directly to the paper [1]. It is a work by researchers at Apple and Meta.

[1] https://arxiv.org/abs/2410.10714

By @torginus - 20 days
This sounds like compression with extra steps. What makes this technique particular to LLM weights rather than general-purpose data?
By @benob - 20 days
A variant I have been thinking of: each parameter matrix (or block) is the sum of a random matrix (generated from a seed) and a low rank matrix (a LoRA). I'd like to experiment training from scratch in that setting.
By @EGreg - 20 days
What did Zuck mean when he said Llama 4 Behemoth is already the highest-performing base model and hasn't even finished training yet? What are the benchmarks, then?

Does he mean they did pretraining but not fine tuning?

By @htrp - 20 days
Also from October 2024 (https://arxiv.org/abs/2410.10714)
By @_0ffh - 20 days
I suspect an April fools joke.

In general, compression using PRNGs is not a thing. There might be a special exception for this case, but I somewhat doubt it. =)

By @RainyDayTmrw - 20 days
How do you reconcile this with the (I believe) widely accepted idea that you can't meaningfully compress data using offsets into Pi?
By @anshumankmr - 20 days
all this and they can't launch Apple Intelligence on schedule :(
By @jlcases - 20 days
This compression approach reminds me of similarities with human knowledge transfer. In both cases, we're looking for compact representations that can reconstruct complex information.

For technical documentation, I'm experimenting with a similar concept: instead of exhaustively documenting every implementation detail, defining a minimal set of principles and architectural decisions that allow "regenerating" the complete understanding.

Current LLMs excel at expanding compressed concepts, but we're still far from finding the optimal balance between explicit knowledge (detailed documentation) and implicit knowledge (patterns and principles). Is anyone working on systems applying similar ideas to technical knowledge management?