December 13th, 2024

New LLM optimization technique slashes memory costs up to 75%

Sakana AI has developed a technique called "universal transformer memory," reducing memory costs for large language models by up to 75% while improving task performance and allowing flexible context optimization.


Researchers at Sakana AI have introduced a new optimization technique for large language models (LLMs) called "universal transformer memory," which reduces memory costs by up to 75%. The technique employs neural attention memory models (NAMMs) to make LLMs more efficient by selectively retaining important information while discarding redundant details from their context. The context window, which is crucial for model performance, can thus be optimized to improve speed and reduce computational expense. NAMMs operate on the attention layers of LLMs, determining which tokens to keep or discard based on their relevance.

The researchers tested the method on the Meta Llama 3-8B model, demonstrating improved performance on natural language and coding tasks while achieving substantial memory savings. The flexibility of NAMMs allows them to be applied across various models without additional training. They also adapt their behavior to the task at hand: for coding, they optimize the context by removing non-essential tokens such as comments and whitespace; for natural language tasks, they eliminate grammatical redundancies.

The researchers have made the code for creating NAMMs publicly available, suggesting that the technique could benefit enterprises dealing with large volumes of data. Future developments may integrate NAMMs during the training of LLMs to further enhance their memory capabilities.

- Sakana AI's new technique reduces memory costs for LLMs by up to 75%.

- Universal transformer memory uses neural attention memory models to optimize context retention.

- The method improves performance on various tasks while saving memory.

- NAMMs adapt their behavior based on the specific task requirements.

- The code for creating NAMMs has been released for public use.
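
At a high level, the idea is to score every token held in the model's KV cache and evict the least useful ones before the next decoding step. The sketch below is a deliberately simplified, hypothetical illustration of that idea, not Sakana AI's NAMM implementation: the learned neural memory model is replaced with a plain "attention received" heuristic, and the function name and keep_ratio parameter are invented for the example.

```python
import torch

def prune_kv_cache(keys, values, attn_weights, keep_ratio=0.25):
    """Hypothetical sketch of attention-based KV-cache pruning.

    keys, values:  (batch, heads, seq_len, head_dim) cached tensors
    attn_weights:  (batch, heads, queries, seq_len) softmaxed attention
    keep_ratio:    fraction of cached tokens to retain
    """
    # Score each cached token by the average attention it has received
    # across heads and recent queries. The actual NAMM uses a learned
    # model over the attention values instead of this simple mean.
    scores = attn_weights.mean(dim=(1, 2))                              # (batch, seq_len)

    n_keep = max(1, int(keys.shape[2] * keep_ratio))
    keep_idx = scores.topk(n_keep, dim=-1).indices.sort(dim=-1).values  # keep original token order

    # Gather the retained tokens for every head and feature dimension.
    idx = keep_idx[:, None, :, None].expand(-1, keys.shape[1], -1, keys.shape[3])
    return keys.gather(2, idx), values.gather(2, idx)
```

With keep_ratio=0.25, roughly three quarters of the cached keys and values are evicted, which is the kind of cache reduction behind the headline 75% figure; the published method decides per attention layer which tokens to drop and adapts that decision to the task.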

AI: What people are saying
The comments on the Sakana AI article reflect a mix of curiosity and skepticism regarding the advancements in memory efficiency for large language models.
  • Some users compare Sakana AI's technique to Microsoft's HeadKV paper, which claims even greater memory reduction.
  • There are questions about the long-term necessity of current power infrastructure for AI data centers given ongoing optimizations.
  • Comments highlight the impressive capability of running advanced language models on less powerful hardware.
  • Users express excitement about the potential for future improvements in machine learning efficiency.
  • Some seek clarification on whether the advancements apply to inference or training processes.
21 comments
By @vlovich123 - 4 months
Wonder how this compares with Microsoft's HeadKV paper [1], which claims a 98% reduction in memory while retaining 97% of the performance.

[1] https://arxiv.org/html/2410.19258v3

By @odyssey7 - 4 months
Is it possible that after 3-4 years of performance optimizations, both in algorithms and in hardware efficiency, it will turn out that we didn’t really need all of the nuclear plants we’re currently in the process of setting up to satisfy the power demands of AI data centers?
By @ComputerGuru - 4 months
This only decreases memory cost of input context window, not the memory cost to load and run the models.
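As a rough illustration of that distinction, here is a back-of-envelope calculation assuming the published Llama 3-8B configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and fp16 for both weights and cache; the numbers are approximate:

```python
# Approximate fp16 memory for Llama 3-8B: model weights vs. KV cache.
bytes_fp16 = 2
layers, kv_heads, head_dim = 32, 8, 128        # assumed Llama 3-8B attention config
params = 8e9

weight_gb = params * bytes_fp16 / 1e9                               # ~16 GB, untouched by cache pruning
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # keys + values: ~128 KiB per token
kv_gb_8k = kv_bytes_per_token * 8192 / 1e9                          # ~1.07 GB at an 8K-token context

print(f"weights: {weight_gb:.0f} GB")
print(f"KV cache @ 8K tokens: {kv_gb_8k:.2f} GB -> {kv_gb_8k * 0.25:.2f} GB after 75% pruning")
```

The pruning only shrinks the context-dependent cache term, which dominates at very long contexts or large batch sizes; the memory needed to hold the weights stays the same.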
By @solarkraft - 4 months
It’s mind-bogglingly crazy that language models rivaling ones that used to require huge GPUs with a ton of VRAM now run on my upper-mid-range laptop from 4 years ago. At usable speed. Crazy.

I didn’t expect capable language models to be practical/possible to run locally, much less on hardware I already have.

By @gcanyon - 4 months
Given that the algorithms powering present LLMs hadn't been invented ten years ago, I have to think that they are (potentially) far from optimal.

Brains have gone through millions of iterations where being efficient was a huge driver of success. We should not be surprised if someone finds a new ML method that is both wildly more efficient and wildly more effective.

By @cs702 - 4 months
Very clever, very meta, and it seems to work really well.

The two big take-aways for me are:

* It's possible to train a model to learn to summarize context from the attention matrix, based only on dot-product scores (k @ q.T * mask), regardless of how tokens are embedded.

* Once the model is trained, it will work with any attention matrix, even if it's the attention matrix of another model.

I've added this to my ever-growing list of things to try.
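
As an illustration of the interface described in those take-aways, here is a minimal, hypothetical sketch of a scorer that consumes only attention statistics and never the token embeddings, which is what makes it transferable between models in principle. It is not the paper's NAMM architecture or training recipe; the feature set and layer sizes are invented for the example:

```python
import torch
import torch.nn as nn

class AttentionOnlyScorer(nn.Module):
    """Hypothetical sketch: emits a keep/drop score per cached token using
    only statistics of the attention matrix, so the same scorer could in
    principle be reused with the attention matrix of a different model."""

    def __init__(self, n_features=4):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, attn):                        # attn: (batch, heads, queries, keys), post-softmax
        per_token = attn.mean(dim=1)                # average over heads -> (batch, queries, keys)
        feats = torch.stack([
            per_token.mean(dim=1),                  # mean attention each key token receives
            per_token.max(dim=1).values,            # peak attention it receives
            per_token.std(dim=1, unbiased=False),   # how uneven that attention is
            (per_token > 1e-3).float().mean(dim=1), # fraction of queries that attend to it at all
        ], dim=-1)                                  # (batch, keys, n_features)
        return self.mlp(feats).squeeze(-1)          # score per cached token
```

Because the inputs are just reductions of the attention matrix, nothing here depends on how a particular model embeds its tokens, which is the property the comment highlights.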

By @iandanforth - 4 months
Boringness classifier! Pretty cool because this implies the large models already know what is useless and what isn't.
By @lawlessone - 4 months
So it's like a garbage collector for prompts?
By @richwater - 4 months
This is for inference right? Not training?
By @hoc - 4 months
And we can finally sell the unoptimized models as Hi-Res (since I can still read the differences!).
By @aussieguy1234 - 4 months
Most people don't remember absolutely everything, just the important stuff.
By @skellington - 4 months
This only reduces the working memory, not the base model itself?
By @Euphorbium - 4 months
Stop words with extra steps.
By @bamboozled - 4 months
Really exciting news.
By @ironfootnz - 4 months
I'm a big fan of their papers; this one didn't disappoint.
By @tharmas - 4 months
Does this mean us plebs can run LLMs on gimped VRAM Nvidia lower end cards?
By @yishanchuan - 4 months
interesting