December 13th, 2024

New LLM optimization technique slashes memory costs up to 75%

Sakana AI has developed a technique called "universal transformer memory," reducing memory costs for large language models by up to 75% while improving task performance and allowing flexible context optimization.


Researchers at Sakana AI have introduced a new optimization technique for large language models (LLMs) called "universal transformer memory," which reduces memory costs by up to 75%. The technique employs neural attention memory models (NAMMs) to make LLMs more efficient by selectively retaining important information while discarding redundant details from their context. The context window, which is crucial for model performance, can thus be optimized to improve speed and reduce computational expense. NAMMs operate on the attention layers of LLMs, determining which tokens to keep or discard based on their relevance.

The researchers tested the method on the Meta Llama 3-8B model, demonstrating improved performance on natural language and coding tasks while achieving substantial memory savings. The flexibility of NAMMs allows them to be applied across various models without additional training. They also adapt their behavior to the task at hand: for coding, they optimize the context by removing non-essential tokens such as comments and whitespace; for natural language tasks, they eliminate grammatical redundancies.

The researchers have made the code for creating NAMMs publicly available, suggesting that the technique could benefit enterprises dealing with large volumes of data. Future developments may integrate NAMMs during the training of LLMs to further enhance their memory capabilities.

- Sakana AI's new technique reduces memory costs for LLMs by up to 75%.

- Universal transformer memory uses neural attention memory models to optimize context retention.

- The method improves performance on various tasks while saving memory.

- NAMMs adapt their behavior based on the specific task requirements.

- The code for creating NAMMs has been released for public use.
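
At a high level, the idea is to score every token held in the model's KV cache and evict the least useful ones before the next decoding step. The sketch below is a deliberately simplified, hypothetical illustration of that idea, not Sakana AI's NAMM implementation: the learned neural memory model is replaced with a plain "attention received" heuristic, and the function name and keep_ratio parameter are invented for the example.

```python
import torch

def prune_kv_cache(keys, values, attn_weights, keep_ratio=0.25):
    """Hypothetical sketch of attention-based KV-cache pruning.

    keys, values:  (batch, heads, seq_len, head_dim) cached tensors
    attn_weights:  (batch, heads, queries, seq_len) softmaxed attention
    keep_ratio:    fraction of cached tokens to retain
    """
    # Score each cached token by the average attention it has received
    # across heads and recent queries. The actual NAMM uses a learned
    # model over the attention values instead of this simple mean.
    scores = attn_weights.mean(dim=(1, 2))                              # (batch, seq_len)

    n_keep = max(1, int(keys.shape[2] * keep_ratio))
    keep_idx = scores.topk(n_keep, dim=-1).indices.sort(dim=-1).values  # keep original token order

    # Gather the retained tokens for every head and feature dimension.
    idx = keep_idx[:, None, :, None].expand(-1, keys.shape[1], -1, keys.shape[3])
    return keys.gather(2, idx), values.gather(2, idx)
```

With keep_ratio=0.25, roughly three quarters of the cached keys and values are evicted, which is the kind of cache reduction behind the headline 75% figure; the published method decides per attention layer which tokens to drop and adapts that decision to the task.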

AI: What people are saying
The comments on the Sakana AI article reflect a mix of curiosity and skepticism regarding the advancements in memory efficiency for large language models.
  • Some users compare Sakana AI's technique to Microsoft's HeadKV paper, which claims even greater memory reduction.
  • There are questions about the long-term necessity of current power infrastructure for AI data centers given ongoing optimizations.
  • Comments highlight the impressive capability of running advanced language models on less powerful hardware.
  • Users express excitement about the potential for future improvements in machine learning efficiency.
  • Some seek clarification on whether the advancements apply to inference or training processes.
21 comments
By @vlovich123 - 4 months
Wonder how this compares with Microsoft's HeadKV paper [1], which claims a 98% reduction in memory while retaining 97% of the performance.

[1] https://arxiv.org/html/2410.19258v3

By @odyssey7 - 4 months
Is it possible that after 3-4 years of performance optimizations, both in algorithms and in hardware efficiency, it will turn out that we didn’t really need all of the nuclear plants we’re currently in the process of setting up to satisfy the power demands of AI data centers?
By @ComputerGuru - 4 months
This only decreases memory cost of input context window, not the memory cost to load and run the models.
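As a rough illustration of that distinction, here is a back-of-envelope calculation assuming the published Llama 3-8B configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and fp16 for both weights and cache; the numbers are approximate:

```python
# Approximate fp16 memory for Llama 3-8B: model weights vs. KV cache.
bytes_fp16 = 2
layers, kv_heads, head_dim = 32, 8, 128        # assumed Llama 3-8B attention config
params = 8e9

weight_gb = params * bytes_fp16 / 1e9                               # ~16 GB, untouched by cache pruning
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # keys + values: ~128 KiB per token
kv_gb_8k = kv_bytes_per_token * 8192 / 1e9                          # ~1.07 GB at an 8K-token context

print(f"weights: {weight_gb:.0f} GB")
print(f"KV cache @ 8K tokens: {kv_gb_8k:.2f} GB -> {kv_gb_8k * 0.25:.2f} GB after 75% pruning")
```

The pruning only shrinks the context-dependent cache term, which dominates at very long contexts or large batch sizes; the memory needed to hold the weights stays the same.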
By @solarkraft - 4 months
It’s mind-bogglingly crazy that language models rivaling ones that used to require huge GPUs with a ton of VRAM now run on my upper-mid-range laptop from 4 years ago. At usable speed. Crazy.

I didn’t expect capable language models to be practical/possible to run locally, much less on hardware I already have.

By @gcanyon - 4 months
Given that the algorithms powering present LLMs hadn't been invented ten years ago, I have to think that they are (potentially) far from optimal.

Brains have gone through millions of iterations where being efficient was a huge driver of success. We should not be surprised if someone finds a new ML method that is both wildly more efficient and wildly more effective.

By @cs702 - 4 months
Very clever, very meta, and it seems to work really well.

The two big take-aways for me are:

* It's possible to train a model to learn to summarize context from the attention matrix, based only on dot-product scores (k @ q.T * mask), regardless of how tokens are embedded.

* Once the model is trained, it will work with any attention matrix, even if it's the attention matrix of another model.

I've added this to my ever-growing list of things to try.
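
As an illustration of the interface described in those take-aways, here is a minimal, hypothetical sketch of a scorer that consumes only attention statistics and never the token embeddings, which is what makes it transferable between models in principle. It is not the paper's NAMM architecture or training recipe; the feature set and layer sizes are invented for the example:

```python
import torch
import torch.nn as nn

class AttentionOnlyScorer(nn.Module):
    """Hypothetical sketch: emits a keep/drop score per cached token using
    only statistics of the attention matrix, so the same scorer could in
    principle be reused with the attention matrix of a different model."""

    def __init__(self, n_features=4):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, attn):                        # attn: (batch, heads, queries, keys), post-softmax
        per_token = attn.mean(dim=1)                # average over heads -> (batch, queries, keys)
        feats = torch.stack([
            per_token.mean(dim=1),                  # mean attention each key token receives
            per_token.max(dim=1).values,            # peak attention it receives
            per_token.std(dim=1, unbiased=False),   # how uneven that attention is
            (per_token > 1e-3).float().mean(dim=1), # fraction of queries that attend to it at all
        ], dim=-1)                                  # (batch, keys, n_features)
        return self.mlp(feats).squeeze(-1)          # score per cached token
```

Because the inputs are just reductions of the attention matrix, nothing here depends on how a particular model embeds its tokens, which is the property the comment highlights.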

By @iandanforth - 4 months
Boringness classifier! Pretty cool because this implies the large models already know what is useless and what isn't.
By @lawlessone - 4 months
So it's like a garbage collector for prompts?
By @richwater - 4 months
This is for inference right? Not training?
By @hoc - 4 months
And we can finally sell the unoptimized models as Hi-Res (since I can still read the differences!).
By @aussieguy1234 - 4 months
Most people don't remember absolutely everything, just the important stuff.
By @skellington - 4 months
This only reduces the working memory, not the base model itself?
By @Euphorbium - 4 months
Stop words with extra steps.
By @bamboozled - 4 months
Really exciting news.
By @ironfootnz - 4 months
I'm a big fan of their papers; this one didn't disappoint.
By @tharmas - 4 months
Does this mean us plebs can run LLMs on gimped VRAM Nvidia lower end cards?
By @yishanchuan - 4 months
interesting