New LLM optimization technique slashes memory costs up to 75%
Sakana AI has developed a technique called "universal transformer memory," which reduces memory costs for large language models by up to 75% while improving task performance and allowing flexible context optimization.
Researchers at Sakana AI have introduced a new optimization technique for large language models (LLMs) called "universal transformer memory," which reduces memory costs by up to 75%. The technique employs neural attention memory models (NAMMs) to improve the efficiency of LLMs by selectively retaining important information while discarding redundant details from their context. The context window, which is crucial for model performance, can be optimized to improve speed and reduce computational expense. NAMMs operate on the attention layers of LLMs, determining which tokens to keep or discard based on their relevance.

The researchers tested the method on the Meta Llama 3-8B model, demonstrating improved performance on natural language and coding tasks while also achieving substantial memory savings. NAMMs are flexible enough to be applied across various models without additional training. They also adapt their behavior to the task at hand, optimizing the context for coding by removing non-essential tokens such as comments and whitespace, and for natural language tasks by eliminating grammatical redundancies.

The researchers have made the code for creating NAMMs publicly available, suggesting the technique could be useful for enterprises dealing with large volumes of data. Future work may involve integrating NAMMs during the training of LLMs to further enhance their memory capabilities.
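The mechanism described above amounts to scoring tokens in each attention layer and evicting low-value ones from the KV cache. The snippet below is only a minimal sketch of that general idea, not Sakana AI's NAMM implementation (NAMMs learn the scoring function rather than using a heuristic); the function name, the attention-sum score, and the keep_ratio parameter are illustrative assumptions.

```python
# Minimal sketch: prune an attention layer's KV cache by keeping only the tokens
# that attract the most attention. NOT Sakana AI's NAMM code; NAMMs learn which
# tokens to keep, whereas this uses a simple heuristic score for illustration.
import torch

def prune_kv_cache(keys, values, attn_weights, keep_ratio=0.25):
    """keys, values:  (seq_len, d) cached projections for one attention head.
    attn_weights:     (num_queries, seq_len) softmaxed attention matrix.
    keep_ratio:       fraction of tokens retained (0.25 ~ a 75% cache saving)."""
    token_scores = attn_weights.sum(dim=0)                       # attention each cached token receives
    n_keep = max(1, int(keys.size(0) * keep_ratio))
    keep_idx = token_scores.topk(n_keep).indices.sort().values   # keep original token order
    return keys[keep_idx], values[keep_idx]

# Toy usage: a 16-token cache shrinks to 4 tokens (75% smaller).
seq_len, d = 16, 8
k, v, q = torch.randn(seq_len, d), torch.randn(seq_len, d), torch.randn(4, d)
attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)
k_small, v_small = prune_kv_cache(k, v, attn)
print(k_small.shape, v_small.shape)   # torch.Size([4, 8]) torch.Size([4, 8])
```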
- Sakana AI's new technique reduces memory costs for LLMs by up to 75%.
- Universal transformer memory uses neural attention memory models to optimize context retention.
- The method improves performance on various tasks while saving memory.
- NAMMs adapt their behavior based on the specific task requirements.
- The code for creating NAMMs has been released for public use.
Related
Memory^3: Language Modeling with Explicit Memory
The paper introduces Memory^3, a novel approach for large language models, using explicit memory to reduce training costs. It outperforms traditional models, emphasizing knowledge externalization and innovative techniques for memory enhancement.
Llama 3.1: Our most capable models to date
Meta has launched Llama 3.1 405B, an advanced open-source AI model supporting diverse languages and extended context length. It introduces new features like Llama Guard 3 and aims to enhance AI applications with improved models and partnerships.
The Role of Anchor Tokens in Self-Attention Networks
The paper "Anchor-based Large Language Models" presents AnLLMs, which enhance efficiency by compressing sequence information, achieving up to 99% cache reduction and 3.5 times faster inference, while maintaining accuracy.
What happens if we remove 50 percent of Llama?
Neural Magic launched Sparse Llama 3.1, a sparse model from Meta's Llama 3.1, achieving 98% accuracy recovery with 50% fewer parameters, optimized for NVIDIA GPUs, enhancing throughput and latency significantly.
Meta unveils a new, more efficient Llama model
Meta has launched the Llama 3.3 70B generative AI model, outperforming competitors while reducing costs. The company is investing $10 billion in AI infrastructure amid regulatory challenges in the EU.
- Some users compare Sakana AI's technique to Microsoft's HeadKV paper, which claims even greater memory reduction.
- There are questions about the long-term necessity of current power infrastructure for AI data centers given ongoing optimizations.
- Comments highlight the impressive capability of running advanced language models on less powerful hardware.
- Users express excitement about the potential for future improvements in machine learning efficiency.
- Some seek clarification on whether the advancements apply to inference or training processes.
I didn’t expect capable language models to be practical/possible to run locally, much less on hardware I already have.
Brains have gone through millions of iterations where being efficient was a huge driver of success. We should not be surprised if someone finds a new ML method that is both wildly more efficient and wildly more effective.
The two big take-aways for me are:
* It's possible to train a model to learn to summarize context from the attention matrix, based only on dot-product scores (k @ q.T * mask), regardless of how tokens are embedded.
* Once the model is trained, it will work with any attention matrix, even if it's the attention matrix of another model.
I've added this to my ever-growing list of things to try.
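A rough illustration of those two takeaways, assuming a toy scorer of my own devising (the TokenScorer module and its two features are illustrative, not the NAMM design from the paper): because the scorer only ever sees the raw dot-product score matrix, never the token embeddings, the same module can be applied to attention matrices from models of different widths.

```python
# Hedged sketch of the takeaways above: a small learned scorer that consumes only
# raw attention scores (k @ q.T * mask), never token embeddings, so it is agnostic
# to d_model and could in principle be reused across models. The architecture and
# features here are illustrative assumptions, not the paper's actual NAMM.
import torch
import torch.nn as nn

class TokenScorer(nn.Module):
    """Maps per-token attention statistics to keep/drop logits."""
    def __init__(self, n_features: int = 2, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (seq_len, num_queries) raw dot-product scores, already masked.
        # Features depend only on the scores, not on embedding size or vocabulary.
        feats = torch.stack([scores.mean(dim=1), scores.max(dim=1).values], dim=-1)
        return self.net(feats).squeeze(-1)           # (seq_len,) keep/drop logits

# The same (here untrained) scorer runs on attention matrices from "different models".
scorer = TokenScorer()
for d_model, seq_len in [(64, 32), (128, 48)]:
    q, k = torch.randn(8, d_model), torch.randn(seq_len, d_model)
    mask = torch.ones(seq_len, 8)                    # stand-in for a padding/causal mask
    scores = (k @ q.T) * mask                        # the dot-product scores: k @ q.T * mask
    print(scorer(scores).shape)                      # torch.Size([32]) then torch.Size([48])
```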