January 15th, 2025

Titans: Learning to Memorize at Test Time

The paper presents the Titans architecture, which combines short-term attention with a neural long-term memory that is updated at test time, enabling context windows beyond 2 million tokens, fast parallel training, and better performance than existing models such as Transformers across a range of tasks.

The paper titled "Titans: Learning to Memorize at Test Time" introduces a neural long-term memory module designed to complement attention mechanisms in machine learning models. Traditional recurrent models compress data into a fixed-size memory, while attention mechanisms capture dependencies across the entire context window but are limited by their quadratic cost, which restricts context length. The proposed Titans architecture combines short-term attention with a long-term memory that continues to learn at test time, updating as new tokens arrive so that historical context can be used effectively during inference. This approach enables fast parallel training and fast inference while scaling to context windows larger than 2 million tokens. Experimental results show that Titans outperform both Transformers and modern linear recurrent models across various tasks, including language modeling, common-sense reasoning, genomics, and time series analysis. The study emphasizes the importance of integrating memory into neural architectures to improve accuracy on complex tasks.
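
To make the mechanism concrete, here is a minimal sketch of what "memorizing at test time" can look like: a small memory network whose weights are nudged by gradient steps on an associative-recall loss as tokens stream in, and which is then queried during inference. The module name, loss, step size, and decay factor below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    """Toy long-term memory: an MLP mapping keys to values, updated at test time.

    Illustrative sketch only; the actual Titans memory uses a specific
    surprise-based update that this simplification does not reproduce.
    """

    def __init__(self, dim: int, hidden: int = 256, lr: float = 1e-2, decay: float = 1e-2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
        self.lr = lr        # test-time step size (assumed value)
        self.decay = decay  # forgetting factor (assumed value)

    def write(self, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        """One memorization step on a chunk of (key, value) pairs."""
        params = list(self.net.parameters())
        loss = ((self.net(k) - v) ** 2).mean()      # associative-recall error ("surprise")
        grads = torch.autograd.grad(loss, params)
        with torch.no_grad():
            for p, g in zip(params, grads):
                p.mul_(1.0 - self.decay)            # forget a little
                p.add_(g, alpha=-self.lr)           # memorize the surprising part
        return loss.detach()

    def read(self, q: torch.Tensor) -> torch.Tensor:
        """Retrieve from memory without updating it."""
        with torch.no_grad():
            return self.net(q)

# Stream a long input chunk by chunk, writing each chunk into memory, then query it.
# In Titans the retrieved values are combined with short-term attention rather than
# used on their own.
dim = 64
mem = NeuralMemory(dim)
for _ in range(100):                                # stand-in for a very long token stream
    k, v = torch.randn(32, dim), torch.randn(32, dim)
    mem.write(k, v)
retrieved = mem.read(torch.randn(8, dim))
```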

- The Titans architecture combines short-term attention and long-term memory for improved performance.

- It allows for larger context windows, exceeding 2 million tokens.

- Experimental results show Titans outperform existing models like Transformers in multiple tasks.

- The architecture supports fast parallel training and inference.

- The study highlights the significance of memory integration in machine learning models.

9 comments
By @Ratelman - 3 months
So Minimax just "open-sourced" a model (in quotes because it ships with a custom license I haven't read through) with a context length of 4 million tokens, and it scored 100% on the needle-in-a-haystack problem. It uses lightning attention, so that's still attention, just a variation? So is this potentially not as groundbreaking as the publishers of the paper hoped, or am I missing something fundamental here? Can this scale better? Does it train more efficiently? The test-time inference is amazing - is that what sets this apart, rather than the long-context capability itself? Will it hallucinate a lot less because it stores long-term memory more efficiently, and thus use what it has remembered in context rather than making up facts?
By @marmaduke - 3 months
Similar to RWKV7's new (sub-quadratic) attention mechanism, which models key-value pairs as v ≈ kS’ and does an in-context descent on ||v - kS’||^2 / 2, where the state matrix S is one attentional head (a rough numerical sketch of this update follows below); it is explained in more detail by the author here: https://raw.githubusercontent.com/BlinkDL/RWKV-LM/main/RWKV-...

and i tried to unpack it a bit here https://wdmn.fr/rank-1-take-on-rwkv7s-in-context-learning/
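
As a rough numerical illustration of the rank-1, delta-rule-style update described above (assuming row-vector keys and values, unit-normalized keys, and ignoring the decay and transpose details of the actual RWKV7 head; W_true and eta are made-up quantities for the demo):

```python
import torch

d_k, d_v, eta = 8, 8, 0.5
W_true = torch.randn(d_k, d_v) / d_k ** 0.5   # hidden key-to-value association in the "context"
S = torch.zeros(d_k, d_v)                     # state matrix S: one attentional head

for _ in range(256):                          # stream of (key, value) pairs
    k = torch.randn(1, d_k)
    k = k / k.norm()                          # normalized key keeps the step size stable
    v = k @ W_true                            # value consistent with the hidden association
    err = k @ S - v                           # prediction error for this token
    S = S - eta * k.T @ err                   # gradient of ||v - kS||^2 / 2 w.r.t. S is k^T (kS - v)

query = torch.randn(1, d_k)
print((query @ S - query @ W_true).norm())    # small: S has "memorized" the association in context
```

The per-token state update plays the role of a key-value cache: a query reads from S with a single matrix product instead of attending over all past tokens, which is what keeps the mechanism sub-quadratic.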

By @amai - 3 months
I wonder why the authors felt they needed to use drop caps in this paper. It's a distraction and seems to value style over content.
By @OutOfHere - 3 months
What irks me is when authors use only a needle-in-the-haystack test to assess long-context ability. Humans do a lot more than this when working with a large context: they repeatedly go back and forth over parts of it; it's not a simple single pass.
By @bansuian - 3 months
From the title I thought this was talking about cramming the night before an exam. ;-) Or, if it's an open-book exam, learning during the exam as one goes through the textbook.
By @groceryheist - 3 months
Is it just me, or does this seem like big news?
By @suninsight - 3 months
Key questions:

1. The key data point seems to be Figure 6a, where the paper compares performance on BABILong and claims Titans reaches ~62%, versus ~42% for GPT-4o-mini, at a 100k sequence length.

However, GPT-4o and Claude are missing from this comparison - maybe because they perform better?

2. There is no example provided of the Neural Memory Module in action. This is the first question I would ask of this paper.

By @minroot - 3 months
How are the references sorted?
By @PunchTornado - 3 months
If this was that good, why would Google release it?