January 15th, 2025

Titans: Learning to Memorize at Test Time

The paper presents the Titans architecture, which combines short-term attention with a neural long-term memory that is updated at test time, enabling context windows beyond 2 million tokens, fast parallel training, and better performance than existing models such as Transformers across a range of tasks.

The paper titled "Titans: Learning to Memorize at Test Time" introduces a neural long-term memory module designed to complement attention mechanisms in machine learning models. Traditional recurrent models compress data into a fixed-size memory, while attention mechanisms capture dependencies across the entire context window but are limited by their quadratic cost, which restricts context length. The proposed Titans architecture combines short-term attention with a long-term memory that continues to learn at test time, updating as new tokens arrive so that historical context can be used effectively during inference. This approach enables fast parallel training and fast inference while scaling to context windows larger than 2 million tokens. Experimental results show that Titans outperform both Transformers and modern linear recurrent models across various tasks, including language modeling, common-sense reasoning, genomics, and time series analysis. The study emphasizes the importance of integrating memory into neural architectures to improve accuracy on complex tasks.
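
To make the mechanism concrete, here is a minimal sketch of what "memorizing at test time" can look like: a small memory network whose weights are nudged by gradient steps on an associative-recall loss as tokens stream in, and which is then queried during inference. The module name, loss, step size, and decay factor below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    """Toy long-term memory: an MLP mapping keys to values, updated at test time.

    Illustrative sketch only; the actual Titans memory uses a specific
    surprise-based update that this simplification does not reproduce.
    """

    def __init__(self, dim: int, hidden: int = 256, lr: float = 1e-2, decay: float = 1e-2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
        self.lr = lr        # test-time step size (assumed value)
        self.decay = decay  # forgetting factor (assumed value)

    def write(self, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        """One memorization step on a chunk of (key, value) pairs."""
        params = list(self.net.parameters())
        loss = ((self.net(k) - v) ** 2).mean()      # associative-recall error ("surprise")
        grads = torch.autograd.grad(loss, params)
        with torch.no_grad():
            for p, g in zip(params, grads):
                p.mul_(1.0 - self.decay)            # forget a little
                p.add_(g, alpha=-self.lr)           # memorize the surprising part
        return loss.detach()

    def read(self, q: torch.Tensor) -> torch.Tensor:
        """Retrieve from memory without updating it."""
        with torch.no_grad():
            return self.net(q)

# Stream a long input chunk by chunk, writing each chunk into memory, then query it.
# In Titans the retrieved values are combined with short-term attention rather than
# used on their own.
dim = 64
mem = NeuralMemory(dim)
for _ in range(100):                                # stand-in for a very long token stream
    k, v = torch.randn(32, dim), torch.randn(32, dim)
    mem.write(k, v)
retrieved = mem.read(torch.randn(8, dim))
```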

- The Titans architecture combines short-term attention and long-term memory for improved performance.

- It allows for larger context windows, exceeding 2 million tokens.

- Experimental results show Titans outperform existing models like Transformers in multiple tasks.

- The architecture supports fast parallel training and inference.

- The study highlights the significance of memory integration in machine learning models.

9 comments
By @Ratelman - 3 months
So Minimax just "open-sourced" a model (in quotes because it ships with a custom license I haven't read through) with a context length of 4 million tokens, and it scored 100% on the needle-in-a-haystack problem. It uses lightning attention, so that's still attention, just a variation? So is this potentially not as groundbreaking as the publishers of the paper hoped, or am I missing something fundamental here? Can this scale better? Does it train more efficiently? The test-time inference is amazing - is that what sets this apart, rather than the long-context capability itself? Will it hallucinate a lot less because it stores long-term memory more efficiently, and thus use what it has remembered in context rather than making up facts?
By @marmaduke - 3 months
Similar to RWKV7's new (sub-quadratic) attention mechanism, which models key-value pairs as v ≈ kS’ and does an in-context descent on ||v - kS’||^2 / 2, where the state matrix S is one attentional head (a rough numerical sketch of this update follows below); it is explained in more detail by the author here: https://raw.githubusercontent.com/BlinkDL/RWKV-LM/main/RWKV-...

and i tried to unpack it a bit here https://wdmn.fr/rank-1-take-on-rwkv7s-in-context-learning/
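
As a rough numerical illustration of the rank-1, delta-rule-style update described above (assuming row-vector keys and values, unit-normalized keys, and ignoring the decay and transpose details of the actual RWKV7 head; W_true and eta are made-up quantities for the demo):

```python
import torch

d_k, d_v, eta = 8, 8, 0.5
W_true = torch.randn(d_k, d_v) / d_k ** 0.5   # hidden key-to-value association in the "context"
S = torch.zeros(d_k, d_v)                     # state matrix S: one attentional head

for _ in range(256):                          # stream of (key, value) pairs
    k = torch.randn(1, d_k)
    k = k / k.norm()                          # normalized key keeps the step size stable
    v = k @ W_true                            # value consistent with the hidden association
    err = k @ S - v                           # prediction error for this token
    S = S - eta * k.T @ err                   # gradient of ||v - kS||^2 / 2 w.r.t. S is k^T (kS - v)

query = torch.randn(1, d_k)
print((query @ S - query @ W_true).norm())    # small: S has "memorized" the association in context
```

The per-token state update plays the role of a key-value cache: a query reads from S with a single matrix product instead of attending over all past tokens, which is what keeps the mechanism sub-quadratic.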

By @amai - 3 months
I wonder why the authors felt they needed to use drop caps in this paper. It's a distraction and seems to value style over content.
By @OutOfHere - 3 months
What irks me is when authors use only a needle-in-the-haystack test to assess long-context ability. Humans do a lot more than this when working with a large context: they repeatedly go back and forth over parts of it; it's not a simple single pass.
By @bansuian - 3 months
From the title I thought this was talking about cramming the night before an exam. ;-) Or, if it's an open-book exam, learning during the exam as one goes through the textbook.
By @groceryheist - 3 months
Is it just me, or does this seem like big news?
By @suninsight - 3 months
Key questions:

1. The key data point seems to be Figure 6a, where the paper compares performance on BABILong and claims Titans reaches ~62%, versus ~42% for GPT-4o-mini, at a 100k sequence length.

However, GPT-4o and Claude are missing from this comparison - maybe because they perform better?

2. There is no example provided of the Neural Memory Module in action. This is the first question I would ask of this paper.

By @minroot - 3 months
How are the references sorted?
By @PunchTornado - 3 months
If this was that good, why would Google release it?