January 13th, 2025

Titans: Learning to Memorize at Test Time

The "Titans" paper presents a neural memory module that enhances attention mechanisms, outperforming Transformers and linear models in tasks requiring large context windows, achieving higher accuracy in various applications.


The paper titled "Titans: Learning to Memorize at Test Time" introduces a neural long-term memory module designed to enhance attention-based models. Traditional recurrent models compress data into a fixed-size memory, while attention mechanisms capture dependencies across the entire context window but incur quadratic cost, which restricts context length. The proposed Titans architecture combines short-term attention with a long-term memory that keeps learning during inference: the memory is updated at test time, driven by how surprising the incoming data is, so historical context can be used without re-attending to it. The authors present three variants of the Titans architecture, differing in how the memory is incorporated, and evaluate them on language modeling, common-sense reasoning, genomics, and time series tasks. Experimental results indicate that Titans outperform both Transformers and modern linear recurrent models, and remain accurate on tasks whose context windows exceed 2 million tokens. This suggests a meaningful improvement in handling long-range dependencies at scale.

- The Titans architecture integrates short-term attention and long-term memory for improved performance.

- It allows for effective utilization of historical context during inference.

- Experimental results show Titans outperform Transformers and linear recurrent models.

- The architecture can handle context windows larger than 2 million tokens.

- Titans demonstrate higher accuracy in complex tasks requiring extensive context.
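
To make the "memorize at test time" idea more concrete, here is a minimal sketch of a surprise-driven memory update. It assumes a toy MLP memory, illustrative key/value projections, and made-up hyperparameters (theta, eta, alpha); it shows the general pattern described above, not the paper's exact update rule.

```python
# Minimal sketch: a small memory network updated online at test time, driven by
# a "surprise" signal (the gradient of an associative loss), with momentum and
# a decay (forgetting) term. All names and hyperparameters are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 32
memory = nn.Sequential(nn.Linear(d_model, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
params = list(memory.parameters())

# Fixed projections mapping an incoming token representation to a key/value pair.
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

theta, eta, alpha = 0.1, 0.9, 0.01                 # inner-loop lr, surprise momentum, forget rate
momentum = [torch.zeros_like(p) for p in params]

def memorize(x_t):
    """One test-time step: push the memory toward mapping key(x_t) -> value(x_t)."""
    k, v = W_k(x_t), W_v(x_t)
    loss = ((memory(k) - v) ** 2).mean()           # "surprise": how badly the memory predicts v from k
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, m, g in zip(params, momentum, grads):
            m.mul_(eta).add_(g, alpha=-theta)      # accumulate surprise with momentum
            p.mul_(1.0 - alpha).add_(m)            # decayed (forgetting) parameter update
    return loss.item()

# Stream a few tokens through the memory at "test time".
for step in range(5):
    x_t = torch.randn(d_model)
    print(step, round(memorize(x_t), 4))
```

The point of the sketch is only that the memory's parameters change during inference, with the size of each update tied to how poorly the memory already predicts the incoming data.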

2 comments
By @gwern - 3 months
By @cs702 - 3 months
Interesting. I like the idea of a meta-mechanism that learns to update an associative memory based on how surprising the data is. The other stuff, reading memory via keys and values and selectively erasing it with gating, looks pretty conventional at first glance. Thank you for sharing this on HN. I've added it to my reading list.

EDIT: I'm reminded of this other type of associative memory: https://github.com/glassroom/heinsen_routing. The idea there is to compute a mixture of memories that best predicts the given input sequence. Quite frankly, I don't remember how the whole thing works, but I do remember that it works. It's been a while since I used it, so YMMV. In any case, it may be of interest to you.
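
The key/value reads and gated erasure mentioned in the comment above can be pictured with a toy matrix-valued associative memory. The sketch below is my own illustration and is not specific to Titans or to heinsen_routing.

```python
# Toy associative memory: a matrix M stores key -> value associations.
# Writing partially erases whatever is stored under a key (scaled by a gate),
# then adds the new association; reading projects a query onto the memory.
import torch

torch.manual_seed(0)
d = 16
M = torch.zeros(d, d)                      # matrix memory: value = M @ key

def write(M, k, v, gate):
    """Gated write: selectively erase what is stored under key k, then store v."""
    k = k / k.norm()
    old_v = M @ k                          # value currently associated with k
    M = M - gate * torch.outer(old_v, k)   # selective erasure, scaled by the gate
    return M + torch.outer(v, k)           # write the new association

def read(M, q):
    """Read by projecting a (normalized) query onto the stored associations."""
    return M @ (q / q.norm())

k, v = torch.randn(d), torch.randn(d)
M = write(M, k, v, gate=1.0)
print(torch.allclose(read(M, k), v, atol=1e-5))   # recovers v for the matching key
```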