April 2nd, 2025

Multi-Token Attention

The "Multi-Token Attention" paper introduces a new attention mechanism for large language models, improving performance in language modeling and information retrieval by conditioning on multiple query and key vectors simultaneously.


The paper titled "Multi-Token Attention" by Olga Golovneva and colleagues introduces a novel attention mechanism designed to enhance the performance of large language models (LLMs). Traditional soft attention relies on single-token attention, where each attention weight is determined by the similarity between one query vector and one key vector, which limits the information available for distinguishing relevant context. The proposed Multi-Token Attention (MTA) method addresses this limitation by allowing LLMs to condition attention weights on multiple query and key vectors simultaneously. This is achieved through convolution operations that let nearby queries and keys influence each other's attention weights, resulting in a more nuanced use of context. The authors demonstrate that MTA significantly improves performance on various benchmarks, particularly in language modeling tasks and in scenarios requiring information retrieval from long contexts. The findings suggest that MTA's ability to leverage richer information leads to superior outcomes compared to traditional Transformer models.

- Multi-Token Attention (MTA) enhances attention mechanisms in large language models.

- MTA allows simultaneous conditioning on multiple query and key vectors.

- The method improves performance on language modeling and information retrieval tasks.

- MTA outperforms traditional Transformer models in various benchmarks.

- The approach utilizes convolution operations for more nuanced attention weight determination.
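To make the mechanism described above concrete, here is a minimal sketch of the key-query convolution idea in PyTorch. It is not the authors' implementation: the 3x3 kernel size, the pre-softmax placement, and the simplified causal masking are assumptions for illustration, and the paper's additional mixing across heads is omitted.

```python
import math
import torch
import torch.nn.functional as F

def mta_style_attention(q, k, v, conv_kernel):
    """q, k, v: (batch, heads, seq, head_dim); conv_kernel: (heads, 1, 3, 3)."""
    B, H, S, D = q.shape
    logits = q @ k.transpose(-2, -1) / math.sqrt(D)            # (B, H, S, S)
    causal = torch.ones(S, S, dtype=torch.bool, device=q.device).tril()
    # Zero future positions before convolving so no future-key information
    # leaks into allowed positions (a simplified stand-in for careful masking).
    logits = logits.masked_fill(~causal, 0.0)
    # Depthwise 2D convolution over the (query, key) plane, one 3x3 filter per
    # head, so nearby queries and keys influence each other's logits.
    logits = F.conv2d(logits, conv_kernel, padding=1, groups=H)
    logits = logits.masked_fill(~causal, float("-inf"))        # re-apply mask
    return F.softmax(logits, dim=-1) @ v                       # (B, H, S, D)

q = k = v = torch.randn(1, 4, 16, 32)
kernel = torch.randn(4, 1, 3, 3) * 0.1
print(mta_style_attention(q, k, v, kernel).shape)  # torch.Size([1, 4, 16, 32])
```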

AI: What people are saying
The comments on the "Multi-Token Attention" paper reflect a mix of curiosity, skepticism, and technical discussion regarding the new attention mechanism.
  • Several commenters discuss the integration of convolution operations with attention mechanisms, noting its potential benefits and challenges.
  • There are concerns about the practicality and efficiency of the proposed method, especially regarding compatibility with existing optimized attention libraries.
  • Some users question the necessity of reintroducing local windows in attention, suggesting it may contradict the original purpose of addressing long-range dependencies.
  • Comparisons are made to other models, such as the Byte Latent Transformer, highlighting different approaches to attention and embedding.
  • There is a general interest in moving beyond tokenization to enhance model capabilities, with some advocating for innovative solutions in AI development.
10 comments
By @kouteiheika - 1 day
This is another potential improvement to the transformer architecture from Facebook (the other one that comes to mind is this one from the same authors: https://arxiv.org/abs/2405.18719), but note that it comes with a major problem that might not be obvious at first glance: it's just not usable in practice without a ton of work. It modifies the innards of the attention mechanism, so it is incompatible with Flash Attention (or any other optimized attention library), and you do not want to train anything beyond toy models without Flash Attention (the performance hit is just way too big).

There's pytorch's FlexAttention which could maybe make this practical, but currently it's just way too buggy.
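For context, a minimal sketch of what FlexAttention usage looks like, assuming a recent PyTorch build that ships torch.nn.attention.flex_attention; the shapes and the causal score_mod below are illustrative and not tied to MTA.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, S, D = 2, 8, 128, 64                 # illustrative shapes
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))

def causal(score, b, h, q_idx, kv_idx):
    # score_mod sees one (query, key) score at a time; MTA's convolution needs
    # neighboring scores as well, which this hook does not expose directly.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

out = flex_attention(q, k, v, score_mod=causal)   # (B, H, S, D)
```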

By @bionhoward - 2 days
How does this compare with the Byte Latent Transformer [1]? This applies convolution post-embedding, while BLT applies attention at embedding time?

1. https://ai.meta.com/research/publications/byte-latent-transf...

By @bigdict - 2 days
Sure, you can get better model performance by throwing more compute at the problem in different places. Does it improve perf on an isoflop basis?

By @bob1029 - 2 days
So, we're proposing a multiplicative increase of something that already scales quadratically with the context size?

I think we've already got a bit of a bottleneck in terms of memory bandwidth utilization.
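A rough back-of-envelope sketch of that concern; the context length, head count, and kernel size below are illustrative assumptions, not figures from the paper.

```python
# Illustrative numbers only.
n_ctx, n_heads = 32_768, 32        # assumed context length and head count
cq, ck = 6, 11                     # assumed query/key convolution kernel size

logits_per_layer = n_heads * n_ctx * n_ctx     # score matrix is already O(n^2)
extra_macs = logits_per_layer * cq * ck        # conv multiplies work on it by ~cq*ck
print(f"{logits_per_layer:.2e} logits/layer, ~{extra_macs:.2e} extra MACs/layer")
# Fused kernels like FlashAttention avoid materializing that matrix in memory,
# which is where the memory-bandwidth worry comes in.
```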

By @rakejake - 1 day
Interesting. So they convolve the k, v, q vectors? I have been trying the opposite.

I have been working on a classification problem on audio data (with context sizes somewhere between 1000 and 3000, with potential to expand later), and have been experimenting with adding attention on top of a CNN for that task.

I tried training a vanilla transformer, but at the sizes I am aiming for (5-30M parameters) the training is incredibly unstable and doesn't achieve the performance of an LSTM.

So I went back to CNNs, which are fast to train but don't achieve the losses of LSTMs (which are much slower to train, and for higher context sizes you get into the vanishing gradient problem). The CNN-GRU hybrid worked much better, giving me my best result.

The GRU layer I used had a size of 512. For increasing context sizes, I'd have to make the convolutional layers deeper so as not to let the GRU size grow too large. Instead, I decided to swap out the GRU for a MultiHeadAttention layer. The results are great - better than the CNN-GRU (my previous best). Plus, for equivalent sizes the model is faster to train, though it hogs a lot of memory.
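For readers curious what that swap looks like, here is a hedged sketch of a CNN front-end followed by a MultiheadAttention layer in place of a GRU; the layer sizes, feature dimension, pooling, and class count are assumptions, not the commenter's actual model.

```python
import torch
import torch.nn as nn

class ConvAttnClassifier(nn.Module):
    def __init__(self, n_features=64, n_classes=10, d_model=256, n_heads=4):
        super().__init__()
        # Convolutional front-end downsamples the sequence (e.g. 3000 -> 750).
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, d_model, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        # Attention layer stands in for the GRU over the downsampled sequence.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                                    # x: (batch, seq, n_features)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)     # (batch, seq', d_model)
        a, _ = self.attn(h, h, h)                            # self-attention
        h = self.norm(h + a)
        return self.head(h.mean(dim=1))                      # pooled class logits

model = ConvAttnClassifier()
logits = model(torch.randn(8, 3000, 64))
print(logits.shape)  # torch.Size([8, 10])
```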

By @fabmilo - 2 days
We have to move past tokenization for the next leap in capabilities. All this work done on tokens, especially in the RL optimization context, is just local optimization alchemy.

By @cgearhart - 2 days
Why is there an expectation that “nearby” tokens are relevant to increase the information in the similarities? That seems like it would hold true within individual words, but the whole point of attention was to solve long range dependencies. Reintroducing local windows seems like a step backwards to me.
By @curiousfiddler - 2 days
So, why would this extract more semantic meaning than multi-head attention? Isn't the whole point of multiple heads similar to how CNNs use multiple types of filters to extract different semantic relationships?
By @jwilber - 2 days
Achieved by “applying convolution operations over queries, keys and heads, allowing nearby queries and keys to affect each other's attention weights for more precise attention”

Cool to see convolutions making such a comeback lately in the LLM world. See also the recent StripedHyena 2 architecture, which uses the conv-based Hyena operator to great success:

https://arxiv.org/abs/2503.01868

By @antonkar - 2 days
There is a planet-wide, eternal, 100% safe AI solution that can be a billion-dollar startup, too:

Put all the GPUs in cloud/s controlled by international scientists (now you can use your GPU on any device, can earn money by renting it when you don’t need it, nothing changes except you need to be online to use it, but we’ll have 5G and better worldwide. You can develop, sell or release free math-proven safe AI models in this cloud “AI App Store”, etc).

Because the main risk is an AI agent botnet - current GPUs are like nukes that are 100% unprotected - any hacker can make a virus with an AI agent component just to steal money; this AI will not be aligned at all and will become a perpetual and eventually autonomous botnet.