March 1st, 2025

Infinite Retrieval: Attention enhanced LLMs in long-context processing

The paper presents InfiniRetri, a method that enhances LLMs' retrieval over effectively unbounded input lengths, achieving 100% accuracy on the Needle-In-a-Haystack (NIH) test and performance gains of up to 288% on real-world benchmarks.

The paper titled "Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing" addresses a key limitation of Large Language Models (LLMs): their fixed context window, which constrains tasks whose inputs exceed it. The authors, Xiaoju Ye, Zhichun Wang, and Jingyuan Wang, propose a novel method called InfiniRetri that uses the LLM's own attention distribution to retrieve relevant information from inputs of arbitrary length. Their experiments show that InfiniRetri achieves 100% accuracy on the Needle-In-a-Haystack (NIH) test over more than 1 million tokens using a 0.5-billion-parameter model, outperforming existing methods and larger models. The method also delivers significant gains on real-world benchmarks, with improvements of up to 288%. InfiniRetri can be applied to any Transformer-based LLM without additional training, and it reduces inference latency and computational overhead when processing long texts. The findings suggest that InfiniRetri has substantial practical potential for information retrieval built on LLMs' inherent capabilities.

- InfiniRetri enhances LLMs' retrieval capabilities for infinite-length inputs.

- The method achieves 100% accuracy in the NIH test with a 0.5B parameter model.

- It shows performance improvements of up to 288% on real-world benchmarks.

- InfiniRetri can be applied without additional training on Transformer-based LLMs.

- The approach reduces inference latency and computational overhead in long text processing.
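
For readers curious how "using the attention distribution for retrieval" might look in practice, below is a minimal sketch in Python. It is not the paper's implementation: the model name, the chunk-level scoring rule, and the retrieve() helper are illustrative assumptions, and InfiniRetri itself reportedly works at a finer granularity inside the model's own long-text inference rather than as an external scorer.

```python
# Minimal sketch of attention-as-retrieval, inspired by the idea described in the
# paper. NOT the authors' released implementation: the model name, the chunk-level
# scoring rule, and retrieve() are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: any small causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, attn_implementation="eager")
model.eval()

def score_chunk(question: str, chunk: str) -> float:
    """Score a context chunk by how much attention the question tokens pay to it."""
    c_ids = tokenizer(chunk, return_tensors="pt").input_ids
    q_ids = tokenizer(question, return_tensors="pt").input_ids
    input_ids = torch.cat([c_ids, q_ids], dim=1)      # context first, question last
    with torch.no_grad():
        out = model(input_ids, output_attentions=True)
    attn = out.attentions[-1][0]                      # last layer: [heads, seq, seq]
    q_len, c_len = q_ids.shape[1], c_ids.shape[1]
    # Attention mass flowing from question tokens (rows) to context tokens (columns).
    return attn[:, -q_len:, :c_len].sum().item() / q_len

def retrieve(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Keep only the chunks the model itself attends to most for this question."""
    return sorted(chunks, key=lambda ch: score_chunk(question, ch), reverse=True)[:top_k]

# Usage: answer with only the retrieved chunks in context, instead of the full document.
# chunks = my_paragraph_splitter(long_document)   # hypothetical splitter
# context = "\n".join(retrieve("Where is the needle hidden?", chunks))
```

Placing the question after the context mirrors how a causal decoder can only attend backwards: the question tokens' attention rows reveal which context tokens the model itself treats as relevant, which is the intuition the paper builds on.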

5 comments
By @briancleland - about 1 month
This paper highlights something that should have been obvious: prediction and retrieval are two sides of the same coin. To predict effectively, you must first identify what's relevant. What's remarkable is that a 0.5B parameter model can perform perfect retrieval over 1M tokens when its natural attention patterns are leveraged properly.

It raises an interesting question: what if we designed architectures explicitly around retrieval capabilities? Transformer architectures were designed for prediction, and retrieval emerged as a byproduct. What would an architecture optimized specifically for retrieval look like?

A lot of money has been spent on building out large-scale RAG systems. If the performance improvements promised by the paper are real, the ramifications will be huge. Exciting to see that the authors are promising to release their code - it will be fun to see how this model performs on consumer hardware.

By @vignesh865 - about 1 month
I read through the paper, and I found the insights to be excellent.

However, regarding the practical implementation, the paper assumes that the questions will be available in advance. For each question, it requires calculating attention scores between the question and the context chunks, which makes it impractical as a replacement for Retrieval-Augmented Generation (RAG). For instance, if there are 1,000 documents, each with 10 chunks, it would be infeasible to compute attention scores between 10,000 chunks and a user query every time.
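
To make the scale of that objection concrete, here is a rough back-of-the-envelope sketch (an editorial illustration, not from the paper or the comment; per-call costs such as batching and chunk length are ignored):

```python
# Cost comparison for the point above: attention-based scoring needs a model
# forward pass per (query, chunk) pair at query time, while RAG embeds chunks
# once offline and reuses them. Numbers mirror the commenter's example.

NUM_DOCS = 1_000
CHUNKS_PER_DOC = 10
NUM_CHUNKS = NUM_DOCS * CHUNKS_PER_DOC        # 10,000 chunks

# Attention-based scoring at query time: one LLM forward pass per chunk.
forward_passes_per_query = NUM_CHUNKS          # 10,000 passes per user query

# RAG: chunk embeddings are computed once, offline; each query needs only
# one embedding call plus an approximate nearest-neighbour lookup.
offline_embedding_calls = NUM_CHUNKS           # paid once, amortized across queries
online_calls_per_query = 1                     # embed the query, then ANN search

print(f"Attention scoring: {forward_passes_per_query} forward passes per query")
print(f"RAG: {online_calls_per_query} embedding call per query "
      f"(+{offline_embedding_calls} one-time chunk embeddings)")
```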

By @riddelln - about 1 month
Am I correct in thinking that RAG, or SFT, would still be needed to introduce unseen context to the model?
By @maalouli - about 1 month
Using attention for the retrieval of relevant information seems super intuitive. Only feed the model what it deems relevant. Curious about the scenarios where this mechanism misses relevant information.
By @smallnix - about 1 month
Do I understand correctly that this requires access to the internals of the LLM and cannot be used with today's models behind an API like ChatGPT or Claude?