March 1st, 2025

Infinite Retrieval: Attention enhanced LLMs in long-context processing

The paper presents InfiniRetri, a method that enhances LLMs' retrieval over effectively unbounded input lengths, achieving 100% accuracy on the Needle-In-a-Haystack (NIH) test and performance gains of up to 288% on real-world benchmarks.

The paper titled "Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing" addresses a key limitation of Large Language Models (LLMs): their fixed context window, which constrains tasks whose inputs exceed it. The authors, Xiaoju Ye, Zhichun Wang, and Jingyuan Wang, propose a novel method called InfiniRetri that uses the LLM's own attention distribution to retrieve relevant information from inputs of arbitrary length. Their experiments show that InfiniRetri achieves 100% accuracy on the Needle-In-a-Haystack (NIH) test over more than 1 million tokens using a 0.5-billion-parameter model, outperforming existing methods and larger models. The method also delivers significant gains on real-world benchmarks, with improvements of up to 288%. InfiniRetri can be applied to any Transformer-based LLM without additional training, and it reduces inference latency and computational overhead when processing long texts. The findings suggest that InfiniRetri has substantial practical potential for information retrieval built on LLMs' inherent capabilities.

- InfiniRetri enhances LLMs' retrieval capabilities for infinite-length inputs.

- The method achieves 100% accuracy in the NIH test with a 0.5B parameter model.

- It shows performance improvements of up to 288% on real-world benchmarks.

- InfiniRetri can be applied without additional training on Transformer-based LLMs.

- The approach reduces inference latency and computational overhead in long text processing.
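
For readers curious how "using the attention distribution for retrieval" might look in practice, below is a minimal sketch in Python. It is not the paper's implementation: the model name, the chunk-level scoring rule, and the retrieve() helper are illustrative assumptions, and InfiniRetri itself reportedly works at a finer granularity inside the model's own long-text inference rather than as an external scorer.

```python
# Minimal sketch of attention-as-retrieval, inspired by the idea described in the
# paper. NOT the authors' released implementation: the model name, the chunk-level
# scoring rule, and retrieve() are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: any small causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, attn_implementation="eager")
model.eval()

def score_chunk(question: str, chunk: str) -> float:
    """Score a context chunk by how much attention the question tokens pay to it."""
    c_ids = tokenizer(chunk, return_tensors="pt").input_ids
    q_ids = tokenizer(question, return_tensors="pt").input_ids
    input_ids = torch.cat([c_ids, q_ids], dim=1)      # context first, question last
    with torch.no_grad():
        out = model(input_ids, output_attentions=True)
    attn = out.attentions[-1][0]                      # last layer: [heads, seq, seq]
    q_len, c_len = q_ids.shape[1], c_ids.shape[1]
    # Attention mass flowing from question tokens (rows) to context tokens (columns).
    return attn[:, -q_len:, :c_len].sum().item() / q_len

def retrieve(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Keep only the chunks the model itself attends to most for this question."""
    return sorted(chunks, key=lambda ch: score_chunk(question, ch), reverse=True)[:top_k]

# Usage: answer with only the retrieved chunks in context, instead of the full document.
# chunks = my_paragraph_splitter(long_document)   # hypothetical splitter
# context = "\n".join(retrieve("Where is the needle hidden?", chunks))
```

Placing the question after the context mirrors how a causal decoder can only attend backwards: the question tokens' attention rows reveal which context tokens the model itself treats as relevant, which is the intuition the paper builds on.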

5 comments
By @briancleland - about 1 month
This paper highlights something that should have been obvious: prediction and retrieval are two sides of the same coin. To predict effectively, you must first identify what's relevant. What's remarkable is that a 0.5B parameter model can perform perfect retrieval over 1M tokens when its natural attention patterns are leveraged properly.

It raises an interesting question: what if we designed architectures explicitly around retrieval capabilities? Transformer architectures were designed for prediction, and retrieval emerged as a byproduct. What would an architecture optimized specifically for retrieval look like?

A lot of money has been spent on building out large-scale RAG systems. If the performance improvements promised by the paper are real, the ramifications will be huge. Exciting to see that the authors are promising to release their code - it will be fun to see how this model performs on consumer hardware.

By @vignesh865 - about 1 month
I read through the paper, and I found the insights to be excellent.

However, regarding the practical implementation, the paper assumes that the questions will be available in advance. For each question, it requires calculating attention scores between the question and the context chunks, which makes it impractical as a replacement for Retrieval-Augmented Generation (RAG). For instance, if there are 1,000 documents, each with 10 chunks, it would be infeasible to compute attention scores between 10,000 chunks and a user query every time.
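
To make the scale of that objection concrete, here is a rough back-of-the-envelope sketch (an editorial illustration, not from the paper or the comment; per-call costs such as batching and chunk length are ignored):

```python
# Cost comparison for the point above: attention-based scoring needs a model
# forward pass per (query, chunk) pair at query time, while RAG embeds chunks
# once offline and reuses them. Numbers mirror the commenter's example.

NUM_DOCS = 1_000
CHUNKS_PER_DOC = 10
NUM_CHUNKS = NUM_DOCS * CHUNKS_PER_DOC        # 10,000 chunks

# Attention-based scoring at query time: one LLM forward pass per chunk.
forward_passes_per_query = NUM_CHUNKS          # 10,000 passes per user query

# RAG: chunk embeddings are computed once, offline; each query needs only
# one embedding call plus an approximate nearest-neighbour lookup.
offline_embedding_calls = NUM_CHUNKS           # paid once, amortized across queries
online_calls_per_query = 1                     # embed the query, then ANN search

print(f"Attention scoring: {forward_passes_per_query} forward passes per query")
print(f"RAG: {online_calls_per_query} embedding call per query "
      f"(+{offline_embedding_calls} one-time chunk embeddings)")
```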

By @riddelln - about 1 month
Am I correct in thinking that RAG, or SFT, would still be needed to introduce unseen context to the model?
By @maalouli - about 1 month
Using attention for the retrieval of relevant information seems super intuitive. Only feed the model what it deems relevant. Curious about the scenarios where this mechanism misses relevant information.
By @smallnix - about 1 month
Do I understand correctly that this requires access to the internals of the LLM and cannot be used with today's models behind an API like ChatGPT or Claude?