June 29th, 2024

Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs

The study presents a method to boost Large Language Models' retrieval and reasoning abilities for long-context inputs by fine-tuning on a synthetic dataset. Results show significant improvements in information retrieval and reasoning skills.

The paper presents a method for improving how Large Language Models (LLMs) retrieve and reason over information in long-context inputs. The approach fine-tunes LLMs on a synthetic dataset of numerical key-value retrieval tasks. Experiments on GPT-3.5 Turbo and Mistral 7B show that this fine-tuning significantly improves information retrieval and reasoning in longer-context settings, and that the skills transfer from the synthetic task to evaluations on real tasks. The research also finds that fine-tuning on the synthetic data avoids the performance drops that fine-tuning on other baseline data can cause, particularly on tasks like TriviaQA. Overall, the study highlights the potential of synthetic fine-tuning data to improve long-context performance without compromising results on general benchmarks.

Related

Run the strongest open-source LLM model: Llama3 70B with just a single 4GB GPU

The article discusses the release of open-source Llama3 70B model, highlighting its performance compared to GPT-4 and Claude3 Opus. It emphasizes training enhancements, data quality, and the competition between open and closed-source models.

Delving into ChatGPT usage in academic writing through excess vocabulary

A study by Dmitry Kobak et al. examines ChatGPT's impact on academic writing, finding increased usage in PubMed abstracts. Concerns arise over accuracy and bias despite advanced text generation capabilities.

How to think about creating a dataset for LLM fine-tuning evaluation

Alex Strick van Linschoten emphasizes objective evaluation of LLM fine-tuning, focusing on accuracy, out-of-domain data, information gradations, spelling variations, and structured data tasks. He plans systematic model comparisons for performance enhancement.

Large Language Models are not a search engine

Large Language Models (LLMs) from Google and Meta generate algorithmic content, causing nonsensical "hallucinations." Companies struggle to manage errors post-generation due to factors like training data and temperature settings. LLMs aim to improve user interactions but raise skepticism about delivering factual information.

LLMs now write lots of science. Good

Large language models (LLMs) are significantly shaping scientific papers, with up to 20% of computer science abstracts and a third in China influenced by them. Debates persist on the impact of LLMs on research quality and progress.

5 comments
By @anotherpaulg - 5 months
This is really interesting. They fine-tune on instances of this sort of task:

  Do a task using the list of dictionaries below.
  Dictionary [1] {122: 765, 4548: 1475, 4818: 4782} Dictionary [2] {526: 290, 9205: 9318, 9278: 1565} ...
  Dictionary [32] {2931: 8364, 196: 1464, 812: 5363} ...
  Dictionary [85] {344: 1579, 116: 617, 330: 411}
  Above is a list of dictionaries such that each key and value is an integer. Report the
  value of key 2931 and the dictionary it is in.
  Desired answer: The value of key 2931 is 8364 and it is in Dictionary [32].

This task doesn't teach any new facts, but seems to encourage better ability to random-access data from a large context.
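
The task format quoted above is simple enough to generate programmatically. The following is a minimal sketch, not the paper's actual data pipeline: the function name `make_kv_retrieval_example` and its defaults (85 dictionaries, 3 keys each, integers below 10,000) are assumptions inferred from the quoted example.

```python
import random


def make_kv_retrieval_example(num_dicts=85, keys_per_dict=3, max_int=10_000, seed=0):
    """Build one synthetic (prompt, answer) pair in the quoted format.

    All defaults are assumptions read off the quoted example, not the
    settings used in the paper.
    """
    rng = random.Random(seed)
    # Draw keys without replacement so the target key is unambiguous.
    keys = rng.sample(range(1, max_int), num_dicts * keys_per_dict)
    dicts = []
    for i in range(num_dicts):
        chunk = keys[i * keys_per_dict:(i + 1) * keys_per_dict]
        dicts.append({k: rng.randrange(1, max_int) for k in chunk})

    # Pick the "needle": one key from one randomly chosen dictionary.
    target_idx = rng.randrange(num_dicts)
    target_key = rng.choice(list(dicts[target_idx]))
    target_value = dicts[target_idx][target_key]

    body = " ".join(f"Dictionary [{i + 1}] {d}" for i, d in enumerate(dicts))
    prompt = (
        "Do a task using the list of dictionaries below.\n"
        f"{body}\n"
        "Above is a list of dictionaries such that each key and value is an "
        f"integer. Report the value of key {target_key} and the dictionary it is in."
    )
    answer = (
        f"The value of key {target_key} is {target_value} "
        f"and it is in Dictionary [{target_idx + 1}]."
    )
    return prompt, answer
```

Calling `make_kv_retrieval_example(seed=n)` for a range of seeds would yield a reproducible set of prompt/answer pairs; raising `num_dicts` lengthens the context.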
By @dvt - 5 months
I've seen a lot of papers recently tackle the needle-in-a-haystack problem wrt LLMs, and I think this approach (and more generally, any in-context solution) is a mistake.

Imo the best way to handle this is RAG + multi-shot prompting (+ symbolic mapping to an actual data structure). For example, a pre-processing step where you partition the context by "records," another step where you insert the records (potentially splitting them up) into a RAG database, and another step where you make fuzzy queries. So, if you ask for record 1234 you get an exact match on that line (or set of lines, or record, or whatever) of the original context. And if you ask for "elephant" but there's no "elephant" in the context, you might get the "hippo" record because of the RAG reranking.

This is a lot of work, and is essentially a data pipeline, but the results are much better-curated than just fine-tuning and hoping that generalized needle-in-a-haystack search will work reliably as part of a language model.
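
A toy version of the "symbolic mapping" part of this pipeline, applied to the dictionary task from the first comment, might look like the sketch below. It is only an illustration under stated assumptions: the helper names (`partition_records`, `build_index`, `lookup`) are hypothetical, and the nearest-key fallback is a crude stand-in for the embedding-based retrieval and reranking a real RAG database would provide.

```python
import re

# Matches one record such as "Dictionary [32] {2931: 8364, 196: 1464}".
RECORD_RE = re.compile(r"Dictionary \[(\d+)\] \{([^}]*)\}")


def partition_records(context):
    """Pre-processing step: split the raw context into numbered records."""
    records = {}
    for match in RECORD_RE.finditer(context):
        items = [item for item in match.group(2).split(",") if item.strip()]
        pairs = (item.split(":") for item in items)
        records[int(match.group(1))] = {int(k): int(v) for k, v in pairs}
    return records


def build_index(records):
    """Map every key to (dictionary number, value) for exact-match lookups."""
    return {
        key: (number, value)
        for number, record in records.items()
        for key, value in record.items()
    }


def lookup(index, key):
    """Return (matched_key, (dictionary number, value), is_exact).

    Exact match first; otherwise fall back to the numerically closest key.
    The fallback is a placeholder for fuzzy retrieval, not a RAG reranker.
    """
    if not index:
        raise KeyError("index is empty")
    if key in index:
        return key, index[key], True
    nearest = min(index, key=lambda k: abs(k - key))
    return nearest, index[nearest], False
```

With an index built this way, the lookup never passes through the model's attention at all, which is the trade-off the comment argues for.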

By @kristjansson - 5 months
The comments here are kinda silly… the haystack test measures how well a model can natively attend to its entire context window. Of course a more elaborate pipeline, or a way for the model to use a shell, or whatever will easily (trivially) solve the problem.

But that’s not the point, the point is a task that’s trivial to generate and exercises 10s-100s of thousands of tokens of context in a falsifiable way.
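
The "falsifiable" part can be made concrete: because the expected answer has a fixed shape, a response is either exactly right or it is not. The sketch below assumes the answer format from the first comment; `parse_answer` and `score` are hypothetical helper names, not part of any benchmark's tooling.

```python
import re

# Assumed answer shape, taken from the example in the first comment.
ANSWER_RE = re.compile(
    r"value of key (\d+) is (\d+) and it is in Dictionary \[(\d+)\]"
)


def parse_answer(text):
    """Extract (key, value, dictionary number), or None if the shape is wrong."""
    match = ANSWER_RE.search(text)
    return tuple(int(group) for group in match.groups()) if match else None


def score(responses, expected):
    """Fraction of responses whose parsed triple equals the expected triple."""
    if not expected:
        return 0.0
    correct = 0
    for response, reference in zip(responses, expected):
        target = parse_answer(reference)
        # An unparseable response or reference never counts as correct.
        if target is not None and parse_answer(response) == target:
            correct += 1
    return correct / len(expected)
```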

By @yousif_123123 - 5 months
Haven't read the paper yet, but it looks like this can help the model's attention work better, since many real tasks end up being similar to these generic ones.

Even GPT-4 gets tripped up when there are too many exact instructions to execute on an input. That's why breaking a task into multiple steps commonly improves performance.

It's wonderful to see improvements possible on smaller models.

By @viksit - 5 months
anyone have pointers on progress in symbolic reasoning vs context forcing approaches in LLMs?