LongWriter: Unleashing 10k Word Generation from Long Context LLMs
The paper introduces AgentWrite, an agent-based pipeline that enables long context LLMs to generate coherent outputs beyond 20,000 words, and uses it to build the LongWriter-6k dataset, which scales trained models' output capacity past 10,000 words and yields state-of-the-art performance on the LongBench-Write benchmark.
The paper titled "LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs" addresses a limitation of current long context large language models (LLMs): although they can process inputs of up to 100,000 tokens, they struggle to generate outputs longer than 2,000 words. The authors, Yushi Bai and colleagues, trace this limitation to the scarcity of long-output examples in existing supervised fine-tuning (SFT) datasets. To overcome it, they introduce AgentWrite, an agent-based pipeline that breaks ultra-long generation tasks into manageable subtasks, allowing off-the-shelf LLMs to produce coherent outputs exceeding 20,000 words. Using this pipeline, they construct LongWriter-6k, a dataset of 6,000 SFT examples with output lengths ranging from 2,000 to 32,000 words. By incorporating this dataset into model training, they scale the output capacity of existing models to over 10,000 words while preserving quality. They also develop LongBench-Write, a benchmark for assessing ultra-long generation capabilities. Their 9B parameter model, further improved through direct preference optimization (DPO), achieves state-of-the-art performance on this benchmark, outperforming larger proprietary models. The findings suggest that existing long context LLMs already have the capacity for much longer outputs and only need training data that exercises it.
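To make the pipeline concrete, here is a minimal sketch of a plan-then-write loop in the spirit of AgentWrite. This is not the authors' implementation: `call_llm` is a placeholder for whatever chat-completion API is available, and the prompts are illustrative assumptions rather than the paper's actual prompts.

```python
# Minimal sketch of a plan-then-write pipeline in the spirit of AgentWrite.
# `call_llm` and all prompts below are illustrative placeholders, not the
# paper's released code.

def call_llm(prompt: str) -> str:
    """Placeholder: route `prompt` to an LLM of your choice and return its reply."""
    raise NotImplementedError("wire this to an actual model endpoint")

def plan(instruction: str) -> list[str]:
    """Step I: ask the model to split the writing task into an outline,
    one item (topic plus word budget) per line."""
    outline = call_llm(
        "Break the following writing task into a numbered outline. "
        "For each item give a short description and a word budget.\n\n"
        f"Task: {instruction}"
    )
    return [line.strip() for line in outline.splitlines() if line.strip()]

def write(instruction: str, outline: list[str]) -> str:
    """Step II: draft each planned section in order, conditioning on the
    text generated so far so that the sections stay coherent."""
    sections: list[str] = []
    for item in outline:
        prompt = "\n\n".join([
            f"Task: {instruction}",
            "Full outline:\n" + "\n".join(outline),
            "Text written so far:\n" + "\n\n".join(sections),
            f"Now write only the section described by: {item}",
        ])
        sections.append(call_llm(prompt))
    return "\n\n".join(sections)

def agent_write(instruction: str) -> str:
    """Plan first, then write the sections sequentially."""
    return write(instruction, plan(instruction))
```

The design point to note is in `write`: each subtask sees the full outline and everything drafted so far, which helps keep a 20,000-word output coherent even though it is generated piecewise.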
- Long context LLMs struggle to generate outputs longer than 2,000 words because existing SFT datasets lack long-output examples.
- AgentWrite decomposes ultra-long generation tasks into subtasks, enabling coherent outputs over 20,000 words.
- The LongWriter-6k dataset contains 6,000 SFT examples with output lengths from 2,000 to 32,000 words.
- The 9B parameter model, trained on this data and refined with DPO, achieves state-of-the-art performance on the new LongBench-Write benchmark (see the scoring sketch after this list).
- Training data that includes long outputs is enough to unlock much longer generation in existing long context LLMs.
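As an illustration of how a benchmark like LongBench-Write might score adherence to a requested output length, below is a simple stand-in metric. The exact scoring function from the paper is not reproduced here, so treat this formula as an assumption: full credit when the output hits the requested length, decaying linearly with relative deviation.

```python
# A plausible length-adherence score for a LongBench-Write-style benchmark.
# This is an illustrative stand-in, not the paper's exact formula.

def length_score(required_words: int, produced_words: int) -> float:
    """Return a score in [0, 100]: 100 when the output matches the required
    length exactly, falling to 0 once it deviates by 100% or more."""
    if required_words <= 0:
        raise ValueError("required_words must be positive")
    deviation = abs(produced_words - required_words) / required_words
    return 100.0 * max(0.0, 1.0 - deviation)

# Example: a 10,000-word request answered with 8,500 words scores 85.0.
print(length_score(10_000, 8_500))
```

A metric of this shape rewards models for honoring the length constraint itself, separately from any judgment of writing quality.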