September 3rd, 2024

Smaller, Weaker, yet Better: Training LLM Reasoners via Compute-Optimal Sampling

The study compares weaker and stronger language models for generating synthetic reasoning data at a fixed compute budget, finding that weaker models offer better coverage and diversity, which leads to improved performance in fine-tuned language models.

The paper titled "Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling" explores the effectiveness of using weaker language models (LMs) for generating synthetic data to improve reasoning performance in LMs. The authors, Hritik Bansal and colleagues, investigate whether training on data from a stronger but more expensive model (SE) is more beneficial than using a weaker but cheaper model (WC) under a fixed inference budget. Their research evaluates the generated data based on coverage, diversity, and false positive rates. The findings indicate that while WC models may produce data with higher coverage and diversity, they also have a higher false positive rate. However, LMs fine-tuned on data generated by WC models consistently outperform those trained on SE-generated data across various benchmarks. This challenges the conventional reliance on SE models for synthetic data generation, suggesting that WC models may be a more compute-optimal choice for training advanced reasoning capabilities in LMs.
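
For intuition, the sketch below (not from the paper) shows how such data-quality metrics could be computed over a set of sampled solutions per problem; the per-sample fields `final_answer` and `reasoning`, and the `reasoning_is_valid` judge, are hypothetical stand-ins for the paper's actual evaluation pipeline.

```python
# Sketch of the three data-quality metrics described above, assuming each
# problem carries a gold answer and a list of sampled solutions.
# The fields "final_answer" and "reasoning" are hypothetical.

def coverage(problems):
    """Fraction of problems solved by at least one sample (pass@k-style)."""
    solved = sum(
        any(s["final_answer"] == p["gold_answer"] for s in p["samples"])
        for p in problems
    )
    return solved / len(problems)

def diversity(problems):
    """Average number of distinct correct solutions per problem."""
    total = 0
    for p in problems:
        distinct_correct = {
            s["reasoning"]
            for s in p["samples"]
            if s["final_answer"] == p["gold_answer"]
        }
        total += len(distinct_correct)
    return total / len(problems)

def false_positive_rate(problems, reasoning_is_valid):
    """Share of answer-correct samples whose reasoning is judged wrong.
    `reasoning_is_valid` would be a human or LLM judge in practice."""
    correct = bad_reasoning = 0
    for p in problems:
        for s in p["samples"]:
            if s["final_answer"] == p["gold_answer"]:
                correct += 1
                if not reasoning_is_valid(s["reasoning"]):
                    bad_reasoning += 1
    return bad_reasoning / max(correct, 1)
```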

- The study compares the effectiveness of weaker versus stronger language models in generating synthetic training data.

- Weaker models (WC) may provide better coverage and diversity in data, despite higher false positive rates.

- Fine-tuning on WC-generated data leads to better performance in LMs compared to SE-generated data.

- The findings suggest a shift in practice towards using weaker models for training advanced reasoning capabilities in LMs.

4 comments
By @djoldman - 6 months
I found this to be the key quote:

> Since the 9B model is roughly 3 times smaller than the 27B model, at a fixed sampling compute budget we can sample 3× more sample solutions per problem for Gemma2-9B.

Essentially, at the same compute budget, training on three samples from the weaker LLM beats training on one sample from the stronger LLM.
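
For concreteness, here is a minimal sketch of that compute-matched accounting, assuming (as the quote implies) that sampling cost per token scales linearly with parameter count; the per-problem budget of 10 SE samples is a hypothetical number for illustration.

```python
# Compute-matched sampling: if sampling cost per token scales linearly with
# parameter count, a fixed FLOPs budget buys (P_SE / P_WC) times more
# samples from the weaker model.
P_SE = 27e9  # parameters of the stronger, more expensive model (Gemma2-27B)
P_WC = 9e9   # parameters of the weaker, cheaper model (Gemma2-9B)

samples_per_problem_se = 10                 # hypothetical SE sampling budget
ratio = P_SE / P_WC                         # = 3.0
samples_per_problem_wc = round(samples_per_problem_se * ratio)  # = 30

print(f"Same compute: {samples_per_problem_se} solutions/problem from the "
      f"27B model vs {samples_per_problem_wc} from the 9B model")
```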

By @barelyauser - 6 months
I find that researchers' choice of names for the sake of differentiation is more of a barrier than a help. Sometimes it feels like I know nothing, when in reality it is the name of the "technique" or phenomenon that my brain fails to parse.

Things like "Compute-Optimal Sampling" sound just like any other made-up gibberish that may or may not exist. Phrases like "memory-centric subsampling", "search-based hyperspace modeling", or "locally induced entropy optimization" simply don't get parsed. And more often than not, after reading such papers, I find that the term is a fancy name for something a toddler already knows. Really disappointing.

By @johndough - 6 months
I thought the malpractice of starting bar graphs at non-zero values was reserved for dishonest publication formats such as newspapers and CPU benchmarks. I did not expect it from a scientific paper by Google DeepMind.