Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling
The study compares weaker and stronger language models for generating synthetic reasoning data, finding that at a fixed compute budget weaker models offer better coverage and diversity, and that fine-tuning on their data yields better-performing language models.
The paper titled "Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling" explores the effectiveness of using weaker language models (LMs) to generate synthetic data for improving reasoning performance. The authors, Hritik Bansal and colleagues, investigate whether training on data from a stronger but more expensive model (SE) is more beneficial than training on data from a weaker but cheaper model (WC) under a fixed inference budget. They evaluate the generated data on coverage, diversity, and false positive rate. The findings indicate that while WC models produce data with higher coverage and diversity, that data also has a higher false positive rate. Even so, LMs fine-tuned on WC-generated data consistently outperform those trained on SE-generated data across multiple benchmarks. This challenges the conventional reliance on SE models for synthetic data generation and suggests that WC models may be the more compute-optimal choice for training advanced reasoning capabilities in LMs.
- The study compares the effectiveness of weaker versus stronger language models in generating synthetic training data.
- Weaker models (WC) provide better coverage and diversity in the generated data, despite higher false positive rates.
- Fine-tuning on WC-generated data leads to better performance than fine-tuning on SE-generated data at a matched compute budget.
- The findings suggest a shift in practice towards using weaker models for training advanced reasoning capabilities in LMs.
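As a rough illustration of the three data-quality metrics named above, the sketch below shows one way they could be computed over per-problem sampled solutions. This is a hedged, minimal sketch rather than the paper's evaluation code: the function name and the two correctness checks (`is_final_answer_correct`, `is_reasoning_correct`) are placeholders I introduce for illustration, and the paper assesses reasoning correctness with human/LLM judgments rather than a simple callable.

```python
# Illustrative sketch (not the paper's code) of coverage, diversity, and
# false positive rate over sampled candidate solutions per problem.
from typing import Callable

def data_quality_metrics(
    solutions: dict[str, list[str]],                    # problem id -> sampled solutions
    is_final_answer_correct: Callable[[str, str], bool],
    is_reasoning_correct: Callable[[str, str], bool],   # placeholder judge
) -> dict[str, float]:
    covered = 0            # problems with at least one correct-answer solution
    unique_correct = 0     # unique correct solutions (diversity)
    false_positives = 0    # correct final answer but flawed reasoning
    total_correct = 0
    for pid, cands in solutions.items():
        correct = [s for s in cands if is_final_answer_correct(pid, s)]
        covered += bool(correct)
        unique_correct += len(set(correct))
        total_correct += len(correct)
        false_positives += sum(not is_reasoning_correct(pid, s) for s in correct)
    n = max(len(solutions), 1)
    return {
        "coverage": covered / n,
        "diversity": unique_correct / n,
        "false_positive_rate": false_positives / max(total_correct, 1),
    }
```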
Related
> Since the 9B model is roughly 3 times smaller than the 27B model, at a fixed sampling compute budget we can sample 3× more sample solutions per problem for Gemma2-9B.
Essentially, at the same sampling compute, training on 3× as many solutions from the weaker LLM works better than training on a third as many solutions from the stronger one.
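The arithmetic behind that quote can be sketched as follows. This is a minimal illustration, assuming decoder sampling FLOPs per generated token scale roughly linearly with parameter count; the function name is mine, and the model sizes are the Gemma2 pair discussed above.

```python
# Minimal sketch of compute-matched sampling: at a fixed sampling FLOPs budget,
# the affordable number of solutions per problem scales inversely with model size.

def compute_matched_samples(samples_se: int, params_se: float, params_wc: float) -> int:
    """Solutions per problem the weaker/cheaper (WC) model can afford for the same
    budget that buys `samples_se` solutions from the stronger/expensive (SE) model."""
    return int(samples_se * params_se / params_wc)

# Gemma2-27B (SE) vs Gemma2-9B (WC): one SE sample buys roughly three WC samples.
print(compute_matched_samples(samples_se=1, params_se=27e9, params_wc=9e9))  # -> 3
```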
Things like "Compute-Optimal Sampling" sound just like any other made up gibberish that may or may not exist. Wordings like "memory-centric subsampling", "search based hyper space modeling", "locally induced entropy optimization" don't get parsed. And more often than not after reading such papers, I've come to find out that it is a fancy name for something a toddler knows about. Really disappointing.