October 14th, 2024

DeepSeek: Advancing theorem proving in LLMs through large-scale synthetic data

The paper presents a method to enhance theorem proving in large language models by generating large-scale synthetic proof data. The fine-tuned DeepSeekMath 7B model outperformed GPT-4, proving five problems on a formalized Olympiad benchmark where GPT-4 proved none.

The paper titled "DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data" presents a novel approach to enhancing theorem-proving capabilities in large language models (LLMs) by generating extensive synthetic proof data. The authors, including Huajian Xin and eight others, focus on the limitations of current LLMs in formal theorem proving due to insufficient training data. They propose a method to create a large dataset of Lean 4 proof data derived from high-school and undergraduate mathematical competition problems. This involves translating natural language problems into formal statements, filtering low-quality outputs, and generating proofs. The DeepSeekMath 7B model was fine-tuned on this dataset, which contains 8 million formal statements with proofs. The results showed significant improvements in proof generation accuracy, achieving 46.3% with 64 samples and 52% cumulatively on the Lean 4 miniF2F test, outperforming the baseline GPT-4 model. Additionally, the model successfully proved five problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 could not prove any. The authors emphasize the potential of large-scale synthetic data in advancing theorem proving in LLMs and plan to make both the dataset and model publicly available for further research.
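
For concreteness, a statement/proof pair in such a Lean 4 dataset might look like the sketch below. This is an illustrative example under assumed conventions, not one taken from the paper; the theorem name and statement are invented.

```lean
import Mathlib

-- Hypothetical competition-style problem, formalized as a Lean 4
-- theorem whose proof the compiler can check mechanically.
theorem invented_example (a b : ℝ) (h₁ : a + b = 2) (h₂ : a - b = 0) :
    a = 1 := by
  linarith
```

Because every such pair is machine-checkable, low-quality model outputs can be rejected automatically before they enter the training set.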

- The paper introduces a method for generating synthetic proof data to improve theorem proving in LLMs.

- The DeepSeekMath 7B model achieved higher proof generation accuracy compared to GPT-4.

- The dataset includes 8 million formal statements with proofs derived from mathematical competition problems.

- The model successfully proved five problems in a formal benchmark, demonstrating its effectiveness.

- Both the synthetic dataset and the model will be made available for further research.

Related

Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs

The study presents a method to boost Large Language Models' retrieval and reasoning abilities for long-context inputs by fine-tuning on a synthetic dataset. Results show significant improvements in information retrieval and reasoning skills.

LeanDojo: Theorem Proving in Lean Using Language Models

LeanDojo enhances theorem proving in Lean with retrieval-augmented language models, introducing Lean Copilot for tactic suggestions and proof searches. It includes benchmarks and enables interactive proof state observation, improving automation and discovering new proofs.

Prover-Verifier Games improve legibility of LLM outputs

The paper discusses improving the legibility of Large Language Model outputs through a training algorithm inspired by Prover-Verifier Games, enhancing both solution accuracy and human verification capabilities.

Can Large Language Models Understand Symbolic Graphics Programs?

The study evaluates large language models' understanding of symbolic graphics programs, introducing a benchmark and Symbolic Instruction Tuning to enhance reasoning and instruction-following capabilities in visual content comprehension.

Smaller, Weaker, yet Better: Training LLM Reasoners via Compute-Optimal Sampling

The study compares weaker and stronger language models for generating synthetic data, finding that weaker models offer better coverage and diversity, leading to improved performance in fine-tuned language models.

9 comments
By @youoy - 4 months
From the abstract:

> Proof assistants like Lean have revolutionized mathematical proof verification...
>
> Both the synthetic dataset and the model will be made available to facilitate further research in this promising field.

Please don't use "revolutionised" or "promising" in a scientific abstract... If it has revolutionised something, tell me instead what made that happen. If it is promising, tell me instead why.

Apart from that, the second sentence says:

> Although large language models (LLMs) show promise in mathematical reasoning, their advancement in formal theorem proving is hindered by a lack of training data.

Is it really? Or is it hindered by an unsuitable model architecture? Because you could perfectly well say that Lean itself is an AI capable of theorem proving... I feel that sentence should be the potential conclusion of the paper, not its opening observation. Something like: "we used more training data and it works better; if we scale theorem-proving training data, we can get the performance we want, so training data really is what hinders the advancement of formal theorem proving in LLMs".

By @maxrmk - 4 months
There's a newer version of this model that takes a really cool RL based approach: https://arxiv.org/pdf/2408.08152
By @_flux - 4 months
This must be one of the best applications for LLMs, since you can always verify the results automatically and reject the ones that fail, right?
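
A minimal sketch of that verify-or-reject loop, assuming a `generate` callable that wraps the LLM and the Lean command-line checker; the helper names are hypothetical, and a real setup would need a Lean project with mathlib on the path:

```python
import subprocess
import tempfile

def lean_accepts(proof_source: str) -> bool:
    """Return True iff the Lean checker accepts the candidate proof."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(proof_source)
        path = f.name
    # `lean` exits nonzero if the file fails to elaborate/type-check.
    result = subprocess.run(["lean", path], capture_output=True)
    return result.returncode == 0

def verified_samples(statement: str, generate, n: int = 64) -> list[str]:
    """Sample n candidate proofs and keep only those that verify."""
    kept = []
    for _ in range(n):
        candidate = generate(statement)  # LLM call, assumed given
        if lean_accepts(candidate):
            kept.append(candidate)
    return kept
```
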
By @aabhay - 4 months
The ability to use automatic verification + synthetic data is basically common knowledge among practitioners. But all these organizations have also explored endlessly the different ways to overfit on such data and the conclusion is the same -- the current model architecture seems to plateau when it comes to multi-step logical reasoning. You either drift from your common knowledge pre-training too far or you never come up with the right steps in instances where there's a vast design space.

Think -- why has nobody been able to make an LLM play Go better than AlphaZero while still retaining language capabilities? It certainly would have orders of magnitude more parameters.

By @uptownfunk - 4 months
Solutions to major mathematics problems generally require the creation of entirely new objects and the appropriate machinery to analyze them. The behavior of said objects generally ends up providing the solution to the original question as a special case. A new architecture is required but one that is probably not far off from the original transformer in nature.
By @whyowhy3484939 - 4 months
"Suppose you try to construct a coherent, ordered, natural world with no resource other than repeated exposure to things, and the formation of certain associative bonds. Oh, please!"

This is prof. Robinson on Kantian philosophy - check out Oxford podcasts by the way - and this quote is meant to imply that building a coherent world out of raw sensory data and statistics alone is completely and utterly impractical if not outright impossible. While I don't think he meant to refer to any kind of AI, in my mind this description also aptly describes the general method of DL neural networks. Repeated exposure to find correlation.

How does one find order through associativity alone? With AI this is not an academic problem anymore. This has become practical. Kant says it is impossible, not just unlikely.

The Kantian project and the various core issues it tries to address seems readily applicable to AI research yet I see very little mention of it. Perhaps I am just dumb though. Building a mind capable of taming tremendous sensory flux needs to, at the very least, take note of the (many) fundamental issues he raised. Issues I feel are not at all trivial to set aside. I feel we are stuck in Hume's empiricist reasoning and have yet to graduate to Kant and beyond.

Are we now somehow convinced yet again that causality and reasoning will, in fact, after all spontaneously emerge out of pure chaos? Didn't we settle the impossibility of this a few hundred years ago?

By @Havoc - 4 months
Surprised by the negativity here. A 7B class model +- doubles gpt4 score and everyone goes “meh”?!?
By @tikkun - 4 months
[Submitted on 23 May 2024]