AI solves IMO problems at silver medal level
Google DeepMind's AI systems, AlphaProof and AlphaGeometry 2, solved four out of six International Mathematical Olympiad problems, performing at the level of a silver medalist and marking a significant milestone in AI mathematical reasoning.
Google DeepMind has announced that its AI systems, AlphaProof and AlphaGeometry 2, have achieved a significant milestone by solving four out of six problems from the International Mathematical Olympiad (IMO), reaching a performance level equivalent to a silver medalist. This accomplishment marks the first time AI has matched this standard in the prestigious competition, which has been held annually since 1959 and is recognized as a benchmark for advanced mathematical reasoning capabilities in AI.
AlphaProof utilizes a reinforcement-learning approach to formal mathematical reasoning, while AlphaGeometry 2 is an enhanced version of a geometry-solving system. The AI systems were evaluated by prominent mathematicians, including Prof. Sir Timothy Gowers, who noted the impressive nature of the solutions generated. The problems were translated into formal mathematical language, allowing the AI to process and solve them effectively. AlphaProof successfully tackled two algebra problems and one number theory problem, including the most challenging problem of the competition, while AlphaGeometry 2 solved the geometry problem.
The systems scored a total of 28 points out of a possible 42, just shy of the gold medal threshold. This achievement highlights the potential of AI in mathematical reasoning and its ability to assist mathematicians in exploring complex problems and proofs. DeepMind plans to continue developing these systems and exploring various AI approaches to enhance mathematical reasoning further.
Related
Francois Chollet – LLMs won't lead to AGI – $1M Prize to find solution [video]
The video discusses limitations of large language models in AI, emphasizing genuine understanding and problem-solving skills. A prize incentivizes AI systems showcasing these abilities. Adaptability and knowledge acquisition are highlighted as crucial for true intelligence.
AI can beat real university students in exams, study suggests
A study from the University of Reading reveals AI outperforms real students in exams. AI-generated answers scored higher, raising concerns about cheating. Researchers urge educators to address AI's impact on assessments.
AI Revolutionized Protein Science, but Didn't End It
Artificial intelligence, exemplified by AlphaFold2 and its successor AlphaFold3, revolutionized protein science by predicting structures accurately. AI complements but doesn't replace traditional methods, emphasizing collaboration for deeper insights.
"Superhuman" Go AIs still have trouble defending against these simple exploits
Researchers at MIT and FAR AI found vulnerabilities in top AI Go algorithms, allowing humans to defeat AI with unorthodox strategies. Efforts to improve defenses show limited success, highlighting challenges in creating robust AI systems.
Can the New Mathstral LLM Accurately Compare 9.11 and 9.9?
Mathstral is a new 7B model by Mistral AI for math reasoning, with a 32k context window and Apache 2.0 license. It aims to improve common sense in math problem-solving, deployable locally with LlamaEdge and shareable via GaiaNet for customization and integration.
- Many commenters express skepticism about the AI's performance, noting that the problems were manually translated into formal language, which significantly alters the nature of the challenge.
- There is a discussion about the time taken by the AI to solve problems compared to human competitors, with some arguing that if humans had the same time, they could perform better.
- Several users highlight the need for more transparency regarding the AI's reasoning process and the nature of its solutions, questioning whether it truly understands the problems or is merely brute-forcing answers.
- Some commenters emphasize the potential of AI in mathematics, suggesting that it could revolutionize the field, while others caution against overhyping its current capabilities.
- Concerns are raised about the environmental impact of training such large models and whether the resources spent justify the results achieved.
> First, the problems were manually translated into formal mathematical language for our systems to understand.
The non-geometry problems which were solved were all of the form "Determine all X such that…", and the resulting theorem statements are all of the form "We show that the set of all X is {foo}". The downloadable solutions from https://storage.googleapis.com/deepmind-media/DeepMind.com/B... don't make it clear whether the set {foo} was decided by a human during this translation step, or whether the computer found it. I want to believe that the computer found it, but I can't find anything to confirm. Anyone know?
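For concreteness, a hypothetical Lean 4 statement of that shape (a toy example, not one of the actual 2024 problems) looks like this; the singleton {1} plays the role of {foo}, and whoever supplies that set has already done nontrivial work:

```lean
import Mathlib

-- Toy "determine all n" statement: the answer set {1} is baked into the theorem,
-- so the open question is whether a human or the system chose it.
example : {n : ℕ | n + n = 2} = {1} := by
  ext n
  simp only [Set.mem_setOf_eq, Set.mem_singleton_iff]
  omega
```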
That means that "AI solves IMO problems better than 75% of the students", which is probably even more impressive.
But, "minutes for one problem and up to 3 days for each remaining problem" means that this is unfortunately not a true representation either. If these students were given up to 15 days (5 problems at "up to 3 days each") instead of 9h, there would probably be more of them that match or beat this score too.
It really sounds like AI solved only a single problem in the 9h students get, so it certainly would not be even close to the medals. What's the need to taint the impressive result with apples-to-oranges comparison?
Why not be more objective and report that it took longer but was able to solve X% of problems (or scored X out of N points)?
This is important for more than Math problems. Making ML models wrestle with proof systems is a good way to avoid bullshit in general.
Hopefully more humans will write types in Lean and similar systems as a way of writing prompts.
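As a tiny sketch of what "types as prompts" could look like (a hypothetical example using Mathlib's sq_nonneg and add_nonneg): the signature is the specification, and any term that type-checks against it is a machine-verified answer.

```lean
import Mathlib

-- The statement is the "prompt"; the proof term is the verified "completion".
theorem sum_of_squares_nonneg (a b : ℤ) : 0 ≤ a ^ 2 + b ^ 2 :=
  add_nonneg (sq_nonneg a) (sq_nonneg b)
```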
Yes, it is true, but getting onto a country-specific team is itself an arduous journey, involving brutal winnowing at every step, e.g. the regional math olympiad, then the national math olympiad, etc.
This is then followed by further training specifically meant for this elite bunch, and maybe further eliminations.
Suffice it to say that qualifying for a country-specific team is, IMHO, a big deal. Getting a gold/silver from among them is just plain awesome!
Yet no one cares. Everyone's busy watching Magnus Carlsen.
We are human. This means we care about what other humans do. We only care about machines insofar as it serves us.
This principle is broadly extensible to work and art. Humans will always have a place in these realms as long as humans are around.
IMHO, the largest contributors to AlphaProof were the people behind Lean and Mathlib, who took upon themselves the daunting task of formalizing the entirety of mathematics.
This lack of formalization in math papers is what killed earlier attempts at automation, because AI researchers had to wrestle with the human element of figuring out each author's own notation, implicit knowledge, skipped proof steps...
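To give a sense of what Mathlib contributes (an illustrative snippet, not taken from AlphaProof itself): results like Euclid's theorem on primes are already stated and proved in the library, so a prover can invoke them directly instead of reverse-engineering notation and skipped steps from a paper.

```lean
import Mathlib

-- Mathlib already formalizes Euclid's theorem on the infinitude of primes:
#check @Nat.exists_infinite_primes   -- ∀ (n : ℕ), ∃ p, n ≤ p ∧ p.Prime

-- Using it off the shelf: there is a prime at least as large as any given bound.
example (n : ℕ) : ∃ p, n ≤ p ∧ Nat.Prime p := Nat.exists_infinite_primes n
```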
Note that the 6th question is generally the hardest (“final boss”) and many top performers couldn’t solve it.
I don’t know what Lean is or how to see the AI’s proofs, but an AI system that could explain such a question on par with the YouTuber above would be fantastic!
"... We'll be bringing all the goodness of AlphaProof and AlphaGeometry 2 to our mainstream #Gemini models very soon. Watch this space!" -- Demis Hassabis, CEO of Google DeepMind. https://x.com/demishassabis/status/1816499055880437909
Three days is interesting... Not technically silver medal performance I guess, but let's be real I'd be okay waiting a month for the cure to cancer.
Which means these problems are trivial to solve if you have a computer - you can simply check all possibilities. And that is precisely the reason why calculators aren't allowed.
But exhaustive searches are not feasible by hand in the time span the problems are supposed to be solved - roughly 30 minutes per problem. You are not supposed to use brute force, but recognize a key insight which simplifies the problem. And I believe even if you did do an exhaustive search, simply giving the answer is not enough for full points. You would have to give adequate justification.
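To illustrate the distinction with a made-up property (nothing to do with the actual problems): a machine can enumerate small cases in milliseconds to conjecture the answer set, but that guess is not yet a proof covering all cases.

```lean
-- Hypothetical "determine all n with property P": enumeration suggests an answer...
def P (n : Nat) : Bool := n % 3 == 0 && n % 4 == 0

#eval (List.range 50).filter P   -- [0, 12, 24, 36, 48], i.e. multiples of 12

-- ...but the general claim "P n = true ↔ 12 ∣ n" still needs a proof, not a table of cases.
```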
Since I think you need an account to read threads now, here's a transcript:
Google DeepMind have produced a program that in a certain sense has achieved a silver-medal performance at this year's International Mathematical Olympiad.
It did this by solving four of the six problems completely, which got it 28 points out of a possible total of 42. I'm not quite sure, but I think that put it ahead of all but around 60 competitors.
However, that statement needs a bit of qualifying.
The main qualification is that the program needed a lot longer than the human competitors -- for some of the problems over 60 hours -- and of course much faster processing speed than the poor old human brain.
If the human competitors had been allowed that sort of time per problem they would undoubtedly have scored higher.
Nevertheless, (i) this is well beyond what automatic theorem provers could do before, and (ii) these times are likely to come down as efficiency gains are made.
Another qualification is that the problems were manually translated into the proof assistant Lean, and only then did the program get to work. But the essential mathematics was done by the program: just the autoformalization part was done by humans.
As with AlphaGo, the program learnt to do what it did by teaching itself. But for that it needed a big collection of problems to work on. They achieved that in an interesting way: they took a huge database of IMO-type problems and got a large language model to formalize them.
However, LLMs are not able to autoformalize reliably, so they got them to autoformalize each problem many times. Some of the formalizations were correct, but even the incorrect ones were useful as training data, as often they were easier problems.
It's not clear what the implications of this are for mathematical research. Since the method used was very general, there would seem to be no obvious obstacle to adapting it to other mathematical domains, apart perhaps from insufficient data.
So we might be close to having a program that would enable mathematicians to get answers to a wide range of questions, provided those questions weren't too difficult -- the kind of thing one can do in a couple of hours.
That would be massively useful as a research tool, even if it wasn't itself capable of solving open problems.
Are we close to the point where mathematicians are redundant? It's hard to say. I would guess that we're still a breakthrough or two short of that.
It will be interesting to see how the time the program takes scales as the difficulty of the problems it solves increases. If it scales with a similar ratio to that of a human mathematician, then we might have to get worried.
But if the function human time taken --> computer time taken grows a lot faster than linearly, then more AI work will be needed.
The fact that the program takes as long as it does suggests that it hasn't "solved mathematics".
However, what it does is way beyond what a pure brute-force search would be capable of, so there is clearly something interesting going on when it operates. We'll all have to watch this space.
I wonder if this new model could be integrated with an LLM somehow? I get the feeling that combining those two powers would result in a fairly capable programmer.
Also, perhaps an LLM could do the translation step that is currently manual?
Also, it's making me think that in 5-10 years almost all tasks involving computer scientists or mathematicians will be done by AI. Perhaps people going into trades had a point.
Is it clear whether the algorithm is actually learning from why previously attempted solutions failed to prove out, or is it statistically generating potential answers similar to an LLM and then trying to apply reasoning to prove out the potential solution?
If the data is synthetic and covers a limited class of problems I would imagine what it's doing mostly reduces to some basic search pattern heuristics which would be of more value to understand than just being told it can solve a few problems in three days.
Wonder what "great promise" entails. Because it's hard to imagine Gemini and other transformer-based models solving these problems with reasonable accuracy, as there is no elimination of hallucination. At least in the generally available products.
I know speed is just a matter of engineering, but looks like we still have a ways to go. Hold the gong...
(It's not my main point, but it's always worth remembering - even aside from any AI context - that many top mathematicians can't do IMO-type problems, and many top IMO medalists turn out to be unable to solve actual problems in research mathematics. IMO problems are generally regarded as somewhat niche.)
I wonder if some class of problems will emerge that human competitors are able to solve but are particularly tricky for machines. And which characteristics these problems will have (e.g. they'll require some sort of intuition or visualization that is not easily formalized).
Given how much of a dent LLM's are already making on beginner competitions (AtCoder recently banned using them on ABC rounds [1]), I can't help but think that soon these competitions will be very different.
Anyone know any details?
>because of limitations in reasoning skills and training data
One would assume that mathematical literature and training data would be abundant. Is there a simple example that could help appreciate the Gemini bridge layer mentioned in the blog which produces the input for RL in Lean?
I wonder because on one hand they seem very impressive and groundbreaking, on the other it’s hard to imagine why more than a handful of researchers would use them
For example, suppose a computer is asked to prove the sum of two even numbers is an even number. It could pull up its list of “things it knows about even numbers”, namely that an even number modulo 2 is 0. Assuming the first number is “a” and the second is “b”, then it knows a=2x and b=2y for some x and y. It then knows via the distributive property that the sum is 2(x+y), which satisfies the definition of an even number.
What am I missing that makes this problem so much harder than applying a finite and known set of axioms and manipulations?
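For what it's worth, that exact argument is only a few lines in Lean 4 (a sketch of the a = 2x, b = 2y reasoning above, using Mathlib's ring tactic):

```lean
import Mathlib

-- Sum of two even integers is even, following the a = 2x, b = 2y argument.
example (a b : ℤ) (ha : ∃ x, a = 2 * x) (hb : ∃ y, b = 2 * y) :
    ∃ z, a + b = 2 * z := by
  obtain ⟨x, hx⟩ := ha
  obtain ⟨y, hy⟩ := hb
  exact ⟨x + y, by rw [hx, hy]; ring⟩
```

Statements at this level really are mechanical; the difficulty with olympiad problems appears to be that the space of applicable lemmas and intermediate goals is enormously larger, so the hard part is the search, not the bookkeeping.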
That is more than half the work of solving them. Headline should read "AI solves the simple part of each IMO problem at silver medal level".
Real, exact quotes from the top comments at 1 PM EST.
"I want to believe that the computer found it, but I can't find anything to confirm."
"Curing cancer will require new ideas"
"Maybe they used 10% of all of GCP [Google compute]"
So it failed at the first step (comprehension) and hence I think we can request a better effort next time.
(probably confused by version numbers)
IMO contestants aren't allowed to bring in paper tables, much less a whole theorem prover. They're given two 4.5 hour sessions (9 hours total) to solve all the problems with nothing but pencils, rulers, and compasses [0].
This model, meanwhile, was wired up to a theorem prover and took three solid days to solve the problems. The article is extremely light on details, but I'm assuming that most of that time was guess-and-check: feed the theorem prover a possible answer, get feedback, adjust accordingly.
If the IMO contestants were given a theorem prover and three days (even counting breaks for sleeping and eating!), how would AlphaProof have ranked?
Don't get me wrong, this is a fun project and an exciting result, but their comparison to silver medalists at the IMO is just feeding into the excessive hype around AI, not accurately representing its current state relative to humanity.
[0] 5.1 and 5.4 in the regulations: https://www.imo-official.org/documents/RegulationsIMO.pdf
In 2016, machines defeated a World Go Champion for the first time, using a clever form of "dumb search" that leverages compute, DNNs, reinforcement learning (RL), and self-play. Critics noted that while this fancy form of "dumb search" worked for Go, it might not necessarily be a general strategy applicable to other cognitive tasks.[a]
In 2024, machines solved insanely hard math problems at the Silver Medal level in an International Math Olympiad for the first time, using a clever form of "dumb search" that leverages compute, DNNs, RL, and a formal language. Perhaps "dumb search" over cleverly pruned spaces isn't as dumb as the critics would like it to be?
---
[a] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
> In the official competition, students submit answers in two sessions of 4.5 hours each. Our systems solved one problem within minutes and took up to three days to solve the others.
Why not compare with students who are given 3 days to submit an answer ? /s
Not open-sourcing anything.
This is a dead end on which no further research can be built.
It violates pretty much every principle of incremental improvement on which science is based. It's here just for hype, and the 300+ comments prove it.
Got to be kidding me. We are fucked
I don't really find this that impressive. With enough compute you could just do n-of-10,000 LLM generations to "brute force" a difficult problem and you'll get there eventually.
Sounds perfect for a GPT model, with lots of input training data (problem books and solutions).
The proofs of these problems aren't interesting. They were already known before the AI started work.
What's interesting is how the AI found the proof. The only answer we have is "slurped data into a neural network, matched patterns, and did some brute search".
What were the ideas it brainstormed? What were the dead-end paths? What were the "activations" where the problem seemed similar to a certain piece of input, which led to a guess of a step in the solution?
And how much CO2 was released into Earth's atmosphere?