AI solves IMO problems at silver medal level
Google DeepMind's AI systems, AlphaProof and AlphaGeometry 2, solved four out of six International Mathematical Olympiad problems, performing at the level of a silver medalist and marking a significant milestone in AI mathematical reasoning.
Google DeepMind has announced that its AI systems, AlphaProof and AlphaGeometry 2, have achieved a significant milestone by solving four out of six problems from the International Mathematical Olympiad (IMO), reaching a performance level equivalent to a silver medalist. This accomplishment marks the first time AI has matched this standard in the prestigious competition, which has been held annually since 1959 and is recognized as a benchmark for advanced mathematical reasoning capabilities in AI.
AlphaProof utilizes a reinforcement-learning approach to formal mathematical reasoning, while AlphaGeometry 2 is an enhanced version of a geometry-solving system. The AI systems were evaluated by prominent mathematicians, including Prof. Sir Timothy Gowers, who noted the impressive nature of the solutions generated. The problems were translated into formal mathematical language, allowing the AI to process and solve them effectively. AlphaProof successfully tackled two algebra problems and one number theory problem, including the most challenging problem of the competition, while AlphaGeometry 2 solved the geometry problem.
The systems scored a total of 28 points out of a possible 42, just shy of the gold medal threshold. This achievement highlights the potential of AI in mathematical reasoning and its ability to assist mathematicians in exploring complex problems and proofs. DeepMind plans to continue developing these systems and exploring various AI approaches to enhance mathematical reasoning further.
Related
Francois Chollet – LLMs won't lead to AGI – $1M Prize to find solution [video]
The video discusses limitations of large language models in AI, emphasizing genuine understanding and problem-solving skills. A prize incentivizes AI systems showcasing these abilities. Adaptability and knowledge acquisition are highlighted as crucial for true intelligence.
AI can beat real university students in exams, study suggests
A study from the University of Reading reveals AI outperforms real students in exams. AI-generated answers scored higher, raising concerns about cheating. Researchers urge educators to address AI's impact on assessments.
AI Revolutionized Protein Science, but Didn't End It
Artificial intelligence, exemplified by AlphaFold2 and its successor AlphaFold3, revolutionized protein science by predicting structures accurately. AI complements but doesn't replace traditional methods, emphasizing collaboration for deeper insights.
"Superhuman" Go AIs still have trouble defending against these simple exploits
Researchers at MIT and FAR AI found vulnerabilities in top AI Go algorithms, allowing humans to defeat AI with unorthodox strategies. Efforts to improve defenses show limited success, highlighting challenges in creating robust AI systems.
Can the New Mathstral LLM Accurately Compare 9.11 and 9.9?
Mathstral is a new 7B model by Mistral AI for math reasoning, with a 32k context window and Apache 2.0 license. It aims to improve common sense in math problem-solving, deployable locally with LlamaEdge and shareable via GaiaNet for customization and integration.
- Many commenters express skepticism about the AI's performance, noting that the problems were manually translated into formal language, which significantly alters the nature of the challenge.
- There is a discussion about the time taken by the AI to solve problems compared to human competitors, with some arguing that if humans had the same time, they could perform better.
- Several users highlight the need for more transparency regarding the AI's reasoning process and the nature of its solutions, questioning whether it truly understands the problems or is merely brute-forcing answers.
- Some commenters emphasize the potential of AI in mathematics, suggesting that it could revolutionize the field, while others caution against overhyping its current capabilities.
- Concerns are raised about the environmental impact of training such large models and whether the resources spent justify the results achieved.
> First, the problems were manually translated into formal mathematical language for our systems to understand.
The non-geometry problems which were solved were all of the form "Determine all X such that…", and the resulting theorem statements are all of the form "We show that the set of all X is {foo}". The downloadable solutions from https://storage.googleapis.com/deepmind-media/DeepMind.com/B... don't make it clear whether the set {foo} was decided by a human during this translation step, or whether the computer found it. I want to believe that the computer found it, but I can't find anything to confirm. Anyone know?
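For concreteness, a hypothetical Lean 4 statement of that shape (a toy example, not one of the actual 2024 problems) looks like this; the singleton {1} plays the role of {foo}, and whoever supplies that set has already done nontrivial work:

```lean
import Mathlib

-- Toy "determine all n" statement: the answer set {1} is baked into the theorem,
-- so the open question is whether a human or the system chose it.
example : {n : ℕ | n + n = 2} = {1} := by
  ext n
  simp only [Set.mem_setOf_eq, Set.mem_singleton_iff]
  omega
```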
That means that "AI solves IMO problems better than 75% of the students", which is probably even more impressive.
But, "minutes for one problem and up to 3 days for each remaining problem" means that this is unfortunately not a true representation either. If these students were given up to 15 days (5 problems at "up to 3 days each") instead of 9h, there would probably be more of them that match or beat this score too.
It really sounds like AI solved only a single problem in the 9h students get, so it certainly would not be even close to the medals. What's the need to taint the impressive result with apples-to-oranges comparison?
Why not be more objective and report that it took longer but was able to solve X% of problems (or scored X out of N points)?
This is important for more than Math problems. Making ML models wrestle with proof systems is a good way to avoid bullshit in general.
Hopefully more humans will write types in Lean and similar systems as a way of writing prompts.
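As a tiny sketch of what "types as prompts" could look like (a hypothetical example using Mathlib's sq_nonneg and add_nonneg): the signature is the specification, and any term that type-checks against it is a machine-verified answer.

```lean
import Mathlib

-- The statement is the "prompt"; the proof term is the verified "completion".
theorem sum_of_squares_nonneg (a b : ℤ) : 0 ≤ a ^ 2 + b ^ 2 :=
  add_nonneg (sq_nonneg a) (sq_nonneg b)
```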
Yes, it is true, but getting onto a country-specific team is itself an arduous journey, involving brutal winnowing at every step, e.g. the regional math olympiad, then the national math olympiad, etc.
This is then followed by further training specifically meant for this elite bunch, and maybe further eliminations.
Suffice it to say that qualifying for a country-specific team is, IMHO, a big deal. Getting a gold/silver from among them is just plain awesome!
Yet no one cares. Everyone's busy watching Magnus Carlsen.
We are human. This means we care about what other humans do. We only care about machines insofar as it serves us.
This principle is broadly extensible to work and art. Humans will always have a place in these realms as long as humans are around.
IMHO, the largest contributors to AlphaProof were the people behind Lean and Mathlib, who took upon themselves the daunting task of formalizing the entirety of mathematics.
This lack of formalization in math papers is what killed earlier attempts at automation, because AI researchers had to wrestle with the human element of figuring out each author's own notation, implicit knowledge, skipped proof steps...
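To give a sense of what Mathlib contributes (an illustrative snippet, not taken from AlphaProof itself): results like Euclid's theorem on primes are already stated and proved in the library, so a prover can invoke them directly instead of reverse-engineering notation and skipped steps from a paper.

```lean
import Mathlib

-- Mathlib already formalizes Euclid's theorem on the infinitude of primes:
#check @Nat.exists_infinite_primes   -- ∀ (n : ℕ), ∃ p, n ≤ p ∧ p.Prime

-- Using it off the shelf: there is a prime at least as large as any given bound.
example (n : ℕ) : ∃ p, n ≤ p ∧ Nat.Prime p := Nat.exists_infinite_primes n
```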
Note that the 6th question is generally the hardest (“final boss”) and many top performers couldn’t solve it.
I don’t know what Lean is or how to see the AI’s proofs, but an AI system that could explain such a question on par with the YouTuber above would be fantastic!
"... We'll be bringing all the goodness of AlphaProof and AlphaGeometry 2 to our mainstream #Gemini models very soon. Watch this space!" -- Demis Hassabis, CEO of Google DeepMind. https://x.com/demishassabis/status/1816499055880437909
Three days is interesting... Not technically silver medal performance I guess, but let's be real I'd be okay waiting a month for the cure to cancer.
Which means these problems are trivial to solve if you have a computer - you can simply check all possibilities. And that is precisely the reason why calculators aren't allowed.
But exhaustive searches are not feasible by hand in the time span the problems are supposed to be solved - roughly 30 minutes per problem. You are not supposed to use brute force, but recognize a key insight which simplifies the problem. And I believe even if you did do an exhaustive search, simply giving the answer is not enough for full points. You would have to give adequate justification.
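To illustrate the distinction with a made-up property (nothing to do with the actual problems): a machine can enumerate small cases in milliseconds to conjecture the answer set, but that guess is not yet a proof covering all cases.

```lean
-- Hypothetical "determine all n with property P": enumeration suggests an answer...
def P (n : Nat) : Bool := n % 3 == 0 && n % 4 == 0

#eval (List.range 50).filter P   -- [0, 12, 24, 36, 48], i.e. multiples of 12

-- ...but the general claim "P n = true ↔ 12 ∣ n" still needs a proof, not a table of cases.
```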
Since I think you need an account to read threads now, here's a transcript:
Google DeepMind have produced a program that in a certain sense has achieved a silver-medal performance at this year's International Mathematical Olympiad.
It did this by solving four of the six problems completely, which got it 28 points out of a possible total of 42. I'm not quite sure, but I think that put it ahead of all but around 60 competitors.
However, that statement needs a bit of qualifying.
The main qualification is that the program needed a lot longer than the human competitors -- for some of the problems over 60 hours -- and of course much faster processing speed than the poor old human brain.
If the human competitors had been allowed that sort of time per problem they would undoubtedly have scored higher.
Nevertheless, (i) this is well beyond what automatic theorem provers could do before, and (ii) these times are likely to come down as efficiency gains are made.
Another qualification is that the problems were manually translated into the proof assistant Lean, and only then did the program get to work. But the essential mathematics was done by the program: just the autoformalization part was done by humans.
As with AlphaGo, the program learnt to do what it did by teaching itself. But for that it needed a big collection of problems to work on. They achieved that in an interesting way: they took a huge database of IMO-type problems and got a large language model to formalize them.
However, LLMs are not able to autoformalize reliably, so they got them to autoformalize each problem many times. Some of the formalizations were correct, but even the incorrect ones were useful as training data, as often they were easier problems.
It's not clear what the implications of this are for mathematical research. Since the method used was very general, there would seem to be no obvious obstacle to adapting it to other mathematical domains, apart perhaps from insufficient data.
So we might be close to having a program that would enable mathematicians to get answers to a wide range of questions, provided those questions weren't too difficult -- the kind of thing one can do in a couple of hours.
That would be massively useful as a research tool, even if it wasn't itself capable of solving open problems.
Are we close to the point where mathematicians are redundant? It's hard to say. I would guess that we're still a breakthrough or two short of that.
It will be interesting to see how the time the program takes scales as the difficulty of the problems it solves increases. If it scales with a similar ratio to that of a human mathematician, then we might have to get worried.
But if the function human time taken --> computer time taken grows a lot faster than linearly, then more AI work will be needed.
The fact that the program takes as long as it does suggests that it hasn't "solved mathematics".
However, what it does is way beyond what a pure brute-force search would be capable of, so there is clearly something interesting going on when it operates. We'll all have to watch this space.
I wonder if this new model could be integrated with an LLM somehow? I get the feeling that combining those two powers would result in a fairly capable programmer.
Also, perhaps an LLM could do the translation step that is currently manual?
Also, it's making me think that in 5-10 years almost all tasks involving computer scientists or mathematicians will be done by AI. Perhaps people going into trades had a point.
Is it clear whether the algorithm is actually learning from why previously attempted solutions failed to prove out, or is it statistically generating potential answers similar to an LLM and then trying to apply reasoning to prove out the potential solution?
If the data is synthetic and covers a limited class of problems I would imagine what it's doing mostly reduces to some basic search pattern heuristics which would be of more value to understand than just being told it can solve a few problems in three days.
Wonder what "great promise" entails. Because it's hard to imagine Gemini and other transformer-based models solving these problems with reasonable accuracy, as there is no elimination of hallucination. At least in the generally available products.
I know speed is just a matter of engineering, but looks like we still have a ways to go. Hold the gong...
(It's not my main point, but it's always worth remembering - even aside from any AI context - that many top mathematicians can't do IMO-type problems, and many top IMO medalists turn out to be unable to solve actual problems in research mathematics. IMO problems are generally regarded as somewhat niche.)
I wonder if some class of problems will emerge that human competitors are able to solve but are particularly tricky for machines. And which characteristics these problems will have (e.g. they'll require some sort of intuition or visualization that is not easily formalized).
Given how much of a dent LLM's are already making on beginner competitions (AtCoder recently banned using them on ABC rounds [1]), I can't help but think that soon these competitions will be very different.
Anyone know any details?
>because of limitations in reasoning skills and training data
One would assume that mathematical literature and training data would be abundant. Is there a simple example that could help appreciate the Gemini bridge layer mentioned in the blog which produces the input for RL in Lean?
I wonder because on one hand they seem very impressive and groundbreaking, on the other it’s hard to imagine why more than a handful of researchers would use them
For example, suppose a computer is asked to prove the sum of two even numbers is an even number. It could pull up its list of “things it knows about even numbers”, namely that an even number modulo 2 is 0. Assuming the first number is “a” and the second is “b”, then it knows a=2x and b=2y for some x and y. It then knows via the distributive property that the sum is 2(x+y), which satisfies the definition of an even number.
What am I missing that makes this problem so much harder than applying a finite and known set of axioms and manipulations?
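For what it's worth, that exact argument is only a few lines in Lean 4 (a sketch of the a = 2x, b = 2y reasoning above, using Mathlib's ring tactic):

```lean
import Mathlib

-- Sum of two even integers is even, following the a = 2x, b = 2y argument.
example (a b : ℤ) (ha : ∃ x, a = 2 * x) (hb : ∃ y, b = 2 * y) :
    ∃ z, a + b = 2 * z := by
  obtain ⟨x, hx⟩ := ha
  obtain ⟨y, hy⟩ := hb
  exact ⟨x + y, by rw [hx, hy]; ring⟩
```

Statements at this level really are mechanical; the difficulty with olympiad problems appears to be that the space of applicable lemmas and intermediate goals is enormously larger, so the hard part is the search, not the bookkeeping.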
That is more than half the work of solving them. Headline should read "AI solves the simple part of each IMO problem at silver medal level".
Real, exact quotes from the top comments at 1 PM EST.
"I want to believe that the computer found it, but I can't find anything to confirm."
"Curing cancer will require new ideas"
"Maybe they used 10% of all of GCP [Google compute]"
So it failed at the first step (comprehension) and hence I think we can request a better effort next time.
(probably confused by version numbers)
IMO contestants aren't allowed to bring in paper tables, much less a whole theorem prover. They're given two 4.5 hour sessions (9 hours total) to solve all the problems with nothing but pencils, rulers, and compasses [0].
This model, meanwhile, was wired up to a theorem prover and took three solid days to solve the problems. The article is extremely light on details, but I'm assuming that most of that time was guess-and-check: feed the theorem prover a possible answer, get feedback, adjust accordingly.
If the IMO contestants were given a theorem prover and three days (even counting breaks for sleeping and eating!), how would AlphaProof have ranked?
Don't get me wrong, this is a fun project and an exciting result, but their comparison to silver medalists at the IMO is just feeding into the excessive hype around AI, not accurately representing its current state relative to humanity.
[0] 5.1 and 5.4 in the regulations: https://www.imo-official.org/documents/RegulationsIMO.pdf
In 2016, machines defeated a World Go Champion for the first time, using a clever form of "dumb search" that leverages compute, DNNs, reinforcement learning (RL), and self-play. Critics noted that while this fancy form of "dumb search" worked for Go, it might not necessarily be a general strategy applicable to other cognitive tasks.[a]
In 2024, machines solved insanely hard math problems at the Silver Medal level in an International Math Olympiad for the first time, using a clever form of "dumb search" that leverages compute, DNNs, RL, and a formal language. Perhaps "dumb search" over cleverly pruned spaces isn't as dumb as the critics would like it to be?
---
[a] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
> In the official competition, students submit answers in two sessions of 4.5 hours each. Our systems solved one problem within minutes and took up to three days to solve the others.
Why not compare with students who are given 3 days to submit an answer ? /s
Not open-sourcing anything.
This is a dead end on which no further research can be built.
It violates pretty much every principle of incremental improvement on which science is based. It's here just for hype, and the 300+ comments prove it.
Got to be kidding me. We are fucked
I don't really find this that impressive. With enough compute you could just do n-of-10,000 LLM generations to "brute force" a difficult problem and you'll get there eventually.
Sounds perfect for a GPT model, with lots of input training data (problem books and solutions).
The proofs of these problems aren't interesting. They were already known before the AI started work.
What's interesting is how the AI found the proof. The only answer we have is "slurped data into a neural network, matched patterns, and did some brute search".
What were the ideas it brainstormed? What were the dead-end paths? What were the "activations" where the problem seemed similar to a certain piece of input, which led to a guess of a step in the solution?
And how much CO2 was released into Earth's atmosphere?