Can AI do maths yet? Thoughts from a mathematician
OpenAI's o3 scored 25% on the FrontierMath dataset, indicating progress in AI's mathematical capabilities, but experts believe it still lacks the innovative thinking required for advanced mathematics.
OpenAI's new language model, o3, recently scored 25% on the FrontierMath dataset, which consists of challenging mathematical problems designed to have definitive, computable answers. The dataset, curated by Epoch AI, is secretive to prevent language models from training on it directly. The problems are not about proving theorems but rather finding specific numbers, making them difficult even for experienced mathematicians. While o3's performance indicates progress in AI's mathematical capabilities, experts believe it still falls short of the innovative thinking required at advanced levels of mathematics. The mathematician quoted in the article expressed surprise at the model's score, as they expected AI to remain at a lower competency level for some time. The discussion also highlights the distinction between numerical problem-solving and theorem proving, with the latter being a more complex challenge. DeepMind's AlphaProof has shown success in solving International Mathematics Olympiad problems, but the debate continues regarding the grading of AI-generated solutions. The future of AI in mathematics remains uncertain, with ongoing advancements but significant hurdles to overcome before AI can match human mathematicians in creativity and understanding.
- OpenAI's o3 scored 25% on the challenging FrontierMath dataset.
- The dataset consists of problems requiring specific numerical answers, not proofs.
- Experts believe AI is still far from achieving advanced mathematical reasoning.
- DeepMind's AlphaProof has successfully solved some Olympiad problems.
- The distinction between numerical problem-solving and theorem proving is crucial in evaluating AI's capabilities.
Related
AI solves IMO problems at silver medal level
Google DeepMind's AI systems, AlphaProof and AlphaGeometry 2, solved four out of six International Mathematical Olympiad problems, achieving a silver medalist level, marking a significant milestone in AI mathematical reasoning.
Google DeepMind's AI systems can now solve complex math problems
Google DeepMind's AI systems, AlphaProof and AlphaGeometry 2, solved four of six problems from the International Mathematical Olympiad, achieving a silver medal and marking a significant advancement in AI mathematics capabilities.
New secret math benchmark stumps AI models and PhDs alike
Epoch AI has launched FrontierMath, a challenging benchmark for AI and mathematicians, with leading models solving under 2% of its expert-level problems, highlighting current AI limitations and requiring human expertise.
New secret math benchmark stumps AI models and PhDs alike
Epoch AI has launched FrontierMath, a challenging benchmark for AI and mathematicians, with models solving under 2% of expert-level problems. It remains unpublished to ensure fair assessments and future evaluations.
OpenAI O3 breakthrough high score on ARC-AGI-PUB
OpenAI's o3 system scored 75.7% on the ARC-AGI-Pub benchmark, showing improved adaptability but still not classified as AGI. Upcoming challenges may further test its capabilities. Community analysis is encouraged.
- Many commenters acknowledge AI's potential as a tool to enhance human mathematical abilities rather than replace them.
- Concerns are raised about the reliability of AI in performing complex mathematical tasks, with several users sharing experiences of AI making fundamental errors.
- There is skepticism regarding the validity of the FrontierMath dataset, with some suggesting it may have been compromised.
- Commenters express a fear of obsolescence in the field of mathematics due to advancements in AI, while others argue that human creativity and intuition remain irreplaceable.
- Overall, the conversation reflects a broader debate about the role of AI in academia and its implications for the future of mathematical research.
I'm a research mathematician. In the 1980s I'd ask everyone I knew a question, and flip through the hardbound library volumes of Mathematical Reviews, hoping to recognize something. If I was lucky, I'd get a hit in three weeks.
Internet search has shortened this turnaround. One instead needs to guess what someone else might call an idea. "Broken circuits?" Score! Still, time consuming.
I went all in on ChatGPT after hearing that Terry Tao had learned the Lean 4 proof assistant in a matter of weeks, relying heavily on AI advice. It's clumsy, but a very fast way to get suggestions.
Now, one can hold involved conversations with ChatGPT or Claude, exploring mathematical ideas. AI is often wrong and never knows when it's wrong, but people are like this too. Have you read how insurance incident rates for self-driving taxis are well below human rates? Talking to fellow mathematicians can be frustrating, and so is talking with AI, but AI conversations go faster and can take place in the middle of the night.
I don't want AI to prove theorems for me, those theorems will be as boring as most of the dreck published by humans. I want AI to inspire bursts of creativity in humans.
Also Glazer seemed to regret calling T1 "IMO/undergraduate", and not only because of the disparity between IMO and typical undergraduate. He said that "We bump problems down a tier if we feel the difficulty comes too heavily from applying a major result, even in an advanced field, as a black box, since that makes a problem vulnerable to naive attacks from models"
Also, all of the problems shown to Tao were T3.
O1 is a lot better at spotting its errors than 4o but it too still makes a lot of really stupid mistakes. It seems to be quite far from producing results itself consistently without at least a somewhat clueful human doing hand-holding.
(Re)imagining mathematics in a world of reasoning machines, by Akshay Venkatesh
https://www.youtube.com/watch?v=vYCT7cw0ycw [54min]
Abstract: In the coming decades, developments in automated reasoning will likely transform the way that research mathematics is conceptualized and carried out. I will discuss some ways we might think about this. The talk will not be about current or potential abilities of computers to do mathematics—rather I will look at topics such as the history of automation and mathematics, and related philosophical questions.
See discussion at https://news.ycombinator.com/item?id=42465907
But I'm wondering what other people think of this analogy.
I used to be a bench scientist (molecular genetics).
There were world class researchers who were more creative than I was. I even had a Nobel Laureate once tell me that my research was simply "dotting 'i's and crossing 't's".
Nevertheless, I still moved the field forward in my own small ways. I still did respectable work.
So, will these LLMs make us completely obsolete? Or will there still be room for those of us who can dot the "i", if only because LLMs don't have infinite time and resources to solve "everything"?
I don't know. Maybe I'm whistling past the graveyard.
Historically, the claim that neural nets were actual models of the human brain and human thinking was always epistemically dubious. It still is. Even as the practical problems of producing better and better algorithms, architectures, and output have been solved, there is no reason to believe a connection between the mechanical model and what happens in organisms has been established. The most important point, in my view, is that all of the representation and interpretation still has to happen outside the computational units. Without human interpreters, none of the AI outputs have any meaning. Unless you believe in determinism and an overseeing god, the story for human beings is much different. AI will not be capable of reason until, like humans, it can develop socio-rational collectivities of meaning that are independent of the human being.
Researchers seemed to have a decent grasp on this in the 90s, but today, everyone seems all too ready to make the same ridiculous leaps as the original creators of neural nets. They did not show, as they claimed, that thinking is reducible to computation. All they showed was that a neural net can realize a boolean function—which is not even logic, since, again, the entire semantic interpretive side of the logic is ignored.
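As a concrete aside on the "realize a Boolean function" point (my own illustration, not the commenter's): a tiny hand-built two-layer network with step activations realizes XOR. The weights and names below are made up for the sketch; nothing is learned.

```python
# A hand-built two-layer network computing XOR with step activations.
# Weights fixed by hand; illustrative sketch only.
import numpy as np

def step(z):
    # Heaviside step activation
    return (z >= 0).astype(int)

def xor_net(x1, x2):
    x = np.array([x1, x2])
    # Hidden layer: first unit fires for "x1 OR x2", second for "x1 AND x2".
    W1 = np.array([[1, 1],
                   [1, 1]])
    b1 = np.array([-0.5, -1.5])
    h = step(W1 @ x + b1)
    # Output unit: OR minus AND, i.e. XOR.
    w2 = np.array([1, -1])
    b2 = -0.5
    return int(w2 @ h + b2 >= 0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))  # prints the XOR truth table
```

The semantics, of course, live entirely in how a reader labels the inputs and outputs, which is the commenter's point.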
1. I asked it to show me the derivation of a formula for the efficiency of Stop-and-Wait ARQ and it seemed to do it, but a day later, I realised that in one of the steps, it just made a term vanish to get to the next step. Obviously, I should have verified more carefully, but when I asked it to spot the mistake in that step, it did the same thing twice more with bs explanations of how the term is absorbed. (The standard textbook derivation is sketched below for reference.)
2. I asked it to provide me syllogisms that I could practice proving. An overwhelming number of the syllogisms it gave me were inconsistent and did not hold. This surprised me more because syllogisms are about the most structured arguments you can find, having been formalized centuries ago and discussed extensively since then. In this case, asking it to walk step-by-step actually fixed the issue.
Both of these were done on the free plan of ChatGPT, but I can't remember if it was 4o or 4.
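For reference, the textbook Stop-and-Wait ARQ efficiency derivation the first example refers to goes roughly as follows, under the usual simplifying assumptions (ACK transmission and processing times neglected). The symbols L, R, t_prop, and P are the standard textbook parameters, not anything from the comment.

```latex
% Stop-and-Wait ARQ efficiency, textbook version (ACK transmission and
% processing times neglected; frame length L bits, link rate R bit/s,
% one-way propagation delay t_prop, frame error probability P).
\begin{align*}
  t_{\mathrm{frame}} &= \frac{L}{R}, \qquad a = \frac{t_{\mathrm{prop}}}{t_{\mathrm{frame}}},\\
  U_{\text{error-free}} &= \frac{t_{\mathrm{frame}}}{t_{\mathrm{frame}} + 2\,t_{\mathrm{prop}}} = \frac{1}{1 + 2a},\\
  U_{\text{with errors}} &= \frac{1 - P}{1 + 2a}
  \qquad \text{(each frame needs } \tfrac{1}{1-P} \text{ transmissions on average)}.
\end{align*}
```

No term silently vanishes between steps; the error-prone case just divides the error-free efficiency by the expected number of transmissions per frame.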
In the same way, ChatGPT scores 25% on this and the question is "How close were those 25% to questions in the training set?" Or, to put it another way, we want to answer the question "Is ChatGPT getting better at applying its reasoning to out-of-set problems, or is it pulling more data into its training set?" Or "Is the test leaking into the training?"
Maybe the whole question is academic and it doesn't matter: we solve the entire problem by pulling all human knowledge into the training set, and that's a massive benefit. But maybe it implies a limit to how far it can push human knowledge forward.
Well, yes and no. This is only true because we are talking about closed models from closed companies like so-called "OpenAI".
But if all models were truly open, then we could simply verify what they had been trained on, and make experiments with models that we could be sure had never seen the dataset.
Decades ago Microsoft (in the words of Ballmer and Gates) famously accused open source of being a "cancer" because of the cascading nature of the GPL.
But it's the opposite. In software, and in knowledge in general, the true disease is secrecy.
I wonder what the response of working mathematicians will be to this. If the proofs look credible it might be too tempting to try and validate them, but if there's a deluge that could be a huge time sink. Imagine if Wiles or Perelman had produced a thousand different proofs for their respective problems.
It helps me a lot when I feel lost. It's often wrong in the calculations, but it's cool to have a study buddy that doesn't judge you.
If I get blocked with a problem I can't solve, I ask for assistance with my approach.
I enjoy asking ChatGPT about the context behind all that math theory. It's nice to elaborate on that as most of the math books are very lean and provide no applied context.
But the problem then is that one can suppose there are also true short statements in ZFC which likewise require doubly exponential time to reach via any path. Presburger Arithmetic is decidable whereas ZFC is not, so these statements would require the additional axioms of ZFC for shorter proofs, but I think it's safe to assume such statements exist.
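For context, the doubly exponential bound being invoked is, if I recall correctly, the classical Fischer–Rabin lower bound for Presburger arithmetic, stated loosely:

```latex
% Fischer–Rabin (1974), stated loosely: any decision procedure for
% Presburger arithmetic takes doubly exponential time on some inputs.
% That is, there is a constant c > 0 such that the worst-case running
% time T(n) over sentences of length n satisfies
T(n) \;\ge\; 2^{\,2^{\,c n}} \qquad \text{for all sufficiently large } n.
```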
Now let's suppose an AI model can resolve the truth of these short statements quickly. That means one of three things:
1) The AI model can discover doubly exponential length proof paths within the framework of ZFC.
2) There are certain short statements in the formal language of ZFC that the AI model cannot discover the truth of.
3) The AI model operates outside of ZFC to find the truth of statements in the framework of some other, potentially unknown formal system (and for arithmetical statements, the system must necessarily be sound).
How likely are each of these outcomes?
1) is not possible within any coherent, human-scale timeframe.
2) IMO is the most likely outcome, but then this means there are some really interesting things in mathematics that AI cannot discover. Perhaps the same set of things that humans find interesting. Once we have exhausted the theorems with short proofs in ZFC, there will still be an infinite number of short and interesting statements that we cannot resolve.
3) This would be the most bizarre outcome of all. If AI operates in a consistent way outside the framework of ZFC, then that would be equivalent to solving the halting problem for certain (infinite) sets of Turing machine configurations that ZFC cannot solve. That in itself isn't too strange (e.g., it might turn out that ZFC lacks an axiom necessary to prove something as simple as the Collatz conjecture), but what would be strange is that it could find these new formal systems efficiently. In other words, it would have discovered an algorithmic way to procure new axioms that lead to efficient proofs of true arithmetic statements. One could also view that as an efficient algorithm for computing BB(n), which obviously we think isn't possible. See Levin's papers on the feasibility of extending PA in a way that leads to quickly discovering more of the halting sequence.
Society is CLEARLY not ready for what AI's impact is going to be. We've been through change before, but never at this scale and speed. I think Musk/Vivek's DOGE thing is important; our government has gotten quite large and bureaucratic. But the clock has started on AI, and this is a social structural issue we've gotta figure out. Putting it off means we probably become subjects of a default set of rulers, if not of the shoggoth itself.
But I am confident the answer to the question in the headline is "no, not for several decades." It's not just the underwhelming benchmark results discussed in the post, or the general concern about hard undergraduate math using different skillsets than ordinary research math. IMO the deeper problem still seems to be a basic gap where LLMs can seemingly do formal math at the level of a smart graduate student but fail at quantitative/geometric reasoning problems designed for fish. I suspect this holds for O3, based on one of the ARC problems it wasn't able to solve: https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr... (via https://www.interconnects.ai/p/openais-o3-the-2024-finale-of...) ANNs are simply not able to form abstractions, they can only imitate them via enormous amounts of data and compute. I would say there has been zero progress on "common sense" math in computers since the invention of Lisp: we are still faking it with expert systems, even if LLM expert systems are easier to build at scale with raw data.
It is the same old problem where an ANN can attain superhuman performance on level 1 of Breakout, but it has to be retrained for level 2. I am not convinced it makes sense to say AI can do math if AI doesn't understand what "four" means with the same depth as a rat, even if it can solve sophisticated modular arithmetic problems. In human terms, does it make sense to say a straightedge-and-compass AI understands Euclidean geometry if it's not capable of understanding the physical intuition behind Euclid's axioms? It makes more sense to say it's a brainless tool that helps with the tedium and drudgery of actually proving things in mathematics.
The database stopped being secret when it was fed to proprietary LLMs running in the cloud. If anyone is not thinking that OpenAI has trained and tuned O3 on the "secret" problems people fed to GPT-4o, I have a bridge to sell you.
I understood the statements of all five questions. I could do the third one relatively quickly (I had seen the trick before: the function mapping a natural n to alpha^n is p-adically continuous in n iff the p-adic valuation of alpha-1 is positive).
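Spelled out (my paraphrase of the standard fact, not the commenter's exact formulation):

```latex
% Standard fact (my phrasing): for a p-adic unit \alpha, the map
% n \mapsto \alpha^n on the natural numbers is continuous for the p-adic
% topology on n if and only if v_p(\alpha - 1) > 0, essentially because
\alpha^{n} \;=\; \bigl(1 + (\alpha - 1)\bigr)^{n}
          \;=\; \sum_{k \ge 0} \binom{n}{k}\,(\alpha - 1)^{k},
% and when v_p(\alpha - 1) > 0 the terms are p-adically small and depend
% continuously (polynomially) on n, so n \mapsto \alpha^n extends to a
% continuous function on \mathbb{Z}_p.
```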
Seems to me the answer to 'Can AI do maths yet?' depends on what you call AI and what you call maths. Our old departmental VAX running at a handful of megahertz could do some very clever symbol manipulation on binomials and, if you gave it a few seconds, it could even do something like theorem proving via proto-prolog. Neither is anywhere close to the glorious GAI future we hope to sell to industry and government, but it seems worth considering how they're different, why they worked, and whether there's room for some hybrid approach. Do LLMs need to know how to do math if they know how to write Prolog or Coq statements that can do interesting things?
I've heard people say they want to build software that emulates (simulates?) how humans do arithmetic, but ask a human to add anything bigger than two digit numbers and the first thing they do is reach for a calculator.
So the higher the cost, the better the performance. While models and hardware can be improved, the curve is still steep.
The big question is what people are using it for. Well, they are using lightweight, simplistic models to do targeted tasks: many smaller, easier-to-process tasks.
Most of the news on AI is just there to promote a product to earn more cash.
If that's referring to Large Language Models, meaning everything after the first GPT and BERT, then that's absolutely not right. The first LLM that demonstrated the ability to generate coherent, fluently grammatical English was GPT-2. That story about the unicorns: that was the first time a statistical language model was able to generate text that stayed on the subject over a long distance and made (some) sense.
GPT-2 was followed by GPT 3 and GPT 3.5 that turned the hype dial up to 11 and were certainly "public" at least if that means publicly available. They were coherent enough that many people predicted all sorts of fancy things, like the end of programming jobs and the end of journalist jobs and so on.
So, weird statement that one and it kind of makes me wary of Gell-Mann amnesia while reading the article.
Now, all of that will be done by AI.
Reminds me of the time when I finally enabled invincibility in Goldeneye 007. Rather boring.
I think we've stopped appreciating the human struggle and experience and have placed all the value on the end product, and that's why we're developing AI so much.
Yeah, there is the possibility of working with an AI but at that point, what is the point? Seems rather pointless to me in an art like mathematics.
Money cannot solve the issues faced by the industry, which mainly revolve around a lack of training data.
They have already used the entirety of the internet and all available video, audio, and books, and they are now dealing with the fact that most content online is now generated by these models, making it useless as training data.
The answer is yes, it can utilize a stateful python environment and solve complex mathematical equations with ease.
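"Utilize a stateful Python environment" presumably means delegating the symbolic work to a tool call. A minimal sketch of that kind of computation, assuming sympy is available in the environment (the particular equation and integral are just illustrative):

```python
# The kind of computation a model can offload to a Python tool call
# instead of doing it "in its head". Assumes sympy is installed.
import sympy as sp

x = sp.symbols('x')

# Solve a quartic equation exactly.
roots = sp.solve(sp.Eq(x**4 - 5*x**2 + 4, 0), x)
print(roots)   # [-2, -1, 1, 2]

# Evaluate a definite integral symbolically.
val = sp.integrate(sp.exp(-x**2), (x, -sp.oo, sp.oo))
print(val)     # sqrt(pi)
```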
I really really wish AI would make some breakthrough and be really useful, but I am so skeptical and negative about it.
But maths is also a fun and fulfilling activity. Very often, when we learn a math theory, it's because we want to understand and gain intuition about the concepts, or we want to solve a puzzle (for which we could already look up the solution). Maybe it's similar to chess: we didn't develop chess engines to replace human players and make them play each other, but they helped us become better chess players and understand the game better.
So the recent progress is impressive, but I still don't see how we'll use this tech practically, what impact it can have, and in which fields.