Can AI do maths yet? Thoughts from a mathematician
OpenAI's o3 scored 25% on the FrontierMath dataset, indicating progress in AI's mathematical capabilities, but experts believe it still lacks the innovative thinking required for advanced mathematics.
OpenAI's new language model, o3, recently scored 25% on the FrontierMath dataset, which consists of challenging mathematical problems designed to have definitive, computable answers. The dataset, curated by Epoch AI, is secretive to prevent language models from training on it directly. The problems are not about proving theorems but rather finding specific numbers, making them difficult even for experienced mathematicians. While o3's performance indicates progress in AI's mathematical capabilities, experts believe it still falls short of the innovative thinking required at advanced levels of mathematics. The mathematician quoted in the article expressed surprise at the model's score, as they expected AI to remain at a lower competency level for some time. The discussion also highlights the distinction between numerical problem-solving and theorem proving, with the latter being a more complex challenge. DeepMind's AlphaProof has shown success in solving International Mathematics Olympiad problems, but the debate continues regarding the grading of AI-generated solutions. The future of AI in mathematics remains uncertain, with ongoing advancements but significant hurdles to overcome before AI can match human mathematicians in creativity and understanding.
- OpenAI's o3 scored 25% on the challenging FrontierMath dataset.
- The dataset consists of problems requiring specific numerical answers, not proofs.
- Experts believe AI is still far from achieving advanced mathematical reasoning.
- DeepMind's AlphaProof has successfully solved some Olympiad problems.
- The distinction between numerical problem-solving and theorem proving is crucial in evaluating AI's capabilities.
Related
AI solves IMO problems at silver medal level
Google DeepMind's AI systems, AlphaProof and AlphaGeometry 2, solved four out of six International Mathematical Olympiad problems, achieving a silver medalist level, marking a significant milestone in AI mathematical reasoning.
Google DeepMind's AI systems can now solve complex math problems
Google DeepMind's AI systems, AlphaProof and AlphaGeometry 2, solved four of six problems from the International Mathematical Olympiad, achieving a silver medal and marking a significant advancement in AI mathematics capabilities.
New secret math benchmark stumps AI models and PhDs alike
Epoch AI has launched FrontierMath, a challenging benchmark for AI and mathematicians, with leading models solving under 2% of its expert-level problems, highlighting current AI limitations and requiring human expertise.
New secret math benchmark stumps AI models and PhDs alike
Epoch AI has launched FrontierMath, a challenging benchmark for AI and mathematicians, with models solving under 2% of expert-level problems. It remains unpublished to ensure fair assessments and future evaluations.
OpenAI O3 breakthrough high score on ARC-AGI-PUB
OpenAI's o3 system scored 75.7% on the ARC-AGI-Pub benchmark, showing improved adaptability but still not classified as AGI. Upcoming challenges may further test its capabilities. Community analysis is encouraged.
- Many commenters acknowledge AI's potential as a tool to enhance human mathematical abilities rather than replace them.
- Concerns are raised about the reliability of AI in performing complex mathematical tasks, with several users sharing experiences of AI making fundamental errors.
- There is skepticism regarding the validity of the FrontierMath dataset, with some suggesting it may have been compromised.
- Commenters express a fear of obsolescence in the field of mathematics due to advancements in AI, while others argue that human creativity and intuition remain irreplaceable.
- Overall, the conversation reflects a broader debate about the role of AI in academia and its implications for the future of mathematical research.
I'm a research mathematician. In the 1980s I'd ask everyone I knew a question, and flip through the hardbound library volumes of Mathematical Reviews, hoping to recognize something. If I was lucky, I'd get a hit in three weeks.
Internet search has shortened this turnaround. One instead needs to guess what someone else might call an idea. "Broken circuits?" Score! Still, time consuming.
I went all in on ChatGPT after hearing that Terry Tao had learned the Lean 4 proof assistant in a matter of weeks, relying heavily on AI advice. It's clumsy, but a very fast way to get suggestions.
Now, one can hold involved conversations with ChatGPT or Claude, exploring mathematical ideas. AI is often wrong and never knows when it's wrong, but people are like this too. Have you read how insurance incident rates for self-driving taxis are well below human rates? Talking to fellow mathematicians can be frustrating, and so is talking with AI, but AI conversations go faster and can take place in the middle of the night.
I don't want AI to prove theorems for me, those theorems will be as boring as most of the dreck published by humans. I want AI to inspire bursts of creativity in humans.
Also Glazer seemed to regret calling T1 "IMO/undergraduate", and not only because of the disparity between IMO and typical undergraduate. He said that "We bump problems down a tier if we feel the difficulty comes too heavily from applying a major result, even in an advanced field, as a black box, since that makes a problem vulnerable to naive attacks from models"
Also, all of the problems shown to Tao were T3.
O1 is a lot better at spotting its errors than 4o but it too still makes a lot of really stupid mistakes. It seems to be quite far from producing results itself consistently without at least a somewhat clueful human doing hand-holding.
(Re)imagining mathematics in a world of reasoning machines, by Akshay Venkatesh
https://www.youtube.com/watch?v=vYCT7cw0ycw [54min]
Abstract: In the coming decades, developments in automated reasoning will likely transform the way that research mathematics is conceptualized and carried out. I will discuss some ways we might think about this. The talk will not be about current or potential abilities of computers to do mathematics—rather I will look at topics such as the history of automation and mathematics, and related philosophical questions.
See discussion at https://news.ycombinator.com/item?id=42465907
But I'm wondering what other people think of this analogy.
I used to be a bench scientist (molecular genetics).
There were world class researchers who were more creative than I was. I even had a Nobel Laureate once tell me that my research was simply "dotting 'i's and crossing 't's".
Nevertheless, I still moved the field forward in my own small ways. I still did respectable work.
So, will these LLMs make us completely obsolete? Or will there still be room for those of us who can dot the "i", if only because LLMs don't have infinite time and resources to solve "everything"?
I don't know. Maybe I'm whistling past the graveyard.
Historically, the claim that neural nets were actual models of the human brain and human thinking was always epistemically dubious. It still is. Even as the practical problems of producing better and better algorithms, architectures, and output have been solved, there is no reason to believe a connection between the mechanical model and what happens in organisms has been established. The most important point, in my view, is that all of the representation and interpretation still has to happen outside the computational units. Without human interpreters, none of the AI outputs have any meaning. Unless you believe in determinism and an overseeing god, the story for human beings is much different. AI will not be capable of reason until, like humans, it can develop socio-rational collectivities of meaning that are independent of the human being.
Researchers seemed to have a decent grasp on this in the 90s, but today, everyone seems all too ready to make the same ridiculous leaps as the original creators of neural nets. They did not show, as they claimed, that thinking is reducible to computation. All they showed was that a neural net can realize a boolean function—which is not even logic, since, again, the entire semantic interpretive side of the logic is ignored.
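As a concrete aside on the "realize a Boolean function" point (my own illustration, not the commenter's): a tiny hand-built two-layer network with step activations realizes XOR. The weights and names below are made up for the sketch; nothing is learned.

```python
# A hand-built two-layer network computing XOR with step activations.
# Weights fixed by hand; illustrative sketch only.
import numpy as np

def step(z):
    # Heaviside step activation
    return (z >= 0).astype(int)

def xor_net(x1, x2):
    x = np.array([x1, x2])
    # Hidden layer: first unit fires for "x1 OR x2", second for "x1 AND x2".
    W1 = np.array([[1, 1],
                   [1, 1]])
    b1 = np.array([-0.5, -1.5])
    h = step(W1 @ x + b1)
    # Output unit: OR minus AND, i.e. XOR.
    w2 = np.array([1, -1])
    b2 = -0.5
    return int(w2 @ h + b2 >= 0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))  # prints the XOR truth table
```

The semantics, of course, live entirely in how a reader labels the inputs and outputs, which is the commenter's point.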
1. I asked it to show me the derivation of a formula for the efficiency of Stop-and-Wait ARQ and it seemed to do it, but a day later, I realised that in one of the steps, it just made a term vanish to get to the next step. Obviously, I should have verified more carefully, but when I asked it to spot the mistake in that step, it did the same thing twice more with bs explanations of how the term is absorbed. (The standard textbook derivation is sketched below for reference.)
2. I asked it to provide me syllogisms that I could practice proving. An overwhelming number of the syllogisms it gave me were inconsistent and did not hold. This surprised me more because syllogisms are about the most structured arguments you can find, having been formalized centuries ago and discussed extensively since then. In this case, asking it to walk step-by-step actually fixed the issue.
Both of these were done on the free plan of ChatGPT, but I can't remember if it was 4o or 4.
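For reference, the textbook Stop-and-Wait ARQ efficiency derivation the first example refers to goes roughly as follows, under the usual simplifying assumptions (ACK transmission and processing times neglected). The symbols L, R, t_prop, and P are the standard textbook parameters, not anything from the comment.

```latex
% Stop-and-Wait ARQ efficiency, textbook version (ACK transmission and
% processing times neglected; frame length L bits, link rate R bit/s,
% one-way propagation delay t_prop, frame error probability P).
\begin{align*}
  t_{\mathrm{frame}} &= \frac{L}{R}, \qquad a = \frac{t_{\mathrm{prop}}}{t_{\mathrm{frame}}},\\
  U_{\text{error-free}} &= \frac{t_{\mathrm{frame}}}{t_{\mathrm{frame}} + 2\,t_{\mathrm{prop}}} = \frac{1}{1 + 2a},\\
  U_{\text{with errors}} &= \frac{1 - P}{1 + 2a}
  \qquad \text{(each frame needs } \tfrac{1}{1-P} \text{ transmissions on average)}.
\end{align*}
```

No term silently vanishes between steps; the error-prone case just divides the error-free efficiency by the expected number of transmissions per frame.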
In the same way, ChatGPT scores 25% on this and the question is "How close were those 25% to questions in the training set?" Or, to put it another way, we want to answer the question "Is ChatGPT getting better at applying its reasoning to out-of-set problems, or is it pulling more data into its training set?" Or "Is the test leaking into the training?"
Maybe the whole question is academic and it doesn't matter: we solve the entire problem by pulling all human knowledge into the training set, and that's a massive benefit. But maybe it implies a limit to how far it can push human knowledge forward.
Well, yes and no. This is only true because we are talking about closed models from closed companies like so-called "OpenAI".
But if all models were truly open, then we could simply verify what they had been trained on, and make experiments with models that we could be sure had never seen the dataset.
Decades ago Microsoft (in the words of Ballmer and Gates) famously accused open source of being a "cancer" because of the cascading nature of the GPL.
But it's the opposite. In software, and in knowledge in general, the true disease is secrecy.
I wonder what the response of working mathematicians will be to this. If the proofs look credible it might be too tempting to try and validate them, but if there's a deluge that could be a huge time sink. Imagine if Wiles or Perelman had produced a thousand different proofs for their respective problems.
It helps me a lot when I feel lost. It's often wrong in the calculations, but it's cool to have a study buddy that doesn't judge you.
If I get blocked with a problem I can't solve, I ask for assistance with my approach.
I enjoy asking ChatGPT about the context behind all that math theory. It's nice to elaborate on that as most of the math books are very lean and provide no applied context.
But the problem then is that one can suppose there are also true short statements in ZFC which likewise require doubly exponential time to reach via any path. Presburger Arithmetic is decidable whereas ZFC is not, so these statements would require the additional axioms of ZFC for shorter proofs, but I think it's safe to assume such statements exist.
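For context, the doubly exponential bound being invoked is, if I recall correctly, the classical Fischer–Rabin lower bound for Presburger arithmetic, stated loosely:

```latex
% Fischer–Rabin (1974), stated loosely: any decision procedure for
% Presburger arithmetic takes doubly exponential time on some inputs.
% That is, there is a constant c > 0 such that the worst-case running
% time T(n) over sentences of length n satisfies
T(n) \;\ge\; 2^{\,2^{\,c n}} \qquad \text{for all sufficiently large } n.
```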
Now let's suppose an AI model can resolve the truth of these short statements quickly. That means one of three things:
1) The AI model can discover doubly exponential length proof paths within the framework of ZFC.
2) There are certain short statements in the formal language of ZFC that the AI model cannot discover the truth of.
3) The AI model operates outside of ZFC to find the truth of statements in the framework of some other, potentially unknown formal system (and for arithmetical statements, the system must necessarily be sound).
How likely are each of these outcomes?
1) is not possible within any coherent, human-scale timeframe.
2) IMO is the most likely outcome, but then this means there are some really interesting things in mathematics that AI cannot discover. Perhaps the same set of things that humans find interesting. Once we have exhausted the theorems with short proofs in ZFC, there will still be an infinite number of short and interesting statements that we cannot resolve.
3) This would be the most bizarre outcome of all. If AI operates in a consistent way outside the framework of ZFC, then that would be equivalent to solving the halting problem for certain (infinite) sets of Turing machine configurations that ZFC cannot solve. That in itself isn't too strange (e.g., it might turn out that ZFC lacks an axiom necessary to prove something as simple as the Collatz conjecture), but what would be strange is that it could find these new formal systems efficiently. In other words, it would have discovered an algorithmic way to procure new axioms that lead to efficient proofs of true arithmetic statements. One could also view that as an efficient algorithm for computing BB(n), which obviously we think isn't possible. See Levin's papers on the feasibility of extending PA in a way that leads to quickly discovering more of the halting sequence.
Society is CLEARLY not ready for what AI's impact is going to be. We've been through change before, but never at this scale and speed. I think Musk/Vivek's DOGE thing is important; our government has gotten quite large and bureaucratic. But the clock has started on AI, and this is a social structural issue we've gotta figure out. Putting it off means we probably become subjects of a default set of rulers, if not of the shoggoth itself.
But I am confident the answer to the question in the headline is "no, not for several decades." It's not just the underwhelming benchmark results discussed in the post, or the general concern about hard undergraduate math using different skillsets than ordinary research math. IMO the deeper problem still seems to be a basic gap where LLMs can seemingly do formal math at the level of a smart graduate student but fail at quantitative/geometric reasoning problems designed for fish. I suspect this holds for O3, based on one of the ARC problems it wasn't able to solve: https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr... (via https://www.interconnects.ai/p/openais-o3-the-2024-finale-of...) ANNs are simply not able to form abstractions, they can only imitate them via enormous amounts of data and compute. I would say there has been zero progress on "common sense" math in computers since the invention of Lisp: we are still faking it with expert systems, even if LLM expert systems are easier to build at scale with raw data.
It is the same old problem where an ANN can attain superhuman performance on level 1 of Breakout, but it has to be retrained for level 2. I am not convinced it makes sense to say AI can do math if AI doesn't understand what "four" means with the same depth as a rat, even if it can solve sophisticated modular arithmetic problems. In human terms, does it make sense to say a straightedge-and-compass AI understands Euclidean geometry if it's not capable of understanding the physical intuition behind Euclid's axioms? It makes more sense to say it's a brainless tool that helps with the tedium and drudgery of actually proving things in mathematics.
The database stopped being secret when it was fed to proprietary LLMs running in the cloud. If anyone is not thinking that OpenAI has trained and tuned O3 on the "secret" problems people fed to GPT-4o, I have a bridge to sell you.
I understood the statements of all five questions. I could do the third one relatively quickly (I had seen the trick before: the function mapping a natural n to alpha^n is p-adically continuous in n iff the p-adic valuation of alpha-1 is positive).
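Spelled out (my paraphrase of the standard fact, not the commenter's exact formulation):

```latex
% Standard fact (my phrasing): for a p-adic unit \alpha, the map
% n \mapsto \alpha^n on the natural numbers is continuous for the p-adic
% topology on n if and only if v_p(\alpha - 1) > 0, essentially because
\alpha^{n} \;=\; \bigl(1 + (\alpha - 1)\bigr)^{n}
          \;=\; \sum_{k \ge 0} \binom{n}{k}\,(\alpha - 1)^{k},
% and when v_p(\alpha - 1) > 0 the terms are p-adically small and depend
% continuously (polynomially) on n, so n \mapsto \alpha^n extends to a
% continuous function on \mathbb{Z}_p.
```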
Seems to me the answer to 'Can AI do maths yet?' depends on what you call AI and what you call maths. Our old departmental VAX running at a handful of megahertz could do some very clever symbol manipulation on binomials and, if you gave it a few seconds, it could even do something like theorem proving via proto-prolog. Neither is anywhere close to the glorious GAI future we hope to sell to industry and government, but it seems worth considering how they're different, why they worked, and whether there's room for some hybrid approach. Do LLMs need to know how to do math if they know how to write Prolog or Coq statements that can do interesting things?
I've heard people say they want to build software that emulates (simulates?) how humans do arithmetic, but ask a human to add anything bigger than two digit numbers and the first thing they do is reach for a calculator.
So the higher the cost, the better the performance. While models and hardware can be improved, the curve is still steep.
The big question is what people are using it for. Well, they are using lightweight, simplistic models to do targeted tasks: many smaller, easier-to-process tasks.
Most of the news on AI is just there to promote a product to earn more cash.
If that's referring to Large Language Models, meaning everything after the first GPT and BERT, then that's absolutely not right. The first LLM that demonstrated the ability to generate coherent, fluently grammatical English was GPT-2. That story about the unicorns: that was the first time a statistical language model was able to generate text that stayed on the subject over a long distance and made (some) sense.
GPT-2 was followed by GPT 3 and GPT 3.5 that turned the hype dial up to 11 and were certainly "public" at least if that means publicly available. They were coherent enough that many people predicted all sorts of fancy things, like the end of programming jobs and the end of journalist jobs and so on.
So, weird statement that one and it kind of makes me wary of Gell-Mann amnesia while reading the article.
Now, all of that will be done by AI.
Reminds me of the time when I finally enabled invincibility in Goldeneye 007. Rather boring.
I think we've stopped appreciating the human struggle and experience and have placed all the value on the end product, and that's why we're developing AI so much.
Yeah, there is the possibility of working with an AI but at that point, what is the point? Seems rather pointless to me in an art like mathematics.
Money cannot solve the issues faced by the industry, which mainly revolve around a lack of training data.
They have already used the entirety of the internet and all available video, audio, and books, and they are now dealing with the fact that most content online is now generated by these models, making it useless as training data.
The answer is yes, it can utilize a stateful python environment and solve complex mathematical equations with ease.
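"Utilize a stateful Python environment" presumably means delegating the symbolic work to a tool call. A minimal sketch of that kind of computation, assuming sympy is available in the environment (the particular equation and integral are just illustrative):

```python
# The kind of computation a model can offload to a Python tool call
# instead of doing it "in its head". Assumes sympy is installed.
import sympy as sp

x = sp.symbols('x')

# Solve a quartic equation exactly.
roots = sp.solve(sp.Eq(x**4 - 5*x**2 + 4, 0), x)
print(roots)   # [-2, -1, 1, 2]

# Evaluate a definite integral symbolically.
val = sp.integrate(sp.exp(-x**2), (x, -sp.oo, sp.oo))
print(val)     # sqrt(pi)
```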
I really really wish AI would make some breakthrough and be really useful, but I am so skeptical and negative about it.
But maths is also a fun and fulfilling activity. Very often, when we learn a math theory, it's because we want to understand and gain intuition about the concepts, or we want to solve a puzzle (for which we could already look up the solution). Maybe it's similar to chess: we didn't develop chess engines to replace human players and make them play each other, but they helped us become better chess players and understand the game better.
So the recent progress is impressive, but I still don't see how we'll use this tech practically, what impact it can have, and in which fields.