February 2nd, 2025

I compared my daughter against SOTA models on math puzzles

Michał Prządka's experiment revealed that his daughter Anika solved 7 of 8 math puzzles, while AI models varied in performance, highlighting both AI's strengths and limitations in reasoning tasks.

Michał Prządka compared the mathematical reasoning of several state-of-the-art AI models with that of his 11-year-old daughter, Anika, using puzzles from the GMIL competition. Anika solved 7 of the 8 puzzles correctly. Among the models, o3-mini achieved a perfect score, and DeepSeek-R1 also performed well, although it relied on brute-force methods. Sonnet-3.5 and Gemini-Flash had lower success rates. The experiment suggests that AI can match human performance on some reasoning tasks but still has limitations. Prządka expressed pride that his daughter's skills are on par with advanced AI models, and he plans to extend the analysis to more challenging puzzles.

- Anika scored 7 out of 8 correct in the math reasoning benchmark.

- o3-mini achieved a perfect score, while other models had varying success rates.

- The experiment highlights the strengths and weaknesses of AI in mathematical reasoning.

- Prządka plans to explore more difficult puzzles to further assess AI capabilities.

- The results indicate that AI can match human performance in certain tasks but still has limitations.

Google DeepMind's AI systems can now solve complex math problems

Google DeepMind's AI systems, AlphaProof and AlphaGeometry 2, solved four of six problems from the International Mathematical Olympiad, achieving a silver medal and marking a significant advancement in AI mathematics capabilities.

Apple researchers ran an AI test that exposed a fundamental 'intelligence' flaw

Apple researchers found that many AI models struggle with basic arithmetic when irrelevant data is included, highlighting a lack of genuine logical reasoning and cautioning against overestimating AI's intelligence.

New secret math benchmark stumps AI models and PhDs alike

Epoch AI has launched FrontierMath, a challenging benchmark for AI and mathematicians, with leading models solving under 2% of its expert-level problems, highlighting current AI limitations and requiring human expertise.

Can AI do maths yet? Thoughts from a mathematician

OpenAI's o3 scored 25% on the FrontierMath dataset, indicating progress in AI's mathematical capabilities, but experts believe it still lacks the innovative thinking required for advanced mathematics.

Scale AI Unveil Results of Humanity's Last Exam, a Groundbreaking New Benchmark

Scale AI and CAIS introduced "Humanity’s Last Exam" to assess AI reasoning, revealing current models answered under 10% of expert questions correctly. The dataset will support further research on AI limitations.

5 comments

By @przadka - 2 months

I tested o3, r1, 4o and other SOTA models against puzzles from an international math competition and compared their performance with my 11-year-old daughter's solutions. Full results include detailed conversations with each model and complete methodology.

By @ukituki - 2 months

Interesting how the reasoning differs between models, e.g. DeepSeek trying the brute force tricks

By @michalwarda - 2 months

Very cool post! I wonder how much it will affect the psychology of the next generations.
