February 2nd, 2025

I compared my daughter against SOTA models on math puzzles

Michał Prządka's experiment revealed that his daughter Anika solved 7 of 8 math puzzles, while AI models varied in performance, highlighting both AI's strengths and limitations in reasoning tasks.

Michał Prządka compared the mathematical reasoning of advanced AI models with that of his 11-year-old daughter, Anika, using puzzles from the GMIL competition. Anika solved 7 of the 8 puzzles correctly. Some models performed comparably to her, while others struggled significantly: GPT-o3-mini achieved a perfect score, and DeepSeek-R1 also performed well but relied on brute-force methods, whereas Sonnet-3.5 and Gemini-Flash had lower success rates. The experiment suggests that AI can match human performance on some reasoning tasks but still has limitations. Prządka expressed pride in his daughter's skills, noting that they are on par with advanced AI models, and he plans to extend the analysis to more challenging puzzles.

- Anika scored 7 out of 8 correct in the math reasoning benchmark.

- GPT-o3-mini achieved a perfect score, while other models had varying success rates.

- The experiment suggests that AI can match human performance on some mathematical reasoning tasks but still has limitations.

- Prządka plans to explore more difficult puzzles to further assess AI capabilities.

5 comments
By @przadka - 2 months
I tested o3, r1, 4o and other SOTA models against puzzles from an international math competition and compared their performance with my 11-year-old daughter's solutions. Full results include detailed conversations with each model and complete methodology.
By @ukituki - 2 months
Interesting how the reasoning differs between models, e.g. DeepSeek trying brute-force tricks.
By @michalwarda - 2 months
Very cool post! I wonder how much it will affect the psychology of the next generations.