November 13th, 2024

New secret math benchmark stumps AI models and PhDs alike

Epoch AI has launched FrontierMath, a challenging benchmark for AI models and mathematicians alike; leading models solve under 2% of its expert-level problems, underscoring the limits of current AI and the continued need for human expertise.


Epoch AI has introduced FrontierMath, a new mathematics benchmark that poses significant challenges for AI models and expert mathematicians alike. The benchmark consists of hundreds of expert-level problems that leading AI models, including GPT-4o and Claude 3.5 Sonnet, reportedly solve less than 2% of the time. This stands in stark contrast to their performance on simpler math benchmarks, where many models score above 90%, and highlights the limits of current AI capabilities. FrontierMath's problems are kept unpublished to prevent AI companies from training their models on them, ensuring a more accurate assessment of the models' problem-solving abilities. Developed in collaboration with over 60 mathematicians, the problems span a range of mathematical disciplines and have undergone peer review for correctness. Notably, the problems demand complex answers that are difficult to guess, making them particularly challenging for AI systems. Mathematicians such as Terence Tao have acknowledged the difficulty of these problems, suggesting that solving them may require a combination of human expertise and AI assistance. Epoch AI plans to conduct regular evaluations of AI models against the benchmark and will release additional sample problems in the future to further aid research in this area.

- FrontierMath is a new benchmark that challenges AI models and mathematicians with expert-level problems.

- Leading AI models solve less than 2% of FrontierMath problems, contrasting with their high performance on simpler benchmarks.

- The benchmark's problems are unpublished to prevent AI training on them, ensuring a fair assessment.

- Developed with input from over 60 mathematicians, the problems span multiple disciplines and require complex solutions.

- Epoch AI plans to regularly evaluate AI models and expand the problem set in the future.
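Because the problems are withheld, regular evaluations of this kind only work if answers can be checked automatically. Below is a minimal sketch of what such a scoring harness might look like, assuming answers are machine-checkable expressions; the toy problem set, the `ask_model` callable, and the SymPy-based comparison are illustrative assumptions, not Epoch AI's actual pipeline.

```python
import sympy as sp

# Hypothetical problem set: each entry pairs a prompt with a reference answer
# written as a SymPy-parsable expression. These are illustrative placeholders,
# not actual FrontierMath problems (those remain unpublished).
problems = [
    {"prompt": "Compute the sum of the first 100 positive integers.", "answer": "5050"},
    {"prompt": "Evaluate the integral of x**2 from 0 to 1.", "answer": "1/3"},
]

def is_correct(model_output: str, reference: str) -> bool:
    """Score by symbolic equivalence rather than string match, so
    algebraically equal forms such as 2/6 and 1/3 both count."""
    try:
        return sp.simplify(sp.sympify(model_output) - sp.sympify(reference)) == 0
    except (sp.SympifyError, TypeError):
        return False

def evaluate(ask_model, problems) -> float:
    """ask_model is any callable mapping a prompt string to an answer string
    (e.g. a wrapper around an LLM API). Returns the fraction solved."""
    solved = sum(is_correct(ask_model(p["prompt"]), p["answer"]) for p in problems)
    return solved / len(problems)

if __name__ == "__main__":
    # Stand-in "model" that always answers 5050, for demonstration only.
    accuracy = evaluate(lambda prompt: "5050", problems)
    print(f"Accuracy: {accuracy:.1%}")  # 50.0% on the toy set above
```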

3 comments
By @bell-cot - 3 months
99.9% of the time, "secret" in a headline is cheap counter-factual clickbait.

Here it is both literally true, and critical to the usefulness of the "benchmark". Which is a math test. A math test which the AIs do not get to cheat on by having seen it in their training sets. A math test which the top AIs perform extremely poorly on.

Added: For the top AIs, "stumped" means that they could not solve even 2% of the problems. Vs. for the PhDs, "stumped" means that they needed hours or days to solve the problems. Very different outcomes.

By @jqpabc123 - 3 months
This benchmark is designed to show that current AI lacks reasoning ability and doesn't really "solve" anything.

Retrieving a solution that happens to exist in a large database is not the same as "solving" the problem --- though it can appear otherwise to the ill-informed.