March 4th, 2025

People are using Super Mario to benchmark AI now

Researchers at UC San Diego's Hao AI Lab are using Super Mario Bros. to benchmark AI performance. Anthropic's Claude 3.7 came out on top in their tests, but the approach raises questions about how relevant gaming skill is to real-world applications.

Researchers at the University of California San Diego's Hao AI Lab have begun using Super Mario Bros. as a benchmark for evaluating AI performance, arguing that it poses a greater challenge than earlier game-based benchmarks like Pokémon. In their tests, Anthropic's Claude 3.7 outperformed other models, including Claude 3.5, while Google's Gemini 1.5 Pro and OpenAI's GPT-4o struggled. The game ran in an emulator under a framework called GamingAgent, which fed the models basic instructions and in-game screenshots and had them control Mario by generating Python code.

The tests showed that the models had to work out nontrivial strategies and maneuvers, and that reasoning models like OpenAI's o1 performed worse than non-reasoning models because their slower decision-making left them reacting to stale game states. The researchers note that timing is crucial in real-time games like Super Mario, where even small delays can lead to failure. The benchmarking approach has sparked debate among experts about how much gaming skill says about real-world AI capabilities, with some expressing uncertainty about current metrics for evaluating AI performance more broadly. The ongoing exploration of AI in gaming highlights what some call an "evaluation crisis" in the field, as researchers seek to understand the true capabilities of these models.
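The article doesn't show GamingAgent's actual code, but a minimal sketch of a screenshot-in, code-out agent loop helps illustrate the setup and why slow "reasoning" responses hurt in a real-time game. Everything here is an assumption for illustration: the function names (`capture_screenshot`, `query_model`, `press_buttons`), the prompt, and the frame budget are hypothetical, not the framework's real API.

```python
import base64
import time

# Hypothetical stand-ins for the emulator and model APIs; GamingAgent's real
# interface is not documented in the article, so these are illustrative stubs.
def capture_screenshot() -> bytes:
    """Grab the current emulator frame as PNG bytes (stub)."""
    raise NotImplementedError

def query_model(prompt: str, image_b64: str) -> str:
    """Send instructions plus a screenshot to the model and return the
    Python control code it generates (stub)."""
    raise NotImplementedError

def press_buttons(code: str) -> None:
    """Apply the model-generated button presses to the emulator (stub)."""
    raise NotImplementedError

INSTRUCTIONS = (
    "You are playing Super Mario Bros. Return Python code that presses "
    "buttons to move Mario forward and avoid enemies."
)

# Illustrative deliberation budget: how long the agent can think before the
# on-screen situation has changed too much for its decision to matter.
FRAME_BUDGET_S = 0.5

def agent_step() -> None:
    start = time.monotonic()
    frame = base64.b64encode(capture_screenshot()).decode()
    action_code = query_model(INSTRUCTIONS, frame)
    latency = time.monotonic() - start
    # This is the failure mode described for reasoning models: by the time a
    # slow response arrives, the game state it was based on is already stale.
    if latency > FRAME_BUDGET_S:
        print(f"warning: model took {latency:.2f}s, decision may be stale")
    press_buttons(action_code)
```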

- Super Mario Bros. is being used as a new benchmark for AI performance.

- Anthropic's Claude 3.7 was the top performer in the tests conducted by Hao AI Lab.

- Reasoning models struggled compared to non-reasoning models due to slower decision-making.

- The use of games for AI benchmarking raises questions about the relevance of these skills to real-world applications.

- There is an ongoing debate about the effectiveness of current metrics for evaluating AI capabilities.
