People are using Super Mario to benchmark AI now
Researchers at UC San Diego's Hao AI Lab are using Super Mario Bros. to benchmark AI performance, finding Anthropic's Claude 3.7 the strongest performer while raising questions about how relevant gaming skill is to real-world applications.
Researchers at the University of California San Diego's Hao AI Lab have begun using Super Mario Bros. as a benchmark for evaluating AI performance, arguing it poses a harder challenge than earlier game benchmarks such as Pokémon. In their tests, Anthropic's Claude 3.7 outperformed other models, including Claude 3.5, while Google's Gemini 1.5 Pro and OpenAI's GPT-4o struggled.

The game ran in an emulator under a framework called GamingAgent, which gives the models basic instructions plus in-game screenshots and lets them control Mario by emitting Python code. The models had to develop complex strategies and maneuvers, and reasoning models such as OpenAI's o1 performed worse than non-reasoning models because of their slower decision-making. The researchers noted that timing is crucial in a real-time game like Super Mario, where even small delays can mean failure.

The approach has sparked debate among experts about how relevant gaming skill is to real-world AI capabilities, with some expressing uncertainty about today's metrics for evaluating AI performance. The continued turn to games underscores what some describe as an "evaluation crisis" in the field, as researchers try to pin down what these models can actually do.
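To make the setup concrete, here is a minimal sketch of what a screenshot-to-action agent loop of this kind might look like. It is not GamingAgent's actual API: the `emulator` and `model` objects and their methods (`capture_frame`, `press`, `is_game_over`, `generate`) are hypothetical stand-ins for an emulator wrapper and a vision-capable model client.

```python
import time

def run_agent(emulator, model, max_steps=1000):
    """Feed screenshots to the model and execute the action code it returns.

    `emulator` and `model` are assumed interfaces, not real library objects.
    """
    for _ in range(max_steps):
        screenshot = emulator.capture_frame()  # pixels of the current frame
        prompt = (
            "You control Mario. Given this screenshot, reply with Python "
            "calls such as press('right') or press('A') to advance."
        )
        # The game keeps running in real time while this call blocks,
        # which is why slow "reasoning" models tend to fall behind.
        action_code = model.generate(prompt, image=screenshot)
        try:
            # Run the model's code against a small whitelist of controls.
            exec(action_code, {"press": emulator.press, "wait": time.sleep})
        except Exception as err:
            print(f"Skipping malformed action: {err}")
        if emulator.is_game_over():
            break
```

The blocking model call in the middle of the loop illustrates the latency point the researchers make: every second spent deliberating is a second in which Mario keeps moving (or falling), so faster, non-reasoning models can come out ahead.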
- Super Mario Bros. is being used as a new benchmark for AI performance.
- Anthropic's Claude 3.7 was the top performer in the tests conducted by Hao AI Lab.
- Reasoning models struggled compared to non-reasoning models due to slower decision-making.
- The use of games for AI benchmarking raises questions about the relevance of these skills to real-world applications.
- There is an ongoing debate about the effectiveness of current metrics for evaluating AI capabilities.
Related
OpenAI's Lead over Other AI Companies Has Largely Vanished
OpenAI's competitive edge in AI has diminished as models like Anthropic's Claude 3.5 and Google's Gemini 1.5 match or surpass GPT-4o, while inference costs decline significantly due to competition.
Scale AI Unveils Results of Humanity's Last Exam, a Groundbreaking New Benchmark
Scale AI and CAIS introduced "Humanity’s Last Exam" to assess AI reasoning, revealing current models answered under 10% of expert questions correctly. The dataset will support further research on AI limitations.
I compared my daughter against SOTA models on math puzzles
Michał Prządka's experiment revealed that his daughter Anika solved 7 of 8 math puzzles, while AI models varied in performance, highlighting both AI's strengths and limitations in reasoning tasks.
Older AI models show signs of cognitive decline, study shows
A study in the BMJ reveals older AI models show cognitive decline, raising concerns for medical diagnostics. Critics argue human cognitive tests are unsuitable for evaluating AI performance.
OpenAI Researchers Find That AI Is Unable to Solve Most Coding Problems
OpenAI's research indicates that advanced AI models struggle with coding tasks, failing to identify deeper bugs and producing mostly incorrect solutions, highlighting their unreliability compared to human coders.