People are using Super Mario to benchmark AI now
Researchers at UC San Diego's Hao AI Lab are using Super Mario Bros. to benchmark AI performance, finding Anthropic's Claude 3.7 the strongest performer while raising questions about how relevant gaming skill is to real-world applications.
Researchers at the University of California San Diego's Hao AI Lab have begun using Super Mario Bros. as a benchmark for evaluating AI performance, arguing it poses a harder challenge than earlier game benchmarks such as Pokémon. In their tests, Anthropic's Claude 3.7 outperformed other models, including Claude 3.5, while Google's Gemini 1.5 Pro and OpenAI's GPT-4o struggled.

The game ran in an emulator under a framework called GamingAgent, which gives the models basic instructions plus in-game screenshots and lets them control Mario by emitting Python code. The models had to develop complex strategies and maneuvers, and reasoning models such as OpenAI's o1 performed worse than non-reasoning models because of their slower decision-making. The researchers noted that timing is crucial in a real-time game like Super Mario, where even small delays can mean failure.

The approach has sparked debate among experts about how relevant gaming skill is to real-world AI capabilities, with some expressing uncertainty about today's metrics for evaluating AI performance. The continued turn to games underscores what some describe as an "evaluation crisis" in the field, as researchers try to pin down what these models can actually do.
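To make the setup concrete, here is a minimal sketch of what a screenshot-to-action agent loop of this kind might look like. It is not GamingAgent's actual API: the `emulator` and `model` objects and their methods (`capture_frame`, `press`, `is_game_over`, `generate`) are hypothetical stand-ins for an emulator wrapper and a vision-capable model client.

```python
import time

def run_agent(emulator, model, max_steps=1000):
    """Feed screenshots to the model and execute the action code it returns.

    `emulator` and `model` are assumed interfaces, not real library objects.
    """
    for _ in range(max_steps):
        screenshot = emulator.capture_frame()  # pixels of the current frame
        prompt = (
            "You control Mario. Given this screenshot, reply with Python "
            "calls such as press('right') or press('A') to advance."
        )
        # The game keeps running in real time while this call blocks,
        # which is why slow "reasoning" models tend to fall behind.
        action_code = model.generate(prompt, image=screenshot)
        try:
            # Run the model's code against a small whitelist of controls.
            exec(action_code, {"press": emulator.press, "wait": time.sleep})
        except Exception as err:
            print(f"Skipping malformed action: {err}")
        if emulator.is_game_over():
            break
```

The blocking model call in the middle of the loop illustrates the latency point the researchers make: every second spent deliberating is a second in which Mario keeps moving (or falling), so faster, non-reasoning models can come out ahead.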
- Super Mario Bros. is being used as a new benchmark for AI performance.
- Anthropic's Claude 3.7 was the top performer in the tests conducted by Hao AI Lab.
- Reasoning models struggled compared to non-reasoning models due to slower decision-making.
- The use of games for AI benchmarking raises questions about the relevance of these skills to real-world applications.
- There is an ongoing debate about the effectiveness of current metrics for evaluating AI capabilities.
Related
OpenAI's Lead over Other AI Companies Has Largely Vanished
OpenAI's competitive edge in AI has diminished as models like Anthropic's Claude 3.5 and Google's Gemini 1.5 match or surpass GPT-4o, while inference costs decline significantly due to competition.
Scale AI Unveils Results of Humanity's Last Exam, a Groundbreaking New Benchmark
Scale AI and CAIS introduced "Humanity’s Last Exam" to assess AI reasoning, revealing current models answered under 10% of expert questions correctly. The dataset will support further research on AI limitations.
I compared my daughter against SOTA models on math puzzles
Michał Prządka's experiment revealed that his daughter Anika solved 7 of 8 math puzzles, while AI models varied in performance, highlighting both AI's strengths and limitations in reasoning tasks.
Older AI models show signs of cognitive decline, study shows
A study in the BMJ reveals older AI models show cognitive decline, raising concerns for medical diagnostics. Critics argue human cognitive tests are unsuitable for evaluating AI performance.
OpenAI Researchers Find That AI Is Unable to Solve Most Coding Problems
OpenAI's research indicates that advanced AI models struggle with coding tasks, failing to identify deeper bugs and producing mostly incorrect solutions, highlighting their unreliability compared to human coders.