Why Anthropic's Claude still hasn't beaten Pokémon
Anthropic's Claude 3.7 Sonnet shows improved reasoning in Pokémon but struggles with gameplay mechanics, low-resolution graphics, and memory retention, highlighting both AI advancements and ongoing challenges in achieving human-level intelligence.
Anthropic's AI model, Claude 3.7 Sonnet, has made strides in playing Pokémon, showcasing improved reasoning capabilities compared to its predecessors. Despite this progress, Claude still struggles with basic gameplay mechanics, often getting stuck or revisiting completed areas, which highlights its limitations in navigating the game effectively. The model operates without specific training for Pokémon, relying instead on its general understanding of the world. While it performs well in text-based interactions, such as understanding battle mechanics, it falters in interpreting the low-resolution graphics of the game. Claude's memory limitations also hinder its ability to retain learned information, leading to repeated mistakes. Observations of Claude's gameplay reveal that while it can develop coherent strategies at times, it often fails to adapt effectively, demonstrating the challenges AI faces in achieving human-level intelligence. The ongoing experiment serves as a reflection of the current state of AI research, emphasizing both the advancements made and the significant hurdles that remain.
- Claude 3.7 Sonnet shows improved reasoning but struggles with basic gameplay in Pokémon.
- The model operates without specific training for the game, relying on general knowledge.
- It excels in text-based interactions but has difficulty interpreting low-resolution graphics.
- Memory limitations lead to repeated mistakes and challenges in retaining learned information.
- The experiment highlights both advancements in AI and the significant challenges that remain.
Related
Claude 3.5 Sonnet
Claude 3.5 Sonnet, the latest in the model family, excels in customer support, coding, and humor comprehension. It introduces Artifacts on Claude.ai for real-time interactions, prioritizing safety and privacy. Future plans include Claude 3.5 Haiku and Opus, emphasizing user feedback for continuous improvement.
Claude Computer Use – Is Vision the Ultimate API?
The article reviews Anthropic's Claude Computer Use, noting its strengths in screen reading and navigation but highlighting challenges in recognizing when to read the screen and in managing application states, requiring further advancements.
OpenAI Researchers Find That AI Is Unable to Solve Most Coding Problems
OpenAI's research indicates that advanced AI models struggle with coding tasks, failing to identify deeper bugs and producing mostly incorrect solutions, highlighting their unreliability compared to human coders.
Claude 3.7 Sonnet and Claude Code
Anthropic released Claude 3.7 Sonnet, a hybrid reasoning model improving coding tasks and web development. It includes Claude Code for automation, maintains previous pricing, and enhances safety with fewer refusals.
People are using Super Mario to benchmark AI now
Researchers at UC San Diego's Hao AI Lab are using Super Mario Bros. to benchmark AI performance, finding Anthropic's Claude 3.7 superior, while raising questions about gaming skills' relevance to real-world applications.
But these models already know all this information??? Surely it's ingested Bulbapedia, along with a hundred zillion terabytes of every other Pokemon resource on the internet, so why does it need to squirrel this information away? What's the point of ruining the internet with all this damn parasitic crawling if the models can't even recall basic facts like "thunderbolt is an electric-type move", "geodude is a rock-type pokemon", "electric-type moves are ineffective against rock-type pokemon"?
This suggests a fundamental issue beyond just navigation. While accessing more RAM data, or building external tools on top of that data, could improve consistency or get it further, that approach reduces the extent to which Claude is independently playing and reasoning.
A more effective solution would enhance its decision-making without relying on reading values from RAM or any kind of fine-tuning. I'm sure a better approach is possible.
But here's what strikes me as odd about the whole approach. Everyone knows it's easier for an LLM to write a program that counts the number of R's in "strawberry" than it is to count them directly, yet no one is leveraging this fact.
Instead of ever more elaborate prompts and contexts, why not ask Claude to write a program for mapping and pathfinding? Hell, if that's too much to ask, maybe build the tools beforehand and see if it's at least smart enough to use them effectively.
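As a rough illustration of what such a tool could look like, here's a minimal sketch in Python: a breadth-first search over a made-up boolean tile grid. The `walkable`, `start`, and `goal` inputs are assumptions for the example, not anything the actual Claude-plays-Pokémon harness exposes.

```python
from collections import deque

def find_path(walkable, start, goal):
    """Breadth-first search over a 2D tile grid.

    walkable: list of lists of booleans (True = passable tile).
    start, goal: (row, col) tuples.
    Returns the list of tiles from start to goal, or None if unreachable.
    """
    rows, cols = len(walkable), len(walkable[0])
    came_from = {start: None}
    frontier = deque([start])
    while frontier:
        current = frontier.popleft()
        if current == goal:
            # Walk back through predecessors to reconstruct the path.
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        r, c = current
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and walkable[nr][nc] and nxt not in came_from:
                came_from[nxt] = current
                frontier.append(nxt)
    return None

# Example: a 3x3 room with a single impassable tile in the middle.
grid = [[True, True, True],
        [True, False, True],
        [True, True, True]]
print(find_path(grid, (0, 0), (2, 2)))  # [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
```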
My personal wishlist is things like fact tables, goal graphs, and a world model - where things are and when things happened. Strategies and hints. All these things can be turned into formal systems. Something as simple as a battle calculator should be a no-brainer.
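A toy version of that battle calculator really is tiny. The sketch below hard-codes a few entries of the type chart - a deliberately incomplete slice, with everything else defaulting to neutral - just to show the shape of the tool:

```python
# Illustrative slice of the type chart; missing pairs default to neutral (1.0).
TYPE_CHART = {
    ("water", "rock"): 2.0,
    ("water", "ground"): 2.0,
    ("electric", "water"): 2.0,
    ("fire", "water"): 0.5,
}

def effectiveness(move_type, defender_types):
    """Multiply together the modifier for each of the defender's types."""
    mult = 1.0
    for t in defender_types:
        mult *= TYPE_CHART.get((move_type, t), 1.0)
    return mult

print(effectiveness("water", ["rock", "ground"]))  # 4.0 -- e.g. against a Geodude
print(effectiveness("fire", ["water"]))            # 0.5 -- not very effective
```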
My last harebrained idea - I would like to see LLMs managing an ontology in Prolog as a proxy for reasoning.
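Prolog itself aside, the flavor of the idea fits in a few lines of Python: a small fact base plus one forward-chaining rule that derives exactly the kind of conclusion the model keeps failing to recall on its own. The predicate names here are invented for the example.

```python
# Not Prolog, but the same flavor: facts plus a rule applied by forward chaining.
facts = {
    ("type", "geodude", "rock"),
    ("type", "geodude", "ground"),
    ("move_type", "thunderbolt", "electric"),
    ("immune", "ground", "electric"),  # ground-types take no electric damage
}

def useless_moves(facts):
    """Derive (useless, move, target) triples from the fact base."""
    derived = set()
    moves = [f for f in facts if f[0] == "move_type"]
    types = [f for f in facts if f[0] == "type"]
    for (_, move, mtype) in moves:
        for (_, target, ttype) in types:
            if ("immune", ttype, mtype) in facts:
                derived.add(("useless", move, target))
    return derived

print(useless_moves(facts))  # {('useless', 'thunderbolt', 'geodude')}
```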
This is all theory, and even if implemented it wouldn't solve everything, but I'm tired of watching them throw prompts at the wall in hopes that the LLM can be tricked into being smarter than it is.
There are dudes on YouTube who get millions of views doing basic reinforcement learning to train weird shapes to navigate obstacle courses, win races, learn to walk, etc. But they do this by making a custom model with inputs and outputs that are directly mapped into the “physical world” in which these creatures live.
Until these LLMs have input and output parameters that specifically wire into “front distance sensor or leg pressure sensor” and “leg muscles or engine speed”, they are never going to be truly good at tasks requiring such interaction.
Any such attempt that lacks such inputs and outputs and somehow manages to have passable results will be in spite of the model not because of it. They’ll always get their ass kicked by specialized models trained for such tasks on every dimension including efficiency, power, memory use, compute and size.
And that is the thing: despite their incredible capabilities, LLMs are not AGI and they are not general purpose models either! And they never will be. And that is just fine.
> We built the text side of it first, and the text side is definitely... more powerful. How these models can reason about images is getting better, but I think it's a decent bit behind.
This seems to be the main issue: using an AI model predominantly trained on text-based reasoning to play through a graphical video game challenge. In this context, the model's image processing is an underdeveloped skill compared to its text-based reasoning. Even though it spent an excessive amount of time navigating Mt. Moon and getting trapped in small areas of the map, Claude will likely only get better at playing Pokemon and other games as it's trained on more image-related tasks and its capabilities balance out.
It's an amazing search engine, and has really cool suggestions/ideas. It's certainly very useful. But replace a human? Get real. All these stocks are going to tank once the media starts running honest articles instead of PR fluff.
Of course LLMs can't play Pokemon long term. Used this way, they're the AI equivalent of someone who has the beginnings of dementia, who knows it, and is compensating by metaphorically sitting in a corner rocking to themselves repeating things so they don't forget them, because all they can remember is what they said.
It's amazing what they can do. I'm not denying anything they have actually done, because things that have concretely been done are done.
But the AIs that will actually fulfill the promises of this generation of AI are the ones yet to come, that incorporate some sort of memory-analog that isn't just "I repeated something to myself inside my token window", and some sort of symbolic manipulation capability.
In both cases, I'm using "some sort of" very broadly. I don't know what that looks like any more than anyone else. For instance, to obtain "symbolic manipulation capability" I don't necessarily mean hooking up a symbolic manipulation package. Humans can manipulate symbols but we clearly do it in a somewhat inefficient and often incorrect way with our neural nets. Even so, we get a great deal of benefit from it. We don't have integrated symbolic manipulation packages, and using the ones that exist is actually a rather rare and difficult skill. But what LLMs do is really quite different from either what humans or software packages do.
However, I think it's pretty clear that LLMs aren't really going to get much farther than they have now in terms of raw capability. People will continue to find clever ways to use them, of course, but the raw capabilities of LLMs qua LLMs are probably pretty close to tapped out.
I expect in the future that what we today call "LLMs" will become a component of a system that parses text into vectors and then feeds those vectors into what we consider the "real" AI as those vectors, and the AI will in turn emit vectors back into the LLM module that will be converted to human speech. And I rather suspect those LLMs, since they won't be the thing trying to do the "real work", will actually be quite a bit smaller than they are today, with the bulk of the computational power being allocated to the "real" AI. We won't be hypertrophying the LLM layer to try to provide a poor and tiny memory and maintain the state of whatever is happening because the "real" AI layer will be doing that. The future will probably laugh at us sitting here trying to make the language center bigger and bigger and bigger when obviously the correct answer was to... do whatever it is they did in their past but our future.
My guess is that image understanding will improve.
I am baffled that anyone is still buying this complete nonsense. Like, come on.