Why Anthropic's Claude still hasn't beaten Pokémon
Anthropic's Claude 3.7 Sonnet shows improved reasoning in Pokémon but struggles with gameplay mechanics, low-resolution graphics, and memory retention, highlighting both AI advancements and ongoing challenges in achieving human-level intelligence.
Anthropic's AI model, Claude 3.7 Sonnet, has made strides in playing Pokémon, showcasing improved reasoning capabilities compared to its predecessors. Despite this progress, Claude still struggles with basic gameplay mechanics, often getting stuck or revisiting completed areas, which highlights its limitations in navigating the game effectively. The model operates without specific training for Pokémon, relying instead on its general understanding of the world. While it performs well in text-based interactions, such as understanding battle mechanics, it falters in interpreting the low-resolution graphics of the game. Claude's memory limitations also hinder its ability to retain learned information, leading to repeated mistakes. Observations of Claude's gameplay reveal that while it can develop coherent strategies at times, it often fails to adapt effectively, demonstrating the challenges AI faces in achieving human-level intelligence. The ongoing experiment serves as a reflection of the current state of AI research, emphasizing both the advancements made and the significant hurdles that remain.
- Claude 3.7 Sonnet shows improved reasoning but struggles with basic gameplay in Pokémon.
- The model operates without specific training for the game, relying on general knowledge.
- It excels in text-based interactions but has difficulty interpreting low-resolution graphics.
- Memory limitations lead to repeated mistakes and challenges in retaining learned information.
- The experiment highlights both advancements in AI and the significant challenges that remain.
Related
Claude 3.5 Sonnet
Claude 3.5 Sonnet, the latest in the model family, excels in customer support, coding, and humor comprehension. It introduces Artifacts on Claude.ai for real-time interactions, prioritizing safety and privacy. Future plans include Claude 3.5 Haiku and Opus, emphasizing user feedback for continuous improvement.
Claude Computer Use – Is Vision the Ultimate API?
The article reviews Anthropic's Claude Computer Use, noting its strengths in screen reading and navigation but highlighting challenges in recognizing when to read the screen and in managing application states, requiring further advancements.
OpenAI Researchers Find That AI Is Unable to Solve Most Coding Problems
OpenAI's research indicates that advanced AI models struggle with coding tasks, failing to identify deeper bugs and producing mostly incorrect solutions, highlighting their unreliability compared to human coders.
Claude 3.7 Sonnet and Claude Code
Anthropic released Claude 3.7 Sonnet, a hybrid reasoning model improving coding tasks and web development. It includes Claude Code for automation, maintains previous pricing, and enhances safety with fewer refusals.
People are using Super Mario to benchmark AI now
Researchers at UC San Diego's Hao AI Lab are using Super Mario Bros. to benchmark AI performance, finding Anthropic's Claude 3.7 superior, while raising questions about gaming skills' relevance to real-world applications.
But these models already know all this information??? Surely it's ingested Bulbapedia, along with a hundred zillion terabytes of every other Pokemon resource on the internet, so why does it need to squirrel this information away? What's the point of ruining the internet with all this damn parasitic crawling if the models can't even recall basic facts like "thunderbolt is an electric-type move", "geodude is a rock-type pokemon", "electric-type moves are ineffective against rock-type pokemon"?
This suggests a fundamental issue beyond just navigation. While accessing more RAM data, or building external tools on top of that data, could improve consistency or get it further, that approach reduces the extent to which Claude is independently playing and reasoning.
A more effective solution would enhance its decision-making without relying on reading values from RAM or any kind of fine-tuning. I'm sure a better approach is possible.
But here's what strikes me as odd about the whole approach. Everyone knows it's easier for an LLM to write a program that counts the number of R's in "strawberry" than it is to count them directly, yet no one is leveraging this fact.
Instead of ever more elaborate prompts and contexts, why not ask Claude to write a program for mapping and pathfinding? Hell, if that's too much to ask, maybe build the tools beforehand and see if it's at least smart enough to use them effectively.
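As a rough illustration of what such a tool could look like, here's a minimal sketch in Python: a breadth-first search over a made-up boolean tile grid. The `walkable`, `start`, and `goal` inputs are assumptions for the example, not anything the actual Claude-plays-Pokémon harness exposes.

```python
from collections import deque

def find_path(walkable, start, goal):
    """Breadth-first search over a 2D tile grid.

    walkable: list of lists of booleans (True = passable tile).
    start, goal: (row, col) tuples.
    Returns the list of tiles from start to goal, or None if unreachable.
    """
    rows, cols = len(walkable), len(walkable[0])
    came_from = {start: None}
    frontier = deque([start])
    while frontier:
        current = frontier.popleft()
        if current == goal:
            # Walk back through predecessors to reconstruct the path.
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        r, c = current
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and walkable[nr][nc] and nxt not in came_from:
                came_from[nxt] = current
                frontier.append(nxt)
    return None

# Example: a 3x3 room with a single impassable tile in the middle.
grid = [[True, True, True],
        [True, False, True],
        [True, True, True]]
print(find_path(grid, (0, 0), (2, 2)))  # [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
```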
My personal wishlist is things like fact tables, goal graphs, and a world model - where things are and when things happened. Strategies and hints. All these things can be turned into formal systems. Something as simple as a battle calculator should be a no-brainer.
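A toy version of that battle calculator really is tiny. The sketch below hard-codes a few entries of the type chart - a deliberately incomplete slice, with everything else defaulting to neutral - just to show the shape of the tool:

```python
# Illustrative slice of the type chart; missing pairs default to neutral (1.0).
TYPE_CHART = {
    ("water", "rock"): 2.0,
    ("water", "ground"): 2.0,
    ("electric", "water"): 2.0,
    ("fire", "water"): 0.5,
}

def effectiveness(move_type, defender_types):
    """Multiply together the modifier for each of the defender's types."""
    mult = 1.0
    for t in defender_types:
        mult *= TYPE_CHART.get((move_type, t), 1.0)
    return mult

print(effectiveness("water", ["rock", "ground"]))  # 4.0 -- e.g. against a Geodude
print(effectiveness("fire", ["water"]))            # 0.5 -- not very effective
```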
My last harebrained idea - I would like to see LLMs managing an ontology in Prolog as a proxy for reasoning.
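Prolog itself aside, the flavor of the idea fits in a few lines of Python: a small fact base plus one forward-chaining rule that derives exactly the kind of conclusion the model keeps failing to recall on its own. The predicate names here are invented for the example.

```python
# Not Prolog, but the same flavor: facts plus a rule applied by forward chaining.
facts = {
    ("type", "geodude", "rock"),
    ("type", "geodude", "ground"),
    ("move_type", "thunderbolt", "electric"),
    ("immune", "ground", "electric"),  # ground-types take no electric damage
}

def useless_moves(facts):
    """Derive (useless, move, target) triples from the fact base."""
    derived = set()
    moves = [f for f in facts if f[0] == "move_type"]
    types = [f for f in facts if f[0] == "type"]
    for (_, move, mtype) in moves:
        for (_, target, ttype) in types:
            if ("immune", ttype, mtype) in facts:
                derived.add(("useless", move, target))
    return derived

print(useless_moves(facts))  # {('useless', 'thunderbolt', 'geodude')}
```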
This is all theory, and even if implemented it wouldn't solve everything, but I'm tired of watching them throw prompts at the wall in hopes that the LLM can be tricked into being smarter than it is.
There are dudes on YouTube who get millions of views doing basic reinforcement learning to train weird shapes to navigate obstacle courses, win races, learn to walk, etc. But they do this by making a custom model with inputs and outputs that are directly mapped into the “physical world” in which these creatures live.
Until these LLMs have input and output parameters that specifically wire into “front distance sensor or leg pressure sensor” and “leg muscles or engine speed”, they are never going to be truly good at tasks requiring such interaction.
Any such attempt that lacks such inputs and outputs and somehow manages to have passable results will be in spite of the model not because of it. They’ll always get their ass kicked by specialized models trained for such tasks on every dimension including efficiency, power, memory use, compute and size.
And that is the thing: despite their incredible capabilities, LLMs are not AGI and they are not general purpose models either! And they never will be. And that is just fine.
> We built the text side of it first, and the text side is definitely... more powerful. How these models can reason about images is getting better, but I think it's a decent bit behind.
This seems to be the main issue: using an AI model predominantly trained on text-based reasoning to play through a graphical video game challenge. In this context, the model's image processing is an underdeveloped skill compared to its text-based reasoning. Even though it spent an excessive amount of time navigating Mt. Moon and getting trapped in small areas of the map, Claude will likely only get better at playing Pokemon and other games as it's trained on more image-related tasks and its capabilities balance out.
It's an amazing search engine, and has really cool suggestions/ideas. It's certainly very useful. But replace a human? Get real. All these stocks are going to tank once the media starts running honest articles instead of PR fluff.
Of course LLMs can't play Pokemon long term. Used this way, they're the AI equivalent of someone who has the beginnings of dementia, who knows it, and is compensating by metaphorically sitting in a corner rocking to themselves repeating things so they don't forget them, because all they can remember is what they said.
It's amazing what they can do. I'm not denying anything they have actually done, because things that have concretely been done are done.
But the AIs that will actually fulfill the promises of this generation of AI are the ones yet to come, that incorporate some sort of memory-analog that isn't just "I repeated something to myself inside my token window", and some sort of symbolic manipulation capability.
In both cases, I'm using "some sort of" very broadly. I don't know what that looks like any more than anyone else. For instance, to obtain "symbolic manipulation capability" I don't necessarily mean hooking up a symbolic manipulation package. Humans can manipulate symbols but we clearly do it in a somewhat inefficient and often incorrect way with our neural nets. Even so, we get a great deal of benefit from it. We don't have integrated symbolic manipulation packages, and using the ones that exist is actually a rather rare and difficult skill. But what LLMs do is really quite different from either what humans or software packages do.
However, I think it's pretty clear that LLMs aren't really going to get much farther than they have now in terms of raw capability. People will continue to find clever ways to use them, of course, but the raw capabilities of LLMs qua LLMs are probably pretty close to tapped out.
I expect in the future that what we today call "LLMs" will become a component of a system that parses text into vectors and then feeds those vectors into what we consider the "real" AI as those vectors, and the AI will in turn emit vectors back into the LLM module that will be converted to human speech. And I rather suspect those LLMs, since they won't be the thing trying to do the "real work", will actually be quite a bit smaller than they are today, with the bulk of the computational power being allocated to the "real" AI. We won't be hypertrophying the LLM layer to try to provide a poor and tiny memory and maintain the state of whatever is happening because the "real" AI layer will be doing that. The future will probably laugh at us sitting here trying to make the language center bigger and bigger and bigger when obviously the correct answer was to... do whatever it is they did in their past but our future.
My guess is that image understanding will improve.
I am baffled that anyone is still buying this complete nonsense. Like, come on.