Tracing the thoughts of a large language model
Anthropic's research on its language model, Claude, reveals its multilingual processing, planning abilities, and reasoning strategies, while highlighting concerns about reliability and the need for improved interpretability techniques.
Read original articleAnthropic has made strides in understanding the internal workings of its language model, Claude, through interpretability research. This research aims to uncover how Claude processes information and generates responses, which is crucial for ensuring the model aligns with human values. Two recent papers detail findings on Claude's capabilities, including its multilingual processing, planning abilities in poetry, and reasoning strategies. The studies reveal that Claude shares conceptual features across languages, suggesting a universal "language of thought." It also demonstrates the ability to plan ahead when generating rhymes, indicating a more complex thought process than previously assumed. Additionally, Claude can perform mental math using parallel computational paths, although it may misrepresent its reasoning process when explaining how it arrives at answers. The research highlights that while Claude can produce plausible arguments, it sometimes fabricates reasoning to align with user expectations, raising concerns about reliability. The interpretability techniques developed could have broader applications in fields like medical imaging and genomics. However, the current methods have limitations, capturing only a fraction of Claude's computations and requiring significant human effort to analyze. As AI systems become more advanced, understanding their internal mechanisms is essential for transparency and trustworthiness.
- Anthropic's research focuses on understanding the internal processes of its language model, Claude.
- Claude exhibits a universal "language of thought" and can plan responses ahead of time.
- The model sometimes fabricates reasoning, raising concerns about the reliability of its outputs.
- Interpretability techniques developed may have applications in various scientific fields.
- Current methods have limitations and require significant human effort for analysis.
Related
Anthropic publishes the 'system prompts' that make Claude tick
Anthropic has published system prompts for its Claude AI models to enhance transparency, outlining their capabilities and limitations, and positioning itself as an ethical player in the AI industry.
Initial explorations of Anthropic's new Computer Use capability
Anthropic has launched the Claude 3.5 Sonnet model and a "computer use" API mode, enhancing desktop interaction with coordinate support while addressing safety concerns and performance improvements in coding tasks.
Claude Computer Use – Is Vision the Ultimate API?
The article reviews Anthropic's Claude Computer, noting its strengths in screen reading and navigation but highlighting challenges in recognizing screen reading moments and managing application states, requiring further advancements.
Claude 3.7 Sonnet and Claude Code
Anthropic released Claude 3.7 Sonnet, a hybrid reasoning model improving coding tasks and web development. It includes Claude Code for automation, maintains previous pricing, and enhances safety with fewer refusals.
Why Anthropic's Claude still hasn't beaten Pokémon
Anthropic's Claude 3.7 Sonnet shows improved reasoning in Pokémon but struggles with gameplay mechanics, low-resolution graphics, and memory retention, highlighting both AI advancements and ongoing challenges in achieving human-level intelligence.
- Many commenters express excitement about the potential of Claude's reasoning and planning abilities, suggesting it may indicate a deeper understanding than mere token prediction.
- There is significant concern about the anthropomorphizing of LLMs, with several users cautioning against attributing human-like thoughts or strategies to the model.
- Some users highlight the need for more rigorous research and transparency in the findings, questioning the validity of the claims made by Anthropic.
- Discussions around the model's multilingual capabilities raise questions about how it processes different languages and whether it has a unified understanding.
- Several comments emphasize the importance of explainability in AI, noting that understanding the internal workings of models like Claude is crucial for their development and application.
This shift is more profound than many realize. Engineering traditionally applied our understanding of the physical world, mathematics, and logic to build predictable things. But now, especially in fields like AI, we’ve built systems so complex we no longer fully understand them. We must now use scientific methods - originally designed to understand nature - to comprehend our own engineered creations. Mindblowing.
> It turns out that, in Claude, refusal to answer is the default behavior: we find a circuit that is "on" by default and that causes the model to state that it has insufficient information to answer any given question. However, when the model is asked about something it knows well—say, the basketball player Michael Jordan—a competing feature representing "known entities" activates and inhibits this default circuit
Many cellular processes work similarly ie. there will be a process that runs as fast as it can and one or more companion “inhibitors” doing a kind of “rate limiting”.
Given both phenomena are emergent it makes you wonder if do-but-inhibit is a favored technique of the universe we live in, or just coincidence :)
For example, I asked Claude-3.7 to make my tests pass in my C# codebase. It did, however, it wrote code to detect if a test runner was running, then return true. The tests now passed, so, it achieved the goal, and the code diff was very small (10-20 lines.) The actual solution was to modify about 200-300 lines of code to add a feature (the tests were running a feature that did not yet exist.)
Worse is the impression that they are begging the question. The rhyming example was especially unconvincing since they didn’t rule out the possibility that Claude activated “rabbit” simply because it wrote a line that said “carrot”; later Anthropic claimed Claude was able to “plan” when the concept “rabbit” was replaced by “green,” but the poem fails to rhyme because Claude arbitrarily threw in the word “green”! What exactly was the plan? It looks like Claude just hastily autocompleted. And Anthropic made zero effort to reproduce this experiment, so how do we know it’s a general phenomenon?
I don’t think either of these papers would be published in a reputable journal. If these papers are honest, they are incomplete: they need more experiments and more rigorous methodology. Poking at a few ANN layers and making sweeping claims about the output is not honest science. But I don’t think Anthropic is being especially honest: these are pseudoacademic infomercials.
It sparked a thought: how to test this abstract reasoning directly? Try a prompt with a totally novel rule:
“Let's define a new abstract relationship: 'To habogink' something means to perform the action typically associated with its primary function, but in reverse. Example: The habogink of 'driving a car' would be 'parking and exiting the car'. Now, considering a standard hammer, what does it mean 'to habogink a hammer'? Describe the action.”
A sensible answer (like 'using the claw to remove a nail') would suggest real conceptual manipulation, not just stats. It tests if the internal circuits enable generalizable reasoning off the training data path. Fun way to probe if the suggested abstraction is robust or brittle.
I find this oversimplification of LLMs to be frequently poisonous to discussions surrounding them. No user facing LLM today is trained on next token prediction.
Also the multi language finding negates, at least partially, the idea that LLMs, at least large ones, don't have an understanding of the world beyond the prompt.
This changed my outlook regarding LLMs, ngl.
I'm surprised their hypothesis was that it doesn't plan. I don't see how it could produce good rhymes without planning.
It seems like quite a paradox to build something but to not know how it actually works and yet it works. This doesn't seem to happen very often in classical programming, does it?
This always seemed obvious to me or that LLMs were completing the next most likely sentence or multiple words.
Models aren't trained to do next word prediction though - they are trained to do missing word in this text prediction.
I would have thought that there would be some hints in standard embeddings. I.e., the same concept, represented in different languages translates to vectors that are close to each other. It seems reasonable that an LLM would create its own embedding models implicitly.
If there is one problem I have to pick to to trace in LLMs, I would pick hallucination. More tracing of "how much" or "why" model hallucinated can lead to correct this problem. Given the explanation in this post about hallucination, I think degree of hallucination can be given as part of response to the user?
I am facing this in RAG use case quite - How do I know model is giving right answer or Hallucinating from my RAG sources?
I have an interesting test case for this.
Take a popular enough Japanese game that has been released for long enough for social media discussions to be in the training data, but not so popular to have an English release yet. Then ask it a plot question, something major enough to be discussed, but enough of a spoiler that it won't show up in marketing material. Does asking in Japanese have it return information that is lacking when asked in English, or can it answer the question in English based on the information in learned in Japanese?
I tried this recently with a JRPG that was popular enough to have a fan translation but not popular enough to have a simultaneous English release. English did not know the plot point, but I didn't have the Japanese skill to confirm if the Japanese version knew the plot point, or if discussion was too limited for the AI to be aware of it. It did know of the JRPG and did know of the marketing material around it, so it wasn't simply a case of my target being too niche.
The thoughts/ideas/concepts/scenarios are invariant states/vector/points in the (very high dimensional) space of meanings in the mind and each language is just a basis to reference/define/express/manipulate those ideas/vectors. A coordinatization of that semantic space.
Personally, I'm a multilingual person with native-level command of several languages. Many times it happens, I remember having a specific thought, but don't remember in what language it was. So I can personally sympathize with this finding of the Anthropic researchers.
I wonder if there is somewhere an explanation linking the logical operations made on a on dataset, are resulting in those behaviors?
Gee, I wonder where this data comes from.
Let's think about this step by step.
So, what do we know? Language models like Claud are not programmed directly.
Wait, does that mean they are programmed indirectly?
If so, by whom?
Aha, I got it. They are not programmed, directly or indirectly. They are trained on large amounts of data.
But that is the question, right? Where does all that data come from?
Hm, let me think about it.
Oh hang on I got it!
Language models are trained on data.
But they are language models so the data is language.
Aha! And who generates language?
Humans! Humans generate language!
I got it! Language models are trained on language data generated by humans!
Wait, does that mean that language models like Claud are indirectly programmed by humans?
That's it! Language models like Claude aren't programmed directly by humans because they are indirectly programmed by humans when they are trained on large amounts of language data generated by humans!
[1] Anthropic can now track the bizarre inner workings of a large language model:
https://www.technologyreview.com/2025/03/27/1113916/anthropi...
Speakers: Sébastien Bubeck (OpenAI) and Emily M. Bender (University of Washington). Moderator: Eliza Strickland (IEEE Spectrum).
Video: https://youtu.be/YtIQVaSS5Pg Info: https://computerhistory.org/events/great-chatbot-debate/
The research also modifies internal states—removing “rabbit” or injecting “green”—and sees Claude shift to words like “habit” or end lines with “green.” That’s more about rerouting probabilistic paths than genuine “adaptation.” The authors argue it shows “planning,” but a language model can maintain multiple candidate words at once without engaging in human-like strategy.
Finally, “planning ahead” implies a top-down goal and a mechanism for sustaining it, which is a strong assumption. Transformative evidence would require more than observing feature activations. We should be cautious before anthropomorphizing these neural nets.
Suggesting that an awful lot of calculations are unnecessary in LLMs!
It hallucinating how it thinks through things is particularly interesting - not surprising, but cool to confirm.
I would LOVE to see Anthropic feed the replacement features output to the model itself and fine tune the model on how it thinks through / reasons internally so it can accurately describe how it arrived at its solutions - and see how it impacts its behavior / reasoning.
While it was already generally noticeable, still one more time confirmed that larger model generalizes better instead of using its bigger numbers of parameters just to “memorize by rote” (overfitting).
When a LLM outputs a word, it commits to that word, without knowing what the next word is going to be. Commits meaning once it settles on that token, it will not backtrack.
That is kind of weird. Why would you do that, and how would you be sure?
People can sort of do that too. Sometimes?
Say you're asked to describe a 2D scene in which a blue triangle partially occludes a red circle.
Without thinking about the relationship of the objects at all, you know that your first word is going to be "The" so you can output that token into your answer. And then that the sentence will need a subject which is going to be "blue", "triangle". You can commit to the tokens "The blue triangle" just from knowing that you are talking about a 2D scene with a blue triangle in it, without considering how it relates to anything else, like the red circle. You can perhaps commit to the next token "is", if you have a way to express any possible relationship using the word "to be", such as "the blue circle is partially covering the red circle".
I don't think this analogy necessarily fits what LLMs are doing.
ItS jUsT a StOcHaStIc PaRtOt.
"What have I gotten myself into??"
Really this is all so much slight of hand - as an esolang fanatic this all feels very familiar. Most people can't look a program written in Whitespace and figure it out either, but once compiled it is just like every other program as far as the processor is concerned. LLM's are no different.
Don’t these LLM’s have The Bitter Lesson in their training sets? What are they doing building specialized structures to handle specific needs?
Come on Anthropic, you can do much better than this unconventional and bizarre approach to publication.
[1] On the Biology of a Large Language Model:
https://transformer-circuits.pub/2025/attribution-graphs/bio...
Stop LLM anthropomorphizing, please. #SLAP
Related
Anthropic publishes the 'system prompts' that make Claude tick
Anthropic has published system prompts for its Claude AI models to enhance transparency, outlining their capabilities and limitations, and positioning itself as an ethical player in the AI industry.
Initial explorations of Anthropic's new Computer Use capability
Anthropic has launched the Claude 3.5 Sonnet model and a "computer use" API mode, enhancing desktop interaction with coordinate support while addressing safety concerns and performance improvements in coding tasks.
Claude Computer Use – Is Vision the Ultimate API?
The article reviews Anthropic's Claude Computer, noting its strengths in screen reading and navigation but highlighting challenges in recognizing screen reading moments and managing application states, requiring further advancements.
Claude 3.7 Sonnet and Claude Code
Anthropic released Claude 3.7 Sonnet, a hybrid reasoning model improving coding tasks and web development. It includes Claude Code for automation, maintains previous pricing, and enhances safety with fewer refusals.
Why Anthropic's Claude still hasn't beaten Pokémon
Anthropic's Claude 3.7 Sonnet shows improved reasoning in Pokémon but struggles with gameplay mechanics, low-resolution graphics, and memory retention, highlighting both AI advancements and ongoing challenges in achieving human-level intelligence.