March 31st, 2025

Circuit Tracing: Revealing Computational Graphs in Language Models (Anthropic)

The paper presents a method called circuit tracing for understanding language models: attribution graphs visualize how features interact to produce outputs, the approach is validated through case studies, and the authors note its limitations and directions for future research.


The paper introduces a novel method for understanding the inner workings of language models through a technique called circuit tracing. This approach involves creating graph representations of the model's computations by substituting complex components with a more interpretable "cross-layer transcoder" model. The authors focus on generating "attribution graphs" that illustrate how specific features interact to produce outputs for given prompts. They validate their findings through various case studies, including factual recall and addition tasks, demonstrating the effectiveness of their method in revealing the mechanisms behind model behaviors. The paper also discusses the limitations of their approach, such as the challenges posed by attention patterns and the complexity of global circuits. The authors emphasize the potential for future research to refine these methods and explore additional model mechanisms. Overall, the study aims to enhance the interpretability of language models, laying the groundwork for further investigations into their operational principles.
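The core idea described above — edges in an attribution graph quantify how much one feature's activation contributes, through linearized connections, to another feature — can be illustrated with a toy sketch. This is not Anthropic's actual implementation; the sizes, random values, and the simple activation-times-weight rule are assumptions for illustration only.

```python
import numpy as np

# Toy sketch of attribution-graph edge weights (illustrative only, not
# the paper's code): edge (i, j) gets the upstream feature's activation
# multiplied by the linearized path weight from feature i to feature j.

rng = np.random.default_rng(0)

n_src, n_tgt = 4, 3
src_activations = rng.random(n_src)               # upstream feature activations
path_weights = rng.normal(size=(n_src, n_tgt))    # assumed linearized src->tgt weights

# attribution[i, j] = a_i * w_ij; summing a column recovers the total
# linear contribution flowing into target feature j.
attribution = src_activations[:, None] * path_weights

for i in range(n_src):
    for j in range(n_tgt):
        print(f"feature {i} -> feature {j}: {attribution[i, j]:+.3f}")
```

Because each edge is a product of an activation and a fixed weight, the column sums of this matrix decompose a target feature's input into per-source contributions, which is what makes the graph readable node by node.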

- The study presents a method for uncovering the mechanisms of language models using circuit tracing.

- Attribution graphs are created to visualize the interactions of features in model computations.

- The approach is validated through case studies, highlighting its effectiveness in understanding model behaviors.

- Limitations include challenges with attention patterns and the complexity of global circuits.

- Future research is encouraged to refine these methods and explore additional mechanisms in language models.

4 comments
By @bob1029 - 1 day
> Deep learning models produce their outputs using a series of transformations distributed across many computational units (artificial “neurons”). The field of mechanistic interpretability seeks to describe these transformations in human-understandable language.

This is the central theme behind why I find techniques like genetic programming to be so compelling. You get interpretability by default. The second order effect of this seems to be that you can generalize using substantially less training data. The humans developing the model can look inside the box and set breakpoints, inspect memory, snapshot/restore state, follow the rabbit, etc.

The biggest tradeoff here being that the search space over computer programs tends to be substantially more rugged. You can't use math tricks to cheat the computation. You have to run every damn program end-to-end and measure the performance of each directly. However, you can execute linear program tapes very, very quickly on modern x86 CPUs. You can search through a billion programs with a high degree of statistical certainty in a few minutes. I believe we are at a point where some of the ideas from the 20th century are viable again.
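The "linear program tape" idea in the comment above — run every candidate program end-to-end and measure its fitness directly — can be sketched as a tiny register-machine interpreter plus random search. The opcode set, register count, and target function here are invented for illustration, not taken from the comment.

```python
import random

# Sketch of a linear program tape: a fixed-length list of register
# instructions executed straight through, with no control flow.
# Encoding (op, dst, src1, src2) is an assumption for this example.

OPS = {
    0: lambda a, b: a + b,
    1: lambda a, b: a - b,
    2: lambda a, b: a * b,
}

def run_tape(tape, x, n_regs=4):
    regs = [0.0] * n_regs
    regs[0] = x                        # input placed in register 0
    for op, dst, s1, s2 in tape:
        regs[dst] = OPS[op](regs[s1], regs[s2])
    return regs[0]                     # output read back from register 0

def random_tape(length, n_regs=4):
    return [(random.randrange(len(OPS)),
             random.randrange(n_regs),
             random.randrange(n_regs),
             random.randrange(n_regs)) for _ in range(length)]

# Brute-force search: score each candidate against the target f(x) = x*x.
random.seed(1)
best, best_err = None, float("inf")
for _ in range(5000):
    tape = random_tape(4)
    err = sum((run_tape(tape, x) - x * x) ** 2 for x in (1.0, 2.0, 3.0))
    if err < best_err:
        best, best_err = tape, err
print("best squared error:", best_err)
```

Each candidate is fully interpretable: the winning tape is just a short list of register operations you can read, single-step, or snapshot, which is the "interpretability by default" the comment is pointing at. Real genetic programming would add crossover and mutation on top of this evaluation loop.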

By @ironbound - 1 day
For people new to this, maybe check out this video; it gives a quick explanation of how the internals work: https://m.youtube.com/watch?v=UKcWu1l_UNw

In theory, if Anthropic puts research into the mechanics of the models' internals, we can get better returns in training and alignment.

By @somethingsome - 1 day
Is the pdf available somewhere?