March 31st, 2025

Circuit Tracing: Revealing Computational Graphs in Language Models (Anthropic)

The paper presents a method called circuit tracing for understanding language models: attribution graphs visualize how features interact to produce outputs, the approach is validated through case studies, and the authors note its limitations and directions for future research.


The paper introduces a novel method for understanding the inner workings of language models through a technique called circuit tracing. This approach involves creating graph representations of the model's computations by substituting complex components with a more interpretable "cross-layer transcoder" model. The authors focus on generating "attribution graphs" that illustrate how specific features interact to produce outputs for given prompts. They validate their findings through various case studies, including factual recall and addition tasks, demonstrating the effectiveness of their method in revealing the mechanisms behind model behaviors. The paper also discusses the limitations of their approach, such as the challenges posed by attention patterns and the complexity of global circuits. The authors emphasize the potential for future research to refine these methods and explore additional model mechanisms. Overall, the study aims to enhance the interpretability of language models, laying the groundwork for further investigations into their operational principles.
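The core idea described above — edges in an attribution graph quantify how much one feature's activation contributes, through linearized connections, to another feature — can be illustrated with a toy sketch. This is not Anthropic's actual implementation; the sizes, random values, and the simple activation-times-weight rule are assumptions for illustration only.

```python
import numpy as np

# Toy sketch of attribution-graph edge weights (illustrative only, not
# the paper's code): edge (i, j) gets the upstream feature's activation
# multiplied by the linearized path weight from feature i to feature j.

rng = np.random.default_rng(0)

n_src, n_tgt = 4, 3
src_activations = rng.random(n_src)               # upstream feature activations
path_weights = rng.normal(size=(n_src, n_tgt))    # assumed linearized src->tgt weights

# attribution[i, j] = a_i * w_ij; summing a column recovers the total
# linear contribution flowing into target feature j.
attribution = src_activations[:, None] * path_weights

for i in range(n_src):
    for j in range(n_tgt):
        print(f"feature {i} -> feature {j}: {attribution[i, j]:+.3f}")
```

Because each edge is a product of an activation and a fixed weight, the column sums of this matrix decompose a target feature's input into per-source contributions, which is what makes the graph readable node by node.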

- The study presents a method for uncovering the mechanisms of language models using circuit tracing.

- Attribution graphs are created to visualize the interactions of features in model computations.

- The approach is validated through case studies, highlighting its effectiveness in understanding model behaviors.

- Limitations include challenges with attention patterns and the complexity of global circuits.

- Future research is encouraged to refine these methods and explore additional mechanisms in language models.

4 comments
By @bob1029 - 1 day
> Deep learning models produce their outputs using a series of transformations distributed across many computational units (artificial “neurons”). The field of mechanistic interpretability seeks to describe these transformations in human-understandable language.

This is the central theme behind why I find techniques like genetic programming to be so compelling. You get interpretability by default. The second order effect of this seems to be that you can generalize using substantially less training data. The humans developing the model can look inside the box and set breakpoints, inspect memory, snapshot/restore state, follow the rabbit, etc.

The biggest tradeoff here being that the search space over computer programs tends to be substantially more rugged. You can't use math tricks to cheat the computation. You have to run every damn program end-to-end and measure the performance of each directly. However, you can execute linear program tapes very, very quickly on modern x86 CPUs. You can search through a billion programs with a high degree of statistical certainty in a few minutes. I believe we are at a point where some of the ideas from the 20th century are viable again.
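The "linear program tape" idea in the comment above — run every candidate program end-to-end and measure its fitness directly — can be sketched as a tiny register-machine interpreter plus random search. The opcode set, register count, and target function here are invented for illustration, not taken from the comment.

```python
import random

# Sketch of a linear program tape: a fixed-length list of register
# instructions executed straight through, with no control flow.
# Encoding (op, dst, src1, src2) is an assumption for this example.

OPS = {
    0: lambda a, b: a + b,
    1: lambda a, b: a - b,
    2: lambda a, b: a * b,
}

def run_tape(tape, x, n_regs=4):
    regs = [0.0] * n_regs
    regs[0] = x                        # input placed in register 0
    for op, dst, s1, s2 in tape:
        regs[dst] = OPS[op](regs[s1], regs[s2])
    return regs[0]                     # output read back from register 0

def random_tape(length, n_regs=4):
    return [(random.randrange(len(OPS)),
             random.randrange(n_regs),
             random.randrange(n_regs),
             random.randrange(n_regs)) for _ in range(length)]

# Brute-force search: score each candidate against the target f(x) = x*x.
random.seed(1)
best, best_err = None, float("inf")
for _ in range(5000):
    tape = random_tape(4)
    err = sum((run_tape(tape, x) - x * x) ** 2 for x in (1.0, 2.0, 3.0))
    if err < best_err:
        best, best_err = tape, err
print("best squared error:", best_err)
```

Each candidate is fully interpretable: the winning tape is just a short list of register operations you can read, single-step, or snapshot, which is the "interpretability by default" the comment is pointing at. Real genetic programming would add crossover and mutation on top of this evaluation loop.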

By @ironbound - 1 day
For people new to this, maybe check out this video; it gives a quick explanation of how the internals work: https://m.youtube.com/watch?v=UKcWu1l_UNw

In theory, if Anthropic puts research into the mechanics of the models' internals, we can get better returns in training and alignment.

By @somethingsome - 1 day
Is the pdf available somewhere?