How do language models think—and can we trace their thoughts? Anthropic developed a method for visualizing Claude’s internal reasoning, revealing multi-step planning and shared concepts across languages. These findings offer a glimpse into how models organize and generate their responses.
We are excited to welcome the founder of AI Safety Tokyo, Blaine Rogers, to tell us all about Anthropic's latest interpretability research. Hope to see you there!
Understanding how large language models (LLMs) represent and manipulate concepts internally is essential for advancing interpretability and safety. This work introduces attribution graphs, a method for tracing how information flows between interpretable features inside a transformer model. By applying this technique to Anthropic’s Claude models, the study uncovers evidence of structured and interpretable reasoning processes, including multi-step planning and concept representations shared across multiple languages.
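To give a flavour of the bookkeeping involved, here is a minimal, self-contained sketch in Python (PyTorch + networkx). It treats the units of a toy two-layer network as stand-in "features", estimates each edge weight with the common activation-times-gradient heuristic, and assembles the result into a directed graph. The toy network, the node names, and the pruning threshold are assumptions for illustration only, not Anthropic's actual pipeline.

```python
import torch
import networkx as nx

torch.manual_seed(0)

# Toy stand-in: 4 inputs -> 8 "layer-1 features" -> 6 "layer-2 features" -> 1 logit.
W1, W2 = torch.randn(8, 4), torch.randn(6, 8)
w_out = torch.randn(6)

x = torch.randn(4, requires_grad=True)
h1 = torch.relu(W1 @ x)            # layer-1 feature activations
h2 = torch.relu(W2 @ h1)           # layer-2 feature activations
logit = (w_out @ h2).unsqueeze(0)  # model output


def attributions(upstream: torch.Tensor, downstream: torch.Tensor) -> torch.Tensor:
    """Attribution of each upstream activation to each downstream activation,
    estimated as upstream_activation * d(downstream)/d(upstream)."""
    out = torch.zeros(downstream.numel(), upstream.numel())
    for j in range(downstream.numel()):
        (grad,) = torch.autograd.grad(downstream[j], upstream, retain_graph=True)
        out[j] = upstream.detach() * grad
    return out


# Assemble the attribution graph, pruning weak edges for readability.
G = nx.DiGraph()
for name_u, name_d, up, down in [("L1", "L2", h1, h2), ("L2", "out", h2, logit)]:
    A = attributions(up, down)
    for j in range(A.shape[0]):
        for i in range(A.shape[1]):
            if abs(A[j, i]) > 0.1:
                G.add_edge((name_u, i), (name_d, j), weight=A[j, i].item())

print(f"{G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
```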
Attribution graphs reveal not only how individual concepts are encoded but also how they interact and propagate through the network to influence model outputs. Together, the method and its findings provide new tools for reverse-engineering language models and shed light on the internal dynamics of high-level reasoning in LLMs.
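To make "propagate through the network" concrete, the snippet below ranks paths from an input-side feature to the output node of such a graph by the product of absolute edge weights. The hard-coded edge weights and the scoring rule are made up for illustration; the paper's own pruning and analysis procedure is more involved.

```python
import math
import networkx as nx

# Hypothetical attribution graph in the same format as the sketch above.
G = nx.DiGraph()
G.add_weighted_edges_from([
    (("L1", 0), ("L2", 1), 0.9),
    (("L1", 0), ("L2", 3), -0.4),
    (("L1", 2), ("L2", 3), 0.7),
    (("L2", 1), ("out", 0), 1.2),
    (("L2", 3), ("out", 0), 0.5),
])


def path_strength(graph: nx.DiGraph, path) -> float:
    """Score a path by the product of absolute edge attributions."""
    return math.prod(abs(graph[u][v]["weight"]) for u, v in zip(path, path[1:]))


# Rank every simple path from a chosen upstream feature to the output.
source, target = ("L1", 0), ("out", 0)
paths = nx.all_simple_paths(G, source, target)
for p in sorted(paths, key=lambda p: path_strength(G, p), reverse=True):
    print(f"{path_strength(G, p):.2f}  " + " -> ".join(map(str, p)))
```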
Lindsey et al., "On the Biology of a Large Language Model", Transformer Circuits, 2025. Abstract by Claude.