Chris Olah (thinker:chris-olah)
Co-author; provided high-level research guidance, wrote introduction/discussion.
Authored papers (3)
Induction heads — attention heads that search for prior occurrences of the current token and predict the following token — constitute the primary in-context learning mechanism in two-layer attention-only transformers, and emerge exclusively through K-composition between a first-layer previous-token head and second-layer heads; they do not appear in one-layer models. The paper introduces the path expansion trick as its core analytical instrument: by representing transformer computation as a sum over end-to-end paths rather than a product over layers, it renders the weights of zero-, one-, and two-layer attention-only models directly interpretable. Zero-layer transformers reduce to bigram log-likelihood tables accessible via W_U W_E; one-layer models decompose into bigram plus skip-trigram ("A…BC") ensembles readable from the ~2.5-billion-entry expanded OV and QK matrices (for a ~50,000-token vocabulary); two-layer models introduce three composition types (Q-, K-, V-composition) of which K-composition is empirically dominant in small models, enabling induction heads verified through eigenvalue analysis of W_OV and W_QK and confirmed on out-of-distribution random repeated token sequences. Models studied use configurations including 12 heads with d_head=64 and 32 heads with d_head=128, context size 2048 tokens. The paper argues that induction heads represent a qualitative algorithmic transition point — from statistical look-up to sequence-completion inference — that continues to be relevant in larger realistic language models, providing a replicable foothold for mechanistic interpretability of transformers at scale.
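The induction-head behavior summarized above (find a prior occurrence of the current token, then predict the token that followed it) can be sketched as a plain algorithm. This is a minimal illustration, not the paper's code: the function name `induction_predict`, the 20-token block length, and the fixed seed are assumptions made here. The test sequence mirrors the random repeated-token check the summary mentions.

```python
import random


def induction_predict(tokens):
    """For each position, look up the most recent earlier occurrence of the
    current token and predict the token that followed it (None if unseen)."""
    predictions = []
    last_seen = {}  # token -> index of its most recent earlier occurrence
    for i, tok in enumerate(tokens):
        j = last_seen.get(tok)
        # tokens[j + 1] always exists because j < i
        predictions.append(tokens[j + 1] if j is not None else None)
        last_seen[tok] = i
    return predictions


# Hypothetical test input: a block of distinct random token ids, repeated once.
rng = random.Random(0)
block = rng.sample(range(50_000), 20)
sequence = block + block

preds = induction_predict(sequence)

# Count positions in the second copy where the prediction matches the true next token.
hits = sum(
    preds[i] == sequence[i + 1]
    for i in range(len(block), len(sequence) - 1)
)
```

On the first copy of the block no token has been seen before, so the sketch predicts nothing. On the second copy every next token is recoverable, which is why random repeated sequences isolate this mechanism from ordinary bigram statistics.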
The Circuits framework proposes that neural network internals are legible at the level of individual neurons and their weighted connections, advancing three speculative claims: features (directions in activation space) are the fundamental unit, features connect via weights to form interpretable circuits, and analogous features and circuits recur across architectures. Working primarily in InceptionV1, the paper demonstrates these claims through three feature types (curve detectors in layer mixed3b responding to curved boundaries at ~60-pixel radius, high-low frequency detectors serving as object-boundary heuristics, and a pose-invariant dog head detector) and three circuits: the curve detector circuit readable directly off 5×5 convolution weights, a four-layer oriented dog-head detection circuit implementing XOR-like inhibition between left- and right-facing pathways before unioning them, and a superposition circuit in which a pure car feature from mixed4c is spread across dog-detector neurons, conserving representable dimensions. The method introduced is the handwritten circuit reimplementation, in which weights are set by hand to reproduce a neuron's function as a falsifiability test. Universality is observed anecdotally across AlexNet, InceptionV1, VGG19, ResNetV2-50, and models trained on Places365, though the authors treat this as preliminary. The paper argues that if all three claims hold, mechanistic interpretability can be grounded as a natural science, with circuits serving as falsifiable, small-scope epistemic units that could eventually compose into full accounts of model behavior, analogous to cellular biology's role in zoology.
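The "inhibit, then union" structure described for the oriented dog-head circuit can be illustrated with a toy example. This is a sketch under loud assumptions: the scalar evidence inputs, the `inhibition` weight, and the function name `oriented_head_circuit` are all hypothetical. The weights are set by hand in the spirit of a handwritten reimplementation and are not taken from InceptionV1.

```python
def relu(x):
    """Rectified linear unit on a scalar."""
    return max(0.0, x)


def oriented_head_circuit(left_evidence, right_evidence, inhibition=1.0):
    """Toy two-pathway circuit with hand-set weights (hypothetical values).

    Each oriented unit is excited by its own evidence and inhibited by the
    opposite orientation; a final unit unions the two pathways into a
    pose-invariant response.
    """
    left = relu(left_evidence - inhibition * right_evidence)
    right = relu(right_evidence - inhibition * left_evidence)
    union = left + right  # pose-invariant output
    return left, right, union


# Evidence for exactly one orientation passes through to the union ...
left_only = oriented_head_circuit(1.0, 0.0)
right_only = oriented_head_circuit(0.0, 1.0)
# ... while equal evidence for both orientations cancels (the XOR-like part).
conflicting = oriented_head_circuit(1.0, 1.0)
```

Setting the weights by hand keeps the toy falsifiable in the same way the summary describes: if the hand-built unit did not behave like the unit it imitates, the proposed reading of the circuit would be wrong.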
Originates (3)
Co-authors (12)
- Shan Carter (5 shared)
- Gabriel Goh (4 shared)
- Ludwig Schubert (4 shared)
- Michael Petrov (4 shared)
- Nick Cammarata (4 shared)
- Tom Henighan (2 shared)
- Adam Jermyn (1 shared)
- Adam Pearce (1 shared)
- Adly Templeton (1 shared)
- Andy Jones (1 shared)
- Brian Chen (1 shared)
- C. Daniel Freeman (1 shared)
Their work is cited by (8)
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs (2× refs)
- Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations (2× refs)
- Endogenous Resistance to Activation Steering in Language Models (2× refs)
- Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders (2× refs)
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? (1× refs)
- When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models (1× refs)
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks (1× refs)
- Unveiling the Latent Directions of Reflection in Large Language Models (1× refs)
Other inbound relations (3)
- cites: cimcWhitepaper (paper)
- cites: The Machine Consciousness Hypothesis (paper)
- mentions: From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs (paper)
Recent mentions (7)
- papers-typed: yu-2025-directions-cones.md
- papers-typed: denis-2025-linear-representation.md
- machine-consciousness: The Machine Consciousness Hypothesis.md
- machine-consciousness: cimcWhitepaper.md
- papers: mathematical.md
- papers-typed: olah-2020-introduction.md
- papers: scaling.md