Chris Olah

Co-author; provided high-level research guidance, wrote introduction/discussion.

openalex A5039751155 name_hash 158a0d465e846c466e192592…

Authored papers (3)

A Mathematical Framework for Transformer Circuits (2021)
Induction heads — attention heads that search for prior occurrences of the current token and predict the following token — constitute the primary in-context learning mechanism in two-layer attention-only transformers, and emerge exclusively through K-composition between a first-layer previous-token head and second-layer heads; they do not appear in one-layer models. The paper introduces the path expansion trick as its core analytical instrument: by representing transformer computation as a sum over end-to-end paths rather than a product over layers, it renders the weights of zero-, one-, and two-layer attention-only models directly interpretable. Zero-layer transformers reduce to bigram log-likelihood tables accessible via W_U W_E; one-layer models decompose into bigram plus skip-trigram ("A…BC") ensembles readable from the ~2.5-billion-entry expanded OV and QK matrices (for a ~50,000-token vocabulary); two-layer models introduce three composition types (Q-, K-, V-composition) of which K-composition is empirically dominant in small models, enabling induction heads verified through eigenvalue analysis of W_OV and W_QK and confirmed on out-of-distribution random repeated token sequences. Models studied use configurations including 12 heads with d_head=64 and 32 heads with d_head=128, context size 2048 tokens. The paper argues that induction heads represent a qualitative algorithmic transition point — from statistical look-up to sequence-completion inference — that continues to be relevant in larger realistic language models, providing a replicable foothold for mechanistic interpretability of transformers at scale.
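The two simplest ideas in the summary above can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's code: `induction_predict` is a hypothetical plain-Python stand-in for the behavior an induction head implements (find an earlier occurrence of the current token and predict the token that followed it), and `zero_layer_bigram_logits` assumes the shape convention W_E: (d_model, vocab) and W_U: (vocab, d_model). The toy sizes and random weights are made up for the example.

```python
import numpy as np

def induction_predict(tokens):
    """Return the token that followed the most recent earlier occurrence
    of the final token, or None if the final token has not appeared before."""
    current = tokens[-1]
    # Scan backwards over earlier positions, excluding the final one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

def zero_layer_bigram_logits(W_U, W_E):
    """Zero-layer model: the full token-to-token logit table is W_U @ W_E.
    Entry [next_token, current_token] scores next_token given current_token."""
    return W_U @ W_E

rng = np.random.default_rng(0)

# Random repeated sequence with distinct tokens, mimicking the
# out-of-distribution test described in the summary.
block = rng.permutation(50)[:8].tolist()
seq = block + block[:4]            # second copy, cut off after block[3]
pred = induction_predict(seq)      # should continue the copy with block[4]

# Toy zero-layer model (hypothetical sizes, random weights).
vocab, d_model = 50, 16
W_E = rng.normal(size=(d_model, vocab))
W_U = rng.normal(size=(vocab, d_model))
bigram = zero_layer_bigram_logits(W_U, W_E)

# Size of a vocab-by-vocab table for a vocabulary of about 50,000 tokens.
expanded_entries = 50_000 ** 2
```

The last line reproduces the arithmetic behind the "~2.5-billion-entry" figure: a table indexed by two tokens of a 50,000-token vocabulary has 50,000 × 50,000 entries.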
Zoom In: An Introduction to Circuits (2020) ⓒ 252
The Circuits framework proposes that neural network internals are legible at the level of individual neurons and their weighted connections, advancing three speculative claims: features (directions in activation space) are the fundamental unit, features connect via weights to form interpretable circuits, and analogous features and circuits recur across architectures. Working primarily in InceptionV1, the paper demonstrates these claims through three feature types — curve detectors in layer mixed3b responding to curved boundaries at ~60-pixel radius, high-low frequency detectors serving as object-boundary heuristics, and a pose-invariant dog head detector — and three circuits: the curve detector circuit readable directly off 5×5 convolution weights, a four-layer oriented dog-head detection circuit implementing XOR-like inhibition between left- and right-facing pathways before unioning them, and a superposition circuit in mixed4c where a pure car feature is deliberately spread across dog-detector neurons to conserve representable dimensions. The method introduced is the handwritten circuit reimplementation, in which weights are set by hand to reproduce a neuron's function as a falsifiability test. Universality is observed anecdotally across AlexNet, InceptionV1, VGG19, ResNetV2-50, and models trained on Places365, though the authors treat this as preliminary. The paper argues that if all three claims hold, mechanistic interpretability can be grounded as a natural science — with circuits serving as falsifiable, small-scope epistemic units that could eventually compose into full accounts of model behavior, analogous to cellular biology's role in zoology.
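The "XOR-like inhibition, then union" structure described above can be illustrated with hand-set weights, in the spirit of a handwritten reimplementation. This is a toy sketch under stated assumptions: the two-input setup, the +1/-1 weights, and the names `W_oriented`, `W_union`, and `toy_head_detector` are all hypothetical and are not taken from InceptionV1 or from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Hand-set toy weights (assumed, not InceptionV1's). Each orientation-specific
# unit is excited by its own orientation and inhibited by the opposite one.
W_oriented = np.array([[ 1.0, -1.0],   # left-facing unit
                       [-1.0,  1.0]])  # right-facing unit

# The union unit sums both oriented pathways into a pose-invariant response.
W_union = np.array([1.0, 1.0])

def toy_head_detector(evidence):
    """evidence = [left_evidence, right_evidence], each in [0, 1]."""
    oriented = relu(W_oriented @ np.asarray(evidence, dtype=float))
    return float(W_union @ oriented)

left_only = toy_head_detector([1.0, 0.0])    # one orientation present
right_only = toy_head_detector([0.0, 1.0])   # the other orientation present
both = toy_head_detector([1.0, 1.0])         # conflicting evidence cancels
neither = toy_head_detector([0.0, 0.0])      # no evidence
```

With these weights the detector fires for either orientation alone and stays silent when both are equally present, which is the XOR-like pattern; the final sum is the union step.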
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
referenced-only


Originates (3)

method

Olah et al. Computer Vision Model Analysis

concept

Universality Hypothesis
Induction Heads

Studies (2)

Neural Network Interpretability
Circuits Thread

Affiliations (2)

Anthropic (institute)
OpenAI (institute)

Co-authors (12)

Shan Carter (5 shared)
Gabriel Goh (4 shared)
Ludwig Schubert (4 shared)
Michael Petrov (4 shared)
Nick Cammarata (4 shared)
Tom Henighan (2 shared)
Adam Jermyn (1 shared)
Adam Pearce (1 shared)
Adly Templeton (1 shared)
Andy Jones (1 shared)
Brian Chen (1 shared)
C. Daniel Freeman (1 shared)

Their work is cited by (8)

Other inbound relations (3)

cites: cimcWhitepaper (paper)
cites: The Machine Consciousness Hypothesis (paper)
mentions: From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs (paper)

Recent mentions (7)

papers-typed: yu-2025-directions-cones.md
papers-typed: denis-2025-linear-representation.md
machine-consciousness: The Machine Consciousness Hypothesis.md
machine-consciousness: cimcWhitepaper.md
papers: mathematical.md
papers-typed: olah-2020-introduction.md
papers: scaling.md