thinker:shan-carter
Shan Carter
Co-author; managed interpretability team, guided visual style.
Authored papers (3)
The Circuits framework proposes that neural network internals are legible at the level of individual neurons and their weighted connections. It advances three speculative claims: features (directions in activation space) are the fundamental unit; features connect via weights to form interpretable circuits; and analogous features and circuits recur across architectures. Working primarily in InceptionV1, the paper illustrates these claims with three feature types (curve detectors in layer mixed3b that respond to curved boundaries at a radius of roughly 60 pixels, high-low frequency detectors that serve as object-boundary heuristics, and a pose-invariant dog head detector) and three circuits: a curve detector circuit readable directly off 5×5 convolution weights; a four-layer oriented dog-head detection circuit in which left- and right-facing pathways inhibit each other in an XOR-like fashion before being unioned; and a superposition circuit in mixed4c in which a pure car feature is deliberately spread across dog-detector neurons to conserve representable dimensions. The method introduced is the handwritten circuit reimplementation, in which weights are set by hand to reproduce a neuron's function as a falsifiability test. Universality is observed anecdotally across AlexNet, InceptionV1, VGG19, ResNetV2-50, and models trained on Places365, though the authors treat this evidence as preliminary. The paper argues that if all three claims hold, mechanistic interpretability can be grounded as a natural science, with circuits serving as falsifiable, small-scope epistemic units that could eventually compose into full accounts of model behavior, much as cellular biology underpins zoology.
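The XOR-like inhibition and the hand-set-weights idea described above can be illustrated with a toy sketch. This is not the paper's circuit or its actual weights: the two input features, the weight values, and the names `W_ORIENTED`, `W_UNION`, and `toy_head_circuit` are all hypothetical, chosen only to show how mutually inhibiting pathways followed by a union behave.

```python
def relu(x: float) -> float:
    """Rectified linear unit, the nonlinearity assumed for this toy circuit."""
    return max(x, 0.0)


# Hypothetical hand-set weights (assumption: not taken from InceptionV1).
# Each row is (weight on left-facing input, weight on right-facing input).
W_ORIENTED = [
    (1.0, -1.0),   # left pathway: excited by left evidence, inhibited by right
    (-1.0, 1.0),   # right pathway: excited by right evidence, inhibited by left
]
# The union step simply adds the two oriented pathways.
W_UNION = (1.0, 1.0)


def toy_head_circuit(left: float, right: float) -> float:
    """Return the pose-invariant output for given left/right feature activations."""
    oriented = [relu(w_l * left + w_r * right) for w_l, w_r in W_ORIENTED]
    return relu(sum(w * a for w, a in zip(W_UNION, oriented)))


if __name__ == "__main__":
    for left, right in [(0, 0), (1, 0), (0, 1), (1, 1)]:
        print(left, right, toy_head_circuit(left, right))
```

With these assumed weights, the output is positive when exactly one orientation is present and zero when both or neither are, which is the XOR-like pattern. In the spirit of a falsifiability test, a hand-written circuit like this would be compared against the real neuron's responses; a mismatch would count against the proposed explanation.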
Affiliations (1)
- Anthropic (institute)
Co-authors (12)
- Chris Olah (5 shared)
- Gabriel Goh (4 shared)
- Ludwig Schubert (4 shared)
- Michael Petrov (4 shared)
- Nick Cammarata (4 shared)
- Adam Jermyn (2 shared)
- Adly Templeton (2 shared)
- Joshua Batson (2 shared)
- Tom Henighan (2 shared)
- Trenton Bricken (2 shared)
- Adam Pearce (1 shared)
- Andy Jones (1 shared)
Their work is cited by (6)
- Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations (2× refs)
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs (2× refs)
- Endogenous Resistance to Activation Steering in Language Models (2× refs)
- Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders (2× refs)
- Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation (1× refs)
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior (1× refs)
Recent mentions (3)
- papers-typed: olah-2020-introduction.md
- papers: towards.md
- papers: scaling.md