Shan Carter

Co-author; managed interpretability team, guided visual style.

openalex A5087762412 name_hash 47b36185f41d601a6b60feaf…

Authored papers (3)

Towards monosemanticity: Decomposing language models with dictionary learning (2023)
referenced-only
Zoom In: An Introduction to Circuits (2020), cited by 252
The Circuits framework proposes that neural network internals are legible at the level of individual neurons and their weighted connections, advancing three speculative claims: features (directions in activation space) are the fundamental unit, features connect via weights to form interpretable circuits, and analogous features and circuits recur across architectures. Working primarily in InceptionV1, the paper demonstrates these claims through three feature types — curve detectors in layer mixed3b responding to curved boundaries at ~60-pixel radius, high-low frequency detectors serving as object-boundary heuristics, and a pose-invariant dog head detector — and three circuits: the curve detector circuit readable directly off 5×5 convolution weights, a four-layer oriented dog-head detection circuit implementing XOR-like inhibition between left- and right-facing pathways before unioning them, and a superposition circuit in mixed4c where a pure car feature is deliberately spread across dog-detector neurons to conserve representable dimensions. The method introduced is the handwritten circuit reimplementation, in which weights are set by hand to reproduce a neuron's function as a falsifiability test. Universality is observed anecdotally across AlexNet, InceptionV1, VGG19, ResNetV2-50, and models trained on Places365, though the authors treat this as preliminary. The paper argues that if all three claims hold, mechanistic interpretability can be grounded as a natural science — with circuits serving as falsifiable, small-scope epistemic units that could eventually compose into full accounts of model behavior, analogous to cellular biology's role in zoology.
Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet
referenced-only
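
The "XOR-like inhibition between left- and right-facing pathways before unioning them" described in the Zoom In summary above can be illustrated with a toy sketch. This is not the InceptionV1 circuit: the function names, the scalar inputs, and the weights of +1 and -1 are all hypothetical, chosen only to show the shape of the idea. Each oriented unit is excited by evidence for its own orientation and inhibited by evidence for the opposite one, and a final unit sums the two.

```python
# Toy illustration only: weights and names are hypothetical, not taken from InceptionV1.

def relu(x: float) -> float:
    """Rectified linear unit, the nonlinearity assumed for every toy neuron."""
    return max(0.0, x)


def oriented_unit(preferred: float, opposite: float,
                  w_excite: float = 1.0, w_inhibit: float = -1.0) -> float:
    """Fire on evidence for the preferred orientation, suppressed by the opposite one."""
    return relu(w_excite * preferred + w_inhibit * opposite)


def pose_invariant_head(left_evidence: float, right_evidence: float) -> float:
    """Union of the two mutually inhibiting oriented pathways."""
    left_unit = oriented_unit(left_evidence, right_evidence)
    right_unit = oriented_unit(right_evidence, left_evidence)
    return relu(left_unit + right_unit)


if __name__ == "__main__":
    # Enumerate the four binary input combinations to show the XOR-like response.
    for left, right in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]:
        print(left, right, pose_invariant_head(left, right))
```

With these assumed weights the output is nonzero when exactly one orientation is present and zero when both or neither are, which is the XOR-like behaviour the summary refers to. Setting weights by hand like this and checking the response is the same spirit as the handwritten circuit reimplementation, although the real test is run against a trained network's neurons.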

Affiliations (1)

Anthropic (institute)

Co-authors (12)

Chris Olah (5 shared)
Gabriel Goh (4 shared)
Ludwig Schubert (4 shared)
Michael Petrov (4 shared)
Nick Cammarata (4 shared)
Adam Jermyn (2 shared)
Adly Templeton (2 shared)
Joshua Batson (2 shared)
Tom Henighan (2 shared)
Trenton Bricken (2 shared)
Adam Pearce (1 shared)
Andy Jones (1 shared)

Their work is cited by (6)

Recent mentions (3)

papers-typed: olah-2020-introduction.md
papers: towards.md
papers: scaling.md