thinker
active
thinker:shan-carter

Shan Carter

Co-author; managed interpretability team, guided visual style.

Authored
3
Introduces
0
Studies
0
Affiliations
1
Cited by
6

Authored papers (3)

  • The Circuits framework proposes that neural network internals are legible at the level of individual neurons and their weighted connections, advancing three speculative claims: features (directions in activation space) are the fundamental unit, features connect via weights to form interpretable circuits, and analogous features and circuits recur across architectures. Working primarily in InceptionV1, the paper demonstrates these claims through three feature types — curve detectors in layer mixed3b responding to curved boundaries at ~60-pixel radius, high-low frequency detectors serving as object-boundary heuristics, and a pose-invariant dog head detector — and three circuits: the curve detector circuit readable directly off 5×5 convolution weights, a four-layer oriented dog-head detection circuit implementing XOR-like inhibition between left- and right-facing pathways before unioning them, and a superposition circuit in mixed4c where a pure car feature is deliberately spread across dog-detector neurons to conserve representable dimensions. The method introduced is the handwritten circuit reimplementation, in which weights are set by hand to reproduce a neuron's function as a falsifiability test. Universality is observed anecdotally across AlexNet, InceptionV1, VGG19, ResNetV2-50, and models trained on Places365, though the authors treat this as preliminary. The paper argues that if all three claims hold, mechanistic interpretability can be grounded as a natural science — with circuits serving as falsifiable, small-scope epistemic units that could eventually compose into full accounts of model behavior, analogous to cellular biology's role in zoology.

More papers — OpenAlex / S2

Affiliations (1)

Co-authors (12)

Recent mentions (3)