paper
active
2020
252
paper:doi-10-23915-distill-00024-001

Zoom In: An Introduction to Circuits

TL;DR

The Circuits framework proposes that neural network internals are legible at the level of individual neurons and their weighted connections. It advances three speculative claims: features (directions in activation space) are the fundamental unit of neural networks; features are connected by weights to form interpretable circuits; and analogous features and circuits recur across models and tasks. Working primarily in InceptionV1, the paper illustrates these claims with three feature types and three circuits. The features are curve detectors in layer mixed3b that respond to curved boundaries at a radius of roughly 60 pixels, high-low frequency detectors that appear to serve as object-boundary heuristics, and a pose-invariant dog head detector. The circuits are the curve detector circuit, whose algorithm can be read directly off 5×5 convolution weights; a four-layer oriented dog-head circuit that implements XOR-like inhibition between left- and right-facing pathways before unioning them; and a superposition circuit in which a pure car detector in mixed4c is spread, in the following layer, across neurons that primarily detect dogs, apparently so the network can represent more features than it has neurons. The method introduced is the handwritten circuit reimplementation, in which weights are set by hand to reproduce a neuron's function as a falsifiability test. Universality is observed anecdotally across AlexNet, InceptionV1, VGG19, ResNetV2-50, and models trained on Places365, and the authors treat this evidence as preliminary. The paper argues that if all three claims hold, mechanistic interpretability can be grounded as a natural science, with circuits serving as falsifiable, small-scope epistemic units that could eventually compose into full accounts of model behavior, analogous to cellular biology's role in zoology.
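The superposition idea in the summary above rests on a geometric fact: in a high-dimensional space, far more directions than dimensions can be nearly orthogonal to one another. The sketch below illustrates only that fact with random directions. The sizes (`N_DIMS`, `N_FEATURES`) and helper names are hypothetical choices for illustration, not values or code from the paper, and random vectors are a stand-in for learned features.

```python
import numpy as np

def random_unit_vectors(n_features: int, n_dims: int, seed: int = 0) -> np.ndarray:
    """Sample n_features random directions on the unit sphere in n_dims dimensions."""
    rng = np.random.default_rng(seed)
    vecs = rng.standard_normal((n_features, n_dims))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def interference(directions: np.ndarray) -> np.ndarray:
    """Absolute cosine similarity between every pair of distinct directions."""
    cos = directions @ directions.T
    off_diagonal = ~np.eye(len(directions), dtype=bool)
    return np.abs(cos[off_diagonal])

# Hypothetical sizes: twice as many feature directions as dimensions.
N_DIMS, N_FEATURES = 512, 1024
directions = random_unit_vectors(N_FEATURES, N_DIMS)
overlap = interference(directions)
print(f"mean |cos| = {overlap.mean():.3f}, max |cos| = {overlap.max():.3f}")
```

Small pairwise overlap means that reading one feature off its direction picks up only a little of the others. That is the sense in which a layer could, in principle, carry more features than it has neurons.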

What to take away

  1. Curve detectors in InceptionV1 layer mixed3b respond to curved boundaries with a radius of approximately 60 pixels and are organized into orientation-tiling families that jointly span the full 360 degrees of possible orientations.
  2. The curve detector circuit is directly readable off the 5×5 convolution weights: positive weights are arranged in the shape of the curve, implementing tangent-curve detection at each point, while detectors in opposing orientations supply inhibitory weights.
  3. A four-layer circuit for oriented dog-head detection implements an XOR-like computation by maintaining separate left-facing and right-facing pathways that mutually inhibit each other before converging into pose-invariant union neurons.
  4. InceptionV1 neuron 4e:55 is polysemantic: it responds to cat faces, fronts of cars, and cat legs, three unrelated stimulus classes. The paper treats polysemanticity as a systematic challenge rather than an isolated anomaly.
  5. A superposition circuit shows the network taking a pure car detector in mixed4c and, in the following layer, distributing its representation across neurons that primarily encode dog features, apparently to store more features than there are neurons by exploiting near-orthogonality in high-dimensional space.
  6. Curve detectors and high-low frequency detectors were observed anecdotally across at least four architectures (AlexNet, InceptionV1, VGG19, and ResNetV2-50) as well as in models trained on Places365 instead of ImageNet, providing preliminary support for the universality hypothesis.
  7. The handwritten circuit methodology, a cleanroom reimplementation that hand-sets all weights based on a mechanistic understanding of a feature, serves as a falsifiability test: if the reimplemented weights replicate the original neuron's behavior, the proposed algorithm is supported.
  8. Seven convergent arguments are introduced for establishing that a neuron detects what is claimed: feature visualization, dataset examples, synthetic examples, joint tuning curves, weight-reading (circuit implementation), downstream client analysis, and handwritten circuit reimplementation.
  9. An open hypothesis is raised that high-low frequency detectors, which detect low-frequency patterns on one side of the receptive field and high-frequency patterns on the other, may correspond to previously uncharacterized feature types in biological visual cortex, potentially offering a cross-domain prediction to test.
  10. The paper raises the open question of whether polysemantic neurons can be resolved by 'unfolding' networks or by training objectives that penalize polysemanticity, framing this as equivalent to the disentangled representation learning problem but applied to the latent spaces of discriminative rather than generative models.
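Takeaways 2 and 7 can be made concrete with a toy hand-set unit. The sketch below is a minimal illustration under strong simplifying assumptions: it uses a single input channel and a made-up 5×5 arc template (`ARC`), whereas the circuit described in the paper connects upstream feature channels. None of the weights or function names come from InceptionV1. It sets excitatory weights along an arc and inhibitory weights along the opposite orientation, then probes the unit with synthetic rotated stimuli.

```python
import numpy as np

# Hand-set template: 1s trace a shallow arc on a 5x5 grid (a hypothetical shape,
# not weights taken from InceptionV1).
ARC = np.array([
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
], dtype=float)

def handwritten_curve_kernel(arc: np.ndarray) -> np.ndarray:
    """Excite along the arc, inhibit along the same arc rotated by 180 degrees."""
    return arc - np.rot90(arc, 2)

def response(kernel: np.ndarray, patch: np.ndarray) -> float:
    """Single-position convolution response followed by a ReLU."""
    return float(max(np.sum(kernel * patch), 0.0))

def tuning_curve(kernel: np.ndarray, arc: np.ndarray) -> dict:
    """Probe the unit with the arc rotated in 90-degree steps (synthetic stimuli)."""
    return {90 * k: response(kernel, np.rot90(arc, k)) for k in range(4)}

kernel = handwritten_curve_kernel(ARC)
tuning = tuning_curve(kernel, ARC)
print(tuning)
```

The unit responds most strongly to the matched orientation and not at all to the opposite one. In the methodology the paper describes, the analogous check would compare such a hand-set unit's behavior against the trained neuron it is meant to reproduce.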

Peer brief — for seminar discussion

Olah et al. (2020) introduce the Circuits framework as a program for mechanistic interpretability of neural networks, organized around three speculative claims: that features (defined as directions in a layer's activation vector space, often individual neurons) are the fundamental unit of neural networks and are understandable; that features are connected by weights forming circuits (computational subgraphs) that are also understandable; and that analogous features and circuits recur across models and tasks (the universality hypothesis). The empirical work is conducted primarily in InceptionV1, with anecdotal cross-architecture comparison spanning AlexNet, VGG19, ResNetV2-50, and models trained on Places365. The load-bearing finding is that network weights, when examined at the level of small circuits, encode legible algorithms: the 5×5 convolution weights connecting earlier curve detectors to the curve detectors in InceptionV1 layer mixed3b spell out a tangent-curve detection procedure, with positive weights tracing the curve shape and inhibitory weights from opposing orientations. A four-layer dog-head circuit maintains separate left- and right-facing detection pathways with mutual inhibition before unioning them into pose-invariant responses. Polysemantic neuron 4e:55, which responds to cat faces, car fronts, and cat legs, is presented as an example of a core unsolved challenge. The paper offers a mechanistic account of how such neurons can arise: a superposition circuit in which a pure car feature in mixed4c is spread across dog-detector neurons in the following layer, exploiting near-orthogonality in high-dimensional space. The method introduced for falsifiability is the handwritten circuit reimplementation, in which all weights are set by hand based on the proposed algorithm and the result is compared to the original neuron's behavior. An alternative the paper does not adopt is activation patching or causal tracing, which tests circuit claims by intervening on activations rather than requiring complete weight reconstruction.
The universality evidence is explicitly characterized as anecdotal, and this is the clearest target for pushback: the paper offers visual inspection across multiple architectures but no quantitative similarity metric, no statistical test, and no control for the possibility that correlated but mechanistically distinct features are being conflated — a concern the authors themselves raise and then leave unresolved. The paper's central prediction is that if universality holds broadly, a periodic table of visual features becomes achievable; it also speculatively proposes that high-low frequency detectors, undescribed in prior neuroscience literature, might be confirmable in biological visual cortex, which would constitute strong cross-domain evidence for universality.
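For discussion, the left/right pathway structure described in this brief can be reduced to a few lines. This is a toy sketch of the summary's description, not the paper's circuit: the scalar inputs, the `inhibition` strength, and the function names are all hypothetical, and real units operate on many channels and spatial positions.

```python
import numpy as np

def relu(x):
    """Rectified linear unit."""
    return np.maximum(x, 0.0)

def oriented_head_circuit(left_evidence: float, right_evidence: float,
                          inhibition: float = 1.0):
    """Toy two-pathway circuit: each orientation suppresses the other, then both are unioned."""
    left_head = relu(left_evidence - inhibition * right_evidence)
    right_head = relu(right_evidence - inhibition * left_evidence)
    invariant_head = relu(left_head + right_head)  # pose-invariant union
    return float(left_head), float(right_head), float(invariant_head)

# Probe all four combinations of binary left/right evidence.
truth_table = {
    (left, right): oriented_head_circuit(left, right)[2]
    for left in (0.0, 1.0)
    for right in (0.0, 1.0)
}
print(truth_table)
```

With equal inhibition strength, the union unit fires when exactly one orientation is present and stays silent when both or neither are, which is the XOR-like behavior the summary attributes to the circuit.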

Frameworks (2)

  • Circuits Thread
    An open scientific collaboration hosted on the Distill Slack that studies the inner workings of neural networks through zoomed-in mechanistic investigation
  • Interpretability as Natural Science
    Proposed paradigm for evaluating interpretability work through empirical falsifiability rather than benchmarks or user studies

Original abstract

This article introduces the 'circuits' approach to understanding neural networks by zooming in on individual neurons and their connections, analogous to how microscopes revealed cells in biology. The authors present three speculative claims: (1) features are the fundamental unit of neural networks corresponding to directions in activation space, (2) features are connected by weights forming circuits that implement meaningful algorithms, and (3) analogous features and circuits form across models and tasks. Through examples like curve detectors, dog head detectors, and car detection circuits, the paper argues that neural networks are surprisingly interpretable at fine-grained levels of analysis.
