paper:doi-10-23915-distill-00024-001
Zoom In: An Introduction to Circuits
TL;DR
The Circuits framework proposes that neural network internals are legible at the level of individual neurons and their weighted connections, advancing three speculative claims: features (directions in activation space) are the fundamental unit, features connect via weights to form interpretable circuits, and analogous features and circuits recur across architectures. Working primarily in InceptionV1, the paper demonstrates these claims through three feature types — curve detectors in layer mixed3b responding to curved boundaries at ~60-pixel radius, high-low frequency detectors serving as object-boundary heuristics, and a pose-invariant dog head detector — and three circuits: the curve detector circuit readable directly off 5×5 convolution weights, a four-layer oriented dog-head detection circuit implementing XOR-like inhibition between left- and right-facing pathways before unioning them, and a superposition circuit in mixed4c where a pure car feature is deliberately spread across dog-detector neurons to conserve representable dimensions. The method introduced is the handwritten circuit reimplementation, in which weights are set by hand to reproduce a neuron's function as a falsifiability test. Universality is observed anecdotally across AlexNet, InceptionV1, VGG19, ResNetV2-50, and models trained on Places365, though the authors treat this as preliminary. The paper argues that if all three claims hold, mechanistic interpretability can be grounded as a natural science — with circuits serving as falsifiable, small-scope epistemic units that could eventually compose into full accounts of model behavior, analogous to cellular biology's role in zoology.
What to take away
- 1. Curve detectors in InceptionV1 layer mixed3b respond to curved boundaries with a radius of approximately 60 pixels and are organized into orientation-tiling families that jointly span the full 360 degrees of possible orientations.
- 2. The curve detector circuit is directly readable off the 5×5 convolution weights: positive weights are arranged in the shape of the curve, implementing tangent-curve detection at each point, while detectors in opposing orientations supply inhibitory weights.
- 3. A four-layer circuit for oriented dog-head detection implements an XOR-like computation by maintaining separate left-facing and right-facing pathways that mutually inhibit each other before converging into pose-invariant union neurons.
- 4. InceptionV1 neuron 4e:55 is polysemantic, responding to cat faces, fronts of cars, and cat legs. The paper uses this neuron to illustrate polysemanticity, which it treats as a major challenge for circuit analysis rather than an isolated anomaly.
- 5. A superposition circuit in mixed4c shows that the network takes a pure car-detector neuron and deliberately distributes its representation across neurons that primarily encode dog features, apparently to store more features than there are neurons by exploiting near-orthogonality in high-dimensional space.
- 6. Curve detectors and high-low frequency detectors were observed anecdotally across at least four architectures — AlexNet, InceptionV1, VGG19, and ResNetV2-50 — as well as in models trained on Places365 instead of ImageNet, providing preliminary support for the universality hypothesis.
- 7. The handwritten circuit methodology (a cleanroom reimplementation in which all weights are set by hand from a mechanistic understanding of a feature) serves as a falsifiability test: if the hand-set weights replicate the original neuron's behavior, the proposed algorithm is supported; if they do not, it is called into question.
- 8. Seven convergent arguments are introduced for establishing that a neuron detects what is claimed: feature visualization, dataset examples, synthetic examples, joint tuning curves, weight-reading (circuit implementation), downstream client analysis, and handwritten circuit reimplementation.
- 9. An open hypothesis is raised that high-low frequency detectors — which detect low-frequency patterns on one side of the receptive field and high-frequency patterns on the other — may correspond to previously uncharacterized feature types in biological visual cortex, potentially offering a cross-domain prediction to test.
- 10. The paper raises the open question of whether polysemantic neurons can be resolved by 'unfolding' networks or by training objectives that penalize polysemanticity, framing this as equivalent to the disentangled representation learning problem but applied at the level of discriminative rather than generative model latent spaces.
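The weight pattern described for the curve detector circuit, and the idea of setting weights by hand, can be illustrated with a toy sketch. It rests on loud simplifying assumptions: a single binary 5×5 input patch instead of earlier-layer feature maps, an ASCII arc shape, and hand-picked +1/-1 weights. None of these values are read from InceptionV1, and the helper names are hypothetical.

```python
# Toy 5x5 "handwritten" curve unit. The arc shape and the +1/-1 weights are
# illustrative assumptions, not values taken from InceptionV1.
ARC = [
    ".###.",
    "#...#",
    ".....",
    ".....",
    ".....",
]

def to_patch(rows):
    """Convert ASCII art into a 5x5 grid of 0/1 pixel values."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]

def flip_vertical(patch):
    """Return the same shape in the opposing orientation."""
    return patch[::-1]

def handwritten_kernel(arc_patch):
    """+1 along the preferred arc, -1 along the opposing arc, 0 elsewhere."""
    opposing = flip_vertical(arc_patch)
    return [[a - b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(arc_patch, opposing)]

def response(kernel, patch):
    """Weighted sum of the patch followed by a ReLU."""
    total = sum(w * x
                for k_row, p_row in zip(kernel, patch)
                for w, x in zip(k_row, p_row))
    return max(0, total)

preferred = to_patch(ARC)
kernel = handwritten_kernel(preferred)
print(response(kernel, preferred))                 # 5
print(response(kernel, flip_vertical(preferred)))  # 0
```

In this sketch the unit fires on the arc it was written for and stays silent on the mirrored arc, which is the kind of behavioral comparison a handwritten reimplementation would be checked against.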
Peer brief — for seminar discussion
Olah et al. (2020) introduce the Circuits framework as a program for mechanistic interpretability of neural networks, organized around three speculative claims: that features (defined as directions in a layer's activation vector space, often individual neurons) are the fundamental unit of neural networks and are understandable; that features are connected by weights forming circuits — computational subgraphs — that are also understandable; and that analogous features and circuits recur across models and tasks (the universality hypothesis). The empirical work is conducted primarily in InceptionV1, with anecdotal cross-architecture comparison spanning AlexNet, VGG19, ResNetV2-50, and models trained on Places365. The load-bearing finding is that network weights, when examined at the level of small circuits, encode legible algorithms: the 5×5 convolution weights between early and late curve detectors in InceptionV1 layer mixed3b literally spell out a tangent-curve detection procedure, with positive weights tracing the curve shape and inhibitory weights at opposing orientations. A four-layer dog-head circuit separately maintains left- and right-facing detection pathways with mutual inhibition before unioning them into pose-invariant responses. Polysemantic neuron 4e:55, responding to cat faces, car fronts, and cat legs, is presented as a core unsolved challenge, explained mechanistically through a superposition circuit in mixed4c where a pure car feature is spread across dog-detector neurons to exploit near-orthogonality in high-dimensional space. The method introduced for falsifiability is the handwritten circuit reimplementation, in which all weights are set by hand based on the proposed algorithm and the result compared to the original neuron's behavior; an alternative the paper does not adopt is activation patching or causal tracing, which tests circuit claims by intervening on activations rather than requiring complete weight reconstruction. 
The universality evidence is explicitly characterized as anecdotal, and this is the clearest target for pushback: the paper offers visual inspection across multiple architectures but no quantitative similarity metric, no statistical test, and no control for the possibility that correlated but mechanistically distinct features are being conflated — a concern the authors themselves raise and then leave unresolved. The paper's central prediction is that if universality holds broadly, a periodic table of visual features becomes achievable; it also speculatively proposes that high-low frequency detectors, undescribed in prior neuroscience literature, might be confirmable in biological visual cortex, which would constitute strong cross-domain evidence for universality.
Frameworks (2)
- Circuits Thread
An open scientific collaboration hosted on the Distill Slack, studying the inner workings of neural networks via zoomed-in mechanistic investigation
- Interpretability as Natural Science
Proposed paradigm for evaluating interpretability work through empirical falsifiability rather than benchmarks or user studies
Findings (7)
- InceptionV1 implements a four-layer circuit for pose-invariant dog head detection with mirrored left/right pathways that inhibit each other then unite, exhibiting XOR-like properties
Evidence that neural networks learn sophisticated invariance mechanisms through structured circuits rather than loose feature aggregation
- Weights between early and full curve detectors in InceptionV1 form a curve of positive weights at tangent positions, with opposing orientations inhibitory
Demonstrates that meaningful algorithms can be read directly off floating-point weights in a neural network
- InceptionV1 spreads car feature from a pure car detector in mixed4c across dog detector neurons in the next layer
Circuit-level evidence that polysemantic neurons arise deliberately through superposition rather than entangled computation
- Curve detectors found across AlexNet, InceptionV1, VGG19, ResNetV2-50 and models trained on Places365
Anecdotal evidence for the universality of low-level visual features across different architectures and datasets
- InceptionV1 neuron 4e:55 responds to cat faces, fronts of cars, and cat legs as unrelated stimuli
Concrete example of polysemantic neuron demonstrating the challenge to the circuits agenda
- High-low frequency detectors found across AlexNet, InceptionV1, VGG19, and ResNetV2-50
Second low-level feature type demonstrating cross-architecture universality
- Curve detecting neurons found in every non-trivial vision model carefully examined
Empirical basis for treating curve detectors as a canonical example of meaningful, understandable features
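The XOR-like behavior attributed to the dog-head circuit can be sketched in a few lines. This is a minimal illustration, assuming each pathway collapses to a single ReLU unit with unit-strength excitation and inhibition; the real circuit spans four layers and many neurons, and the function and argument names here are hypothetical.

```python
def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)

def dog_head_union(left_evidence, right_evidence):
    """Toy mirrored pathways: each side is excited by its own evidence,
    inhibited by the other side, and the two are then unioned."""
    left_head = relu(left_evidence - right_evidence)
    right_head = relu(right_evidence - left_evidence)
    return relu(left_head + right_head)

# The union responds when exactly one orientation is supported.
for left, right in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(left, right, dog_head_union(left, right))
```

With binary inputs the union unit is active for a left-facing or a right-facing head alone and silent when both or neither are signaled, which is the sense in which mutual inhibition followed by a union is XOR-like.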
Claims (14)
- Polysemantic neurons are a major challenge for the circuits agenda, because N meanings in one neuron times M in another creates NxM effective connections that cannot be considered individually.
Precise characterization of why polysemanticity poses a combinatorial obstacle to circuit analysis
- If the universality hypothesis is broadly true, it raises the exciting possibility that artificial neural networks could predict features previously unknown in biological neural networks.
Speculative extension of universality to neuroscience, with high-low frequency detectors as a candidate prediction
- After thousands of hours studying individual neurons, the authors conclude that the typical case is that neurons (or other directions in activation space) are understandable, even when initially mysterious.
Authors' interpretive assertion based on extensive empirical investigation, countering texture-only skepticism
- Superposition exploits the geometry of high-dimensional spaces, which allow exponentially many almost-orthogonal vectors but only n strictly orthogonal ones.
Mechanistic explanation for why superposition is geometrically feasible
- Circuits could act as an epistemic foundation for interpretability by breaking down model behavior into falsifiable statements about small subgraphs.
Normative vision for how the circuits agenda could resolve the pre-paradigmatic state of interpretability
- Superposition is in some sense deliberate: the model converts pure neurons into polysemantic neurons to store more features in fewer neurons.
Interpretation of the cars-in-superposition circuit finding as an intentional representational strategy
- Features are connected by weights forming circuits, and these circuits can be rigorously studied and understood as meaningful algorithms.
Second of three speculative claims asserting that subgraphs of neural networks are tractable and meaningful objects of study
- Qualitative research results can change the world: the discovery of cells was qualitative, just as interpretability research is today.
Historical argument defending qualitative interpretability research against dismissal as unscientific
- In the long run, studying circuit motifs may be more important than studying individual circuits for understanding neural networks.
Strategic claim about the relative importance of motif-level abstraction over circuit-level analysis
- Circuit claims are falsifiable: if you understand a circuit, you should be able to predict what changes when you edit the weights.
Argument that circuits methodology meets natural-science standards of falsifiability
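The geometric claim behind superposition, that high-dimensional spaces admit many almost-orthogonal directions, can be checked numerically with a small sketch. It assumes random Gaussian directions as a stand-in for learned features, and the dimensions, pair count, and seed are arbitrary choices for illustration.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Sample a direction uniformly on the unit sphere in `dim` dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_abs_cosine(dim, pairs=200, seed=0):
    """Average |cosine similarity| over random pairs of unit vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        a = random_unit_vector(dim, rng)
        b = random_unit_vector(dim, rng)
        total += abs(sum(x * y for x, y in zip(a, b)))
    return total / pairs

low = mean_abs_cosine(4)
high = mean_abs_cosine(400)
print(f"dim=4: {low:.3f}  dim=400: {high:.3f}")
```

The average overlap between random directions shrinks as the dimension grows, roughly like one over the square root of the dimension, so a layer can hold more nearly independent directions than it has neurons at the cost of a little interference.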
Hypotheses (2)
- We hypothesize that high-low frequency detectors, if predicted by artificial neural network universality, might be found in biological neural networks.
Specific cross-domain prediction mentioned by neuroscientists in conversation with the authors
- We hypothesize that polysemantic neurons may be resolvable by unfolding networks or training to avoid polysemanticity.
Forward-looking proposal for how the polysemanticity challenge to circuits research might be overcome
Questions (6)
- What kind of picture of neural networks would emerge if we treated individual neurons, even individual weights, as being worthy of serious investigation?
Central motivating question for the circuits research program
- Lack of rigorous cross-model comparison demonstrating that specific named features (not just correlated ones) form across architectures
Explicitly identified research gap: anecdotal evidence exists but rigorous characterization is absent
- Is deep learning at a similar, albeit more modest, transition point as the invention of the microscope?
Motivating analogy question framing the circuits agenda as a potential paradigm shift in interpretability
- Whether overall model behavior can be broken down into statements about circuits remains undemonstrated
Identified gap: circuits are small-scope; linking them to model-level behavior requires future work
- Is the apparent universality of some low-level vision features the exception or the rule?
Open empirical question following anecdotal cross-model universality findings
- No established method for resolving polysemantic neurons into pure features at scale
Identified gap linking polysemanticity challenge to disentangled representations literature
Original abstract
This article introduces the 'circuits' approach to understanding neural networks by zooming in on individual neurons and their connections, analogous to how microscopes revealed cells in biology. The authors present three speculative claims: (1) features are the fundamental unit of neural networks corresponding to directions in activation space, (2) features are connected by weights forming circuits that implement meaningful algorithms, and (3) analogous features and circuits form across models and tasks. Through examples like curve detectors, dog head detectors, and car detection circuits, the paper argues that neural networks are surprisingly interpretable at fine-grained levels of analysis.
Cited by (2)
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
Propositional truth in LLMs is not encoded as a single linear direction but as a multi-dimensional subspace that can be characterized by concept cones—sets of all nonnegative linear combinations of or
- Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Distributed alignment search (DAS) resolves two blocking limitations of prior causal abstraction work—brute-force alignment search and the localist assumption that high-level variables map to disjoint