Zoom In: An Introduction to Circuits

By Chris Olah · Nick Cammarata · Ludwig Schubert · Gabriel Goh · Michael Petrov · Shan Carter
Anthropic, OpenAI

DOI 10.23915/distill.00024.001 OpenAlex W3010694149

Circuit Motif · Circuits Thread · Equivariant Circuit · Interpretability as Natural Science · Feature (neural network) · Superposition · Unioning Over Cases · Universality Hypothesis

TL;DR

The Circuits framework proposes that neural network internals are legible at the level of individual neurons and their weighted connections, advancing three speculative claims: features (directions in activation space) are the fundamental unit, features connect via weights to form interpretable circuits, and analogous features and circuits recur across architectures. Working primarily in InceptionV1, the paper demonstrates these claims through three feature types — curve detectors in layer mixed3b responding to curved boundaries at ~60-pixel radius, high-low frequency detectors serving as object-boundary heuristics, and a pose-invariant dog head detector — and three circuits: the curve detector circuit readable directly off 5×5 convolution weights, a four-layer oriented dog-head detection circuit implementing XOR-like inhibition between left- and right-facing pathways before unioning them, and a superposition circuit in mixed4c where a pure car feature is deliberately spread across dog-detector neurons to conserve representable dimensions. The method introduced is the handwritten circuit reimplementation, in which weights are set by hand to reproduce a neuron's function as a falsifiability test. Universality is observed anecdotally across AlexNet, InceptionV1, VGG19, ResNetV2-50, and models trained on Places365, though the authors treat this as preliminary. The paper argues that if all three claims hold, mechanistic interpretability can be grounded as a natural science — with circuits serving as falsifiable, small-scope epistemic units that could eventually compose into full accounts of model behavior, analogous to cellular biology's role in zoology.

What to take away

1. Curve detectors in InceptionV1 layer mixed3b respond to curved boundaries with a radius of approximately 60 pixels and are organized into orientation-tiling families that jointly span the full 360 degrees of possible orientations.
2. The curve detector circuit is directly readable off the 5×5 convolution weights: positive weights are arranged in the shape of the curve, implementing tangent-curve detection at each point, while detectors in opposing orientations supply inhibitory weights.
3. A four-layer circuit for oriented dog-head detection implements an XOR-like computation by maintaining separate left-facing and right-facing pathways that mutually inhibit each other before converging into pose-invariant union neurons.
4. InceptionV1 neuron 4e:55 is polysemantic, responding to cat faces, fronts of cars, and cat legs — three unrelated stimulus classes — demonstrating that polysemanticity is a systematic challenge rather than an isolated anomaly.
5. A superposition circuit shows that the network takes a pure car-detector neuron in mixed4c and deliberately distributes its representation across next-layer neurons that primarily encode dog features, apparently to store more features than there are neurons by exploiting near-orthogonality in high-dimensional space.
6. Curve detectors and high-low frequency detectors were observed anecdotally across at least four architectures — AlexNet, InceptionV1, VGG19, and ResNetV2-50 — as well as in models trained on Places365 instead of ImageNet, providing preliminary support for the universality hypothesis.
7. The handwritten circuit methodology (a cleanroom reimplementation that hand-sets all weights based on a mechanistic understanding of a feature) serves as a falsifiability test: if the reimplemented weights replicate the original neuron's behavior, that is evidence the proposed algorithm is correct.
8. Seven convergent arguments are introduced for establishing that a neuron detects what is claimed: feature visualization, dataset examples, synthetic examples, joint tuning curves, weight-reading (circuit implementation), downstream client analysis, and handwritten circuit reimplementation.
9. An open hypothesis is raised that high-low frequency detectors — which detect low-frequency patterns on one side of the receptive field and high-frequency patterns on the other — may correspond to previously uncharacterized feature types in biological visual cortex, potentially offering a cross-domain prediction to test.
10. The paper raises the open question of whether polysemantic neurons can be resolved by 'unfolding' networks or by training objectives that penalize polysemanticity, framing this as equivalent to the disentangled representation learning problem but applied at the level of discriminative rather than generative model latent spaces.
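Takeaways 2 and 7 can be made concrete with a toy handwritten circuit. The sketch below is an illustration under stated assumptions, not the paper's implementation: the number of orientation channels (`N_ORIENT`), the arc geometry in `arc_mask`, and the unit weight values are all hypothetical, and the arc's spatial footprint is held fixed across orientations for simplicity. It hand-sets positive weights along an arc for the matching early detector and negative weights for the opposite orientation, then checks the resulting tuning.

```python
import numpy as np

N_ORIENT = 8  # hypothetical number of early curve-detector orientations
K = 5         # kernel size, matching the 5x5 convolution in the text

def arc_mask(radius=4.0, center=(6.0, 2.0)):
    """Boolean (K, K) mask of kernel cells lying near a circular arc."""
    ys, xs = np.mgrid[0:K, 0:K]
    dist = np.hypot(ys - center[0], xs - center[1])
    return np.abs(dist - radius) < 0.5

def handwritten_weights(target):
    """Hand-set (N_ORIENT, K, K) weights for a curve detector at `target`."""
    w = np.zeros((N_ORIENT, K, K))
    mask = arc_mask()
    w[target][mask] = 1.0                                # excite along the arc
    w[(target + N_ORIENT // 2) % N_ORIENT][mask] = -1.0  # inhibit the opposite orientation
    return w

def arc_stimulus(orientation):
    """Toy early-layer activations: one orientation channel fires along the arc."""
    acts = np.zeros((N_ORIENT, K, K))
    acts[orientation][arc_mask()] = 1.0
    return acts

def response(w, early_acts):
    """ReLU of the weighted sum, i.e. one output position of the convolution."""
    return max(0.0, float(np.sum(w * early_acts)))

w = handwritten_weights(target=0)
tuning = [response(w, arc_stimulus(o)) for o in range(N_ORIENT)]
print(tuning)  # largest at orientation 0, zero elsewhere
```

In the methodology the takeaways describe, a tuning curve like this would be compared against the original neuron's measured behavior; that comparison is where the falsifiability comes from, and it is not reproduced here.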

Peer brief — for seminar discussion

Olah et al. (2020) introduce the Circuits framework as a program for mechanistic interpretability of neural networks, organized around three speculative claims: that features (defined as directions in a layer's activation vector space, often individual neurons) are the fundamental unit of neural networks and are understandable; that features are connected by weights forming circuits — computational subgraphs — that are also understandable; and that analogous features and circuits recur across models and tasks (the universality hypothesis). The empirical work is conducted primarily in InceptionV1, with anecdotal cross-architecture comparison spanning AlexNet, VGG19, ResNetV2-50, and models trained on Places365. The load-bearing finding is that network weights, when examined at the level of small circuits, encode legible algorithms: the 5×5 convolution weights between early and late curve detectors in InceptionV1 layer mixed3b literally spell out a tangent-curve detection procedure, with positive weights tracing the curve shape and inhibitory weights at opposing orientations. A four-layer dog-head circuit separately maintains left- and right-facing detection pathways with mutual inhibition before unioning them into pose-invariant responses. Polysemantic neuron 4e:55, responding to cat faces, car fronts, and cat legs, is presented as a core unsolved challenge, explained mechanistically through a superposition circuit in mixed4c where a pure car feature is spread across dog-detector neurons to exploit near-orthogonality in high-dimensional space. The method introduced for falsifiability is the handwritten circuit reimplementation, in which all weights are set by hand based on the proposed algorithm and the result compared to the original neuron's behavior; an alternative the paper does not adopt is activation patching or causal tracing, which tests circuit claims by intervening on activations rather than requiring complete weight reconstruction. 
The universality evidence is explicitly characterized as anecdotal, and this is the clearest target for pushback: the paper offers visual inspection across multiple architectures but no quantitative similarity metric, no statistical test, and no control for the possibility that correlated but mechanistically distinct features are being conflated — a concern the authors themselves raise and then leave unresolved. The paper's central prediction is that if universality holds broadly, a periodic table of visual features becomes achievable; it also speculatively proposes that high-low frequency detectors, undescribed in prior neuroscience literature, might be confirmable in biological visual cortex, which would constitute strong cross-domain evidence for universality.
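One simple form a quantitative similarity metric could take is a correlation match between neurons of two models recorded on the same inputs. The sketch below is an assumption made for illustration, not something the paper reports: the function name `best_matches` and the synthetic activations are hypothetical. It also does not answer the concern raised above, since a high correlation alone cannot distinguish the same feature from a correlated but mechanistically distinct one.

```python
import numpy as np

def best_matches(acts_a, acts_b):
    """For each neuron in model A, find the most correlated neuron in model B.

    acts_a: (n_inputs, n_neurons_a) activations of model A
    acts_b: (n_inputs, n_neurons_b) activations of model B on the same inputs
    Returns (index of best match in B, Pearson correlation of that match).
    """
    za = (acts_a - acts_a.mean(axis=0)) / (acts_a.std(axis=0) + 1e-8)
    zb = (acts_b - acts_b.mean(axis=0)) / (acts_b.std(axis=0) + 1e-8)
    corr = za.T @ zb / acts_a.shape[0]  # (n_neurons_a, n_neurons_b)
    return corr.argmax(axis=1), corr.max(axis=1)

# Synthetic stand-in data: one shared signal appears in both "models".
rng = np.random.default_rng(0)
signal = rng.normal(size=1000)
acts_a = rng.normal(size=(1000, 3))
acts_b = rng.normal(size=(1000, 4))
acts_a[:, 0] = signal + 0.1 * rng.normal(size=1000)
acts_b[:, 2] = signal + 0.1 * rng.normal(size=1000)

idx, score = best_matches(acts_a, acts_b)
print(idx[0], round(float(score[0]), 3))  # neuron 0 in A pairs with neuron 2 in B
```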

Frameworks (2)

Circuits Thread
An open scientific collaboration, hosted on the Distill Slack, that studies the inner workings of neural networks through zoomed-in mechanistic investigation
Interpretability as Natural Science
Proposed paradigm for evaluating interpretability work through empirical falsifiability rather than benchmarks or user studies

Findings (7)

InceptionV1 implements a four-layer circuit for pose-invariant dog head detection with mirrored left/right pathways that inhibit each other then unite, exhibiting XOR-like properties
Evidence that neural networks learn sophisticated invariance mechanisms through structured circuits rather than loose feature aggregation
Weights between early and full curve detectors in InceptionV1 form a curve of positive weights at tangent positions, with opposing orientations inhibitory
Demonstrates that meaningful algorithms can be read directly off floating-point weights in a neural network
InceptionV1 spreads car feature from a pure car detector in mixed4c across dog detector neurons in the next layer
Circuit-level evidence that polysemantic neurons arise deliberately through superposition rather than entangled computation
Curve detectors found across AlexNet, InceptionV1, VGG19, ResNetV2-50 and models trained on Places365
Anecdotal evidence for the universality of low-level visual features across different architectures and datasets
InceptionV1 neuron 4e:55 responds to cat faces, fronts of cars, and cat legs as unrelated stimuli
Concrete example of polysemantic neuron demonstrating the challenge to the circuits agenda
High-low frequency detectors found across AlexNet, InceptionV1, VGG19, and ResNetV2-50
Second low-level feature type demonstrating cross-architecture universality
Curve detecting neurons found in every non-trivial vision model carefully examined
Empirical basis for treating curve detectors as a canonical example of meaningful, understandable features
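The mirrored-pathway structure in the first finding can be reduced to a two-unit toy. This is a minimal sketch under assumptions: the weight values and the two scalar inputs are hypothetical stand-ins for the oriented detectors in the actual four-layer circuit. Each pathway is excited by its own orientation and inhibited by the other, and a union unit then sums the two.

```python
import numpy as np

# Hypothetical hand-set weights. Rows are the left- and right-facing pathway
# units; columns are [left_evidence, right_evidence].
W_PATHWAYS = np.array([[1.0, -1.0],
                       [-1.0, 1.0]])
W_UNION = np.array([1.0, 1.0])  # pose-invariant unit sums both pathways

def pose_invariant_head(left_evidence, right_evidence):
    """Mutual inhibition between pathways, followed by a union over cases."""
    x = np.array([left_evidence, right_evidence], dtype=float)
    pathways = np.maximum(W_PATHWAYS @ x, 0.0)  # ReLU per pathway
    return float(W_UNION @ pathways)

for left, right in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(left, right, pose_invariant_head(left, right))
# 0 0 0.0
# 1 0 1.0
# 0 1 1.0
# 1 1 0.0
```

The output is high when exactly one orientation is present and zero when both or neither are, which is the XOR-like behavior the finding refers to.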

Claims (14)

Polysemantic neurons are a major challenge for the circuits agenda, because a neuron with N meanings connected to a neuron with M meanings yields N×M effective connections that cannot be considered individually.
Precise characterization of why polysemanticity poses a combinatorial obstacle to circuit analysis
If the universality hypothesis is broadly true, it raises the exciting possibility that artificial neural networks could predict features previously unknown in biological neural networks.
Speculative extension of universality to neuroscience, with high-low frequency detectors as a candidate prediction
The typical case is that neurons (or other directions in activation space) are understandable after thousands of hours of study, even when initially mysterious.
Authors' interpretive assertion based on extensive empirical investigation, countering texture-only skepticism
Superposition exploits the geometry of high-dimensional spaces, which allow exponentially many almost-orthogonal vectors but only n strictly orthogonal ones.
Mechanistic explanation for why superposition is geometrically feasible
Circuits could act as an epistemic foundation for interpretability by breaking down model behavior into falsifiable statements about small subgraphs.
Normative vision for how the circuits agenda could resolve the pre-paradigmatic state of interpretability
Superposition is in some sense deliberate: the model converts pure neurons into polysemantic neurons to store more features in fewer neurons.
Interpretation of the cars-in-superposition circuit finding as an intentional representational strategy
Features are connected by weights forming circuits, and these circuits can be rigorously studied and understood as meaningful algorithms.
Second of three speculative claims asserting that subgraphs of neural networks are tractable and meaningful objects of study
Qualitative research results can change the world: the discovery of cells was qualitative, just as interpretability research is today.
Historical argument defending qualitative interpretability research against dismissal as unscientific
In the long run, studying circuit motifs may be more important than studying individual circuits for understanding neural networks.
Strategic claim about the relative importance of motif-level abstraction over circuit-level analysis
Circuit claims are falsifiable: if you understand a circuit, you should be able to predict what changes when you edit the weights.
Argument that circuits methodology meets natural-science standards of falsifiability
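The geometric claim about superposition above can be checked numerically. The sketch below is an illustration, not a model of how a network picks its directions: it uses random unit vectors, and the function name `max_abs_cosine` and the chosen sizes are assumptions. It draws more vectors than there are dimensions and reports the largest absolute cosine similarity between any two of them.

```python
import numpy as np

def max_abs_cosine(n_vectors, dim, seed=0):
    """Largest |cosine similarity| between any two of n random unit vectors."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=(n_vectors, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    cos = v @ v.T
    np.fill_diagonal(cos, 0.0)  # ignore each vector's similarity with itself
    return float(np.abs(cos).max())

# 2000 vectors cannot all be exactly orthogonal in any of these spaces
# (at most `dim` can), but the worst-case overlap shrinks as `dim` grows.
results = {dim: max_abs_cosine(2000, dim) for dim in (10, 100, 1000)}
for dim, worst in results.items():
    print(dim, round(worst, 3))
```

In 10 dimensions some pair is nearly parallel, while in 1000 dimensions every pair stays close to orthogonal, which is the slack that superposition is described as exploiting.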

Hypotheses (2)

We hypothesize that high-low frequency detectors, if predicted by artificial neural network universality, might be found in biological neural networks.
Specific cross-domain prediction mentioned by neuroscientists in conversation with the authors
We hypothesize that polysemantic neurons may be resolvable by unfolding networks or training to avoid polysemanticity.
Forward-looking proposal for how the polysemanticity challenge to circuits research might be overcome

Questions (6)

What kind of picture of neural networks would emerge if we treated individual neurons, even individual weights, as being worthy of serious investigation?
Central motivating question for the circuits research program
Lack of rigorous cross-model comparison demonstrating that specific named features (not just correlated ones) form across architectures
Explicitly identified research gap: anecdotal evidence exists but rigorous characterization is absent
Is deep learning at a similar, albeit more modest, transition point as the invention of the microscope?
Motivating analogy question framing the circuits agenda as a potential paradigm shift in interpretability
Whether overall model behavior can be broken down into statements about circuits remains undemonstrated
Identified gap: circuits are small-scope; linking them to model-level behavior requires future work
Is the apparent universality of some low-level vision features the exception or the rule?
Open empirical question following anecdotal cross-model universality findings
No established method for resolving polysemantic neurons into pure features at scale
Identified gap linking polysemanticity challenge to disentangled representations literature

Original abstract

This article introduces the 'circuits' approach to understanding neural networks by zooming in on individual neurons and their connections, analogous to how microscopes revealed cells in biology. The authors present three speculative claims: (1) features are the fundamental unit of neural networks corresponding to directions in activation space, (2) features are connected by weights forming circuits that implement meaningful algorithms, and (3) analogous features and circuits form across models and tasks. Through examples like curve detectors, dog head detectors, and car detection circuits, the paper argues that neural networks are surprisingly interpretable at fine-grained levels of analysis.

Cited by (2)

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
Propositional truth in LLMs is not encoded as a single linear direction but as a multi-dimensional subspace that can be characterized by concept cones—sets of all nonnegative linear combinations of or
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Distributed alignment search (DAS) resolves two blocking limitations of prior causal abstraction work—brute-force alignment search and the localist assumption that high-level variables map to disjoint