paper:geiger-goodfire-world-inside-neural-networks-2026
The World Inside Neural Networks
TL;DR
Neural networks trained on structured data develop internal *neural geometries* that mirror the geometric structure of the external world — circular manifolds for numerical sequences, spatial manifolds for positional concepts — and this manifold-level description is more faithful than the sparse linear decompositions produced by Sparse Autoencoders. The Mountain Car environment provides the load-bearing demonstration: car position is encoded as a 1D curved manifold, and linear interventions that ignore this curvature cross activation-space 'voids,' producing incoherent behavior, while interventions that follow the manifold curve yield smooth, coherent control. Goodfire's framing paper introduces *neural geometry* as a principled descriptive substrate and argues directly against SAE-based interpretability, claiming SAE features 'shatter' manifolds into many small, apparently-unrelated pieces that obscure overarching semantic structure. Authored by Geiger, Lubana, Fel, Merullo, Byun, Lewis, and McGrath, the paper positions itself as the conceptual anchor for a coordinated May 2026 trio alongside the Feucht geometric-calculator and Wurgaft manifold-steering papers. The implication is a reorientation of the interpretability program: rather than decomposing representations into ever-more-atomic features, the field should recover fewer, geometrically-accurate objects that track real-world structure — a shift with direct consequences for steering, control, and what counts as a successful mechanistic explanation.
What to take away
- 1. Car position in the Mountain Car environment is encoded as a 1D curved manifold in activation space, not as a linear direction.
- 2. Linear interventions on the Mountain Car position representation cross activation-space 'voids,' producing incoherent model behavior, while manifold-following interventions produce smooth control.
- 3. SAE features 'shatter' curved manifolds into many small, apparently-unrelated pieces, causing SAE-based interpretability to obscure overarching semantic structure rather than reveal it.
- 4. Optimization pressure on networks trained on geometrically structured data is sufficient to produce internal manifolds that mirror real-world geometry: the geometry arises from the training signal, not from an architectural prior.
- 5. Numerical concepts are represented as approximately circular (1D closed) manifolds, consistent with the cyclic structure of modular arithmetic and calendar-like sequences.
- 6. The paper was published 2026-05-07 as the first of a coordinated three-paper release from Goodfire, framing the empirical results of the Feucht (geometric calculator) and Wurgaft (manifold steering) companion papers.
- 7. The authors include Atticus Geiger, Ekdeep Singh Lubana, Thomas Fel, Jack Merullo, Michael Jae Byun, Owen Lewis, and Thomas McGrath, spanning Goodfire and external collaborators.
- 8. An open question raised is whether manifold-level geometric descriptions can be made fully compositional — i.e., whether intersecting or combining manifolds preserves the interpretable structure needed for systematic control.
- 9. To replicate the Mountain Car intervention comparison, a researcher would train a policy network on the Mountain Car environment, identify the position-encoding layer via probing, fit a 1D manifold to those activations, and compare behavioral coherence between linear and manifold-constrained intervention paths.
- 10. The paper predicts that shifting interpretability methodology from atomic SAE features to manifold-level geometric objects will improve both the fidelity of mechanistic explanations and the reliability of activation-steering interventions in larger language models such as Llama-3.
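The replication steps in takeaway 9 can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the paper's procedure: the "activations" are a hypothetical half-circle arc embedded in 8 dimensions rather than a trained Mountain Car policy, the manifold path is approximated by nearest-neighbor averaging (a stand-in for whatever manifold fit the authors use), and distance to the nearest observed activation is used as a rough proxy for behavioral coherence. All function names are invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(t, dim=8):
    """Map scalar position t in [0, 1] to a point on a half-circle arc in R^dim (assumed geometry)."""
    angle = np.pi * np.asarray(t, dtype=float)
    pts = np.zeros(angle.shape + (dim,))
    pts[..., 0] = np.cos(angle)
    pts[..., 1] = np.sin(angle)
    return pts

# Hypothetical stand-in for position-layer activations collected from rollouts.
positions = rng.uniform(0.0, 1.0, size=500)
acts = embed(positions) + 0.01 * rng.normal(size=(500, 8))

def off_manifold_distance(points, reference):
    """Distance from each point to its nearest observed activation."""
    diffs = points[:, None, :] - reference[None, :, :]
    return np.linalg.norm(diffs, axis=-1).min(axis=1)

def linear_path(start, end, steps=21):
    """Straight-line interpolation in activation space."""
    alphas = np.linspace(0.0, 1.0, steps)[:, None]
    return (1 - alphas) * start + alphas * end

def manifold_path(t_start, t_end, positions, acts, steps=21, k=10):
    """Follow the data: at each target position, average the k activations
    whose recorded positions are closest to it."""
    out = []
    for t in np.linspace(t_start, t_end, steps):
        idx = np.argsort(np.abs(positions - t))[:k]
        out.append(acts[idx].mean(axis=0))
    return np.stack(out)

lin = linear_path(embed(0.05), embed(0.95))
man = manifold_path(0.05, 0.95, positions, acts)

# Largest excursion away from observed activations along each path.
lin_gap = off_manifold_distance(lin, acts).max()
man_gap = off_manifold_distance(man, acts).max()
print(f"linear path max gap:   {lin_gap:.3f}")
print(f"manifold path max gap: {man_gap:.3f}")
```

In a real replication, the two paths would be patched into the policy network and compared on behavior (for example, whether the resulting actions match those taken at the intended position), which this sketch does not attempt.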
Peer brief — for seminar discussion
Published May 7, 2026, this Goodfire research article by Geiger, Lubana, Fel, Merullo, Byun, Lewis, and McGrath serves as the conceptual manifesto for a coordinated three-paper release, with the Feucht geometric-calculator paper and Wurgaft manifold-steering paper providing empirical support. The core move is to argue that neural networks trained on structured data develop internal *neural geometries* (curved manifolds in activation space) that reflect the geometric structure of the domain, and that these manifold-level descriptions should replace sparse linear decompositions (particularly SAE features) as the primary unit of mechanistic interpretability.
The load-bearing empirical anchor is the Mountain Car environment: car position is encoded as a 1D curved manifold, and this matters operationally because linear interventions cross activation-space voids and produce incoherent behavior, while interventions constrained to follow the manifold produce smooth, coherent control. Numerical concepts are described as approximately circular manifolds, and the paper claims geometry emerges from optimization pressure on structured training data rather than from any architectural prior. The method introduced is *neural geometry* as a descriptive framework (identifying, fitting, and intervening along curved manifolds), in contrast to SAE-based sparse decomposition, which the paper argues 'shatters' manifolds into many small, apparently-unrelated features that obscure semantic structure. The predictive claim is that adopting manifold-level descriptions will improve both interpretability fidelity and steering reliability in larger systems, including production-scale language models like Llama-3.
The most contestable element for a critical reader is the scope of the Mountain Car result: the environment is low-dimensional, the policy network is small, and the manifold is genuinely 1D, making it a nearly ideal case for the framework. Whether the same manifold-following logic scales to the high-dimensional, polysemantic, entangled representations found in large transformers, where manifolds are unlikely to be cleanly separable, is left as an open question rather than demonstrated. A skeptic would also push back on the SAE critique's framing: shattering could reflect genuine polysemy rather than a failure of the decomposition method, and it is not shown that manifold descriptions are more causally complete than SAE features, only that they are more geometrically coherent. The paper does not provide quantitative benchmarks comparing manifold-steering to SAE-based steering on a shared task, which would be the most direct way to adjudicate the claim.
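The brief describes numerical concepts as approximately circular manifolds. One way a reader might check for that kind of structure is sketched here, under loud assumptions: the activations are synthetic (a noisy circle randomly rotated into 16 dimensions, with 12 values chosen to mimic a calendar-like cycle), and PCA followed by a radius and ordering check is a choice made for this illustration, not a method attributed to the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_values, dim = 12, 16  # illustrative sizes, e.g. months of a year

# Hypothetical activations: a circle in a random 2D plane of R^dim, plus noise.
basis, _ = np.linalg.qr(rng.normal(size=(dim, 2)))  # orthonormal columns
values = np.repeat(np.arange(n_values), 20)
theta = 2 * np.pi * values / n_values
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
acts = circle @ basis.T + 0.02 * rng.normal(size=(values.size, dim))

# PCA via SVD: project onto the two leading principal directions.
centered = acts - acts.mean(axis=0)
_, sing, vt = np.linalg.svd(centered, full_matrices=False)
proj = centered @ vt[:2].T
var_top2 = (sing[:2] ** 2).sum() / (sing ** 2).sum()

# A circle has nearly constant radius in its plane.
radius = np.linalg.norm(proj, axis=1)
radius_cv = radius.std() / radius.mean()

# Cyclic-order check: stepping value -> value + 1 (wrapping around) should
# always rotate the class mean in the same direction.
class_means = np.stack([proj[values == v].mean(axis=0) for v in range(n_values)])
angles = np.arctan2(class_means[:, 1], class_means[:, 0])
steps = np.angle(np.exp(1j * (np.roll(angles, -1) - angles)))  # wrapped to (-pi, pi]
is_cyclic = bool(np.all(steps > 0) or np.all(steps < 0))

print(f"variance in top-2 PCs: {var_top2:.3f}")
print(f"radius coefficient of variation: {radius_cv:.3f}")
print(f"consistent cyclic order: {is_cyclic}")
```

On real model activations none of these three checks is guaranteed to pass; a low top-2 variance or inconsistent step direction would suggest the representation is not well described by a single planar circle.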
Methods (1)
- Sparse Autoencoders (SAE)
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
Frameworks (1)
- Neural Geometry Framework
Conceptual scheme introduced in this paper: neural networks develop internal geometric representations that mirror real-world geometry, providing the right level of description for interpretability and control.
Findings (1)
- In the Mountain Car case study, car position is a 1D manifold; linear interventions cross voids causing incoherence; following the 1D curve produces smooth control.
Empirical demonstration that a semantically meaningful variable is encoded as a curved manifold, and that respecting its geometry is critical for effective intervention.
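The 'void' language in this finding can be made concrete with a short worked calculation. This is an idealization introduced here, not the paper's derivation: it assumes the 1D manifold is a circle of radius $r$ parameterized by an angle $\theta$, and compares a straight-line interpolation between two points on it with the path that stays on the circle.

```latex
\[
x(\theta) = r\,(\cos\theta,\ \sin\theta), \qquad
x_\alpha = (1-\alpha)\,x(\theta_1) + \alpha\,x(\theta_2), \quad \alpha \in [0,1].
\]
\[
\lVert x_\alpha \rVert^2
  = r^2\left[(1-\alpha)^2 + \alpha^2 + 2\alpha(1-\alpha)\cos\Delta\right],
\qquad \Delta = \theta_2 - \theta_1 .
\]
\[
\lVert x_{1/2} \rVert = r\,\lvert\cos(\Delta/2)\rvert
\quad\Longrightarrow\quad
\text{distance from the midpoint to the circle}
  = r\left(1 - \lvert\cos(\Delta/2)\rvert\right).
\]
```

Under this assumption the gap shrinks to zero only as $\Delta \to 0$ and reaches $r$ at $\Delta = \pi$, where the straight-line midpoint sits at the center of the circle. The manifold-following path $x(\theta_1 + \alpha\Delta)$ keeps norm $r$ for every $\alpha$.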
Claims (6)
- Linear interventions across voids in activation space produce incoherent output, while following the manifold curve produces smooth control.
General principle derived from the Mountain Car experiment: curved manifold-following yields coherent manipulation, linear shortcuts fail.
- SAE features tend to shatter manifolds into many small and apparently-unrelated pieces, obscuring the overarching semantic structure.
Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.
- Manifold-level descriptions recover overarching semantic structure that SAE features miss.
Positive claim that geometric descriptions retain the conceptual coherence lost in atomized feature decompositions.
- Geometry arises from optimization pressure on networks trained on structured data.
Mechanistic explanation: geometric structure emerges naturally from standard training on data with underlying structure.
- Networks encode structured geometric concepts that reflect external reality.
Core claim of the paper: the right level of description for neural representations is geometric structure mirroring the world.
- Curved manifolds often represent concepts better than linear directions.
Proposes that nonlinear geometric structure is superior to linear feature spaces for capturing semantic content.
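The 'shattering' claim above can be illustrated with a toy calculation. This is a sketch under explicit assumptions, not an SAE experiment: a fixed dictionary of evenly spaced unit atoms with top-1 coding stands in for a trained sparse autoencoder, and the manifold is an ideal unit circle. The function name `shatter_stats` is hypothetical. The point it shows is geometric: to reconstruct a curved 1D manifold accurately with one active atom at a time, the dictionary needs many atoms, and each one fires on only a small arc.

```python
import numpy as np

def shatter_stats(n_atoms, n_points=3600):
    """Top-1 sparse coding of a unit circle with n_atoms evenly spaced atoms.

    Returns (max reconstruction error, largest fraction of the circle
    on which any single atom is the active one).
    """
    theta = np.linspace(0.0, 2 * np.pi, n_points, endpoint=False)
    points = np.stack([np.cos(theta), np.sin(theta)], axis=1)

    atom_angles = 2 * np.pi * np.arange(n_atoms) / n_atoms
    atoms = np.stack([np.cos(atom_angles), np.sin(atom_angles)], axis=1)

    scores = points @ atoms.T                    # (n_points, n_atoms)
    winner = scores.argmax(axis=1)               # the single active atom
    coeff = scores[np.arange(n_points), winner]  # its activation
    recon = coeff[:, None] * atoms[winner]

    max_err = np.linalg.norm(points - recon, axis=1).max()
    arc_fraction = np.bincount(winner, minlength=n_atoms).max() / n_points
    return max_err, arc_fraction

for m in (4, 16, 64):
    err, frac = shatter_stats(m)
    print(f"atoms={m:3d}  max error={err:.4f}  largest arc fraction={frac:.4f}")
```

In this toy setting the worst-case error is bounded by sin(π / n_atoms), so halving the error roughly doubles the number of pieces. Whether trained SAEs on real activations behave this way is the empirical question the paper's critique raises; the sketch only shows why such tiling is a natural outcome for curved structure.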
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior · in corpus · 2026 · ≈ 87%
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior · Daniel Wurgaft, Can Rager, Matthew Kowal, Vasudev Shyam, Sheridan Feucht, Usha Bhalla, Tal Haklay, Eric Bigelow, Raphael Sarfati, Thomas McGrath, Owen Lewis, Jack Merullo, Noah Goodman, Thomas Fel, Atticus Geiger, Ekdeep Singh Lubana · 2026 · ≈ 86%
- Beyond Geometry: Comparing the Temporal Structure of Computation in Neural Circuits with Dynamical Similarity Analysis · Mitchell Ostrow, Adam Eisen, Leo Kozachkov, Ila Fiete · 2023 · ≈ 83%
- Conceptual Views of Neural Networks: A Framework for Neuro-Symbolic Analysis · Johannes Hirth and Tom Hanika · 2026 · ≈ 82%
- Neural Manifolds as Crystallized Embeddings: A Synthesis of the Free Energy Principle, Generalized Synchronization, and Hebbian Plasticity · Vikas N. O'Reilly-Shah · 2026 · ≈ 82%
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets · in corpus · 2023 · ≈ 81%
- Causal Probing for Internal Visual Representations in Multimodal Large Language Models · Zehao Deng, Tianjie Ju, Zheng Wu, Liangbo He, Jun Lan, Huijia Zhu, Weiqiang Wang, Zhuosheng Zhang · 2026 · ≈ 81%
- The Platonic Representation Hypothesis · in corpus · 2024 · ≈ 81%
- From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs · Jiyuan An, Liner Yang, Mengyan Wang, Luming Lu, Weihua An, Erhong Yang · 2026 · ≈ 81%
- Human Cognition in Machines: A Unified Perspective of World Models · Timothy Rupprecht, Pu Zhao, Amir Taherin, Arash Akbari, Arman Akbari, Yumei He, Sean Duffy, Juyi Lin, Yixiao Chen, Rahul Chowdhury, Enfu Nan, Yixin Shen, Yifan Cao, Haochen Zeng, Weiwei Chen, Geng Yuan, Jennifer Dy, Sarah Ostadabbas, Silvia Zhang, David Kaeli, Edmund Yeh, Yanzhi Wang · 2026 · ≈ 81%
- On the Road with 16 Neurons: Mental Imagery with Bio-inspired Deep Neural Networks · Alice Plebe and Mauro Da Lio · 2026 · ≈ 81%
- From Transformer to Biology: A Hierarchical Model for Attention in Complex Problem-Solving · Zhongqiao Lin, Yunwei Li, Tianming Yang · 2025 · ≈ 81%
- Zoom In: An Introduction to Circuits · in corpus · 2020 · ≈ 80%
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? · in corpus · 2025 · ≈ 80%
- Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations · in corpus · 2023 · ≈ 79%
- Learning without neurons in physical systems · in corpus · 2022 · ≈ 79%
Cross-corpus bridges (2)
same_concept_as · Nomic cosine
External markdown files that talk about the same concept as this entity.
- zen · MANIFOLD-ALIGNMENT · applied/kobun-shape-manifold.md · 0.762
- zen · PROJECT-EIGENQUESTIONS · applied/kobun-shape-eigenquestions.md · 0.756