The World Inside Neural Networks

By Atticus Geiger · Ekdeep Singh Lubana · Thomas Fel · Jack Merullo · Michael Jae Byun · Owen Lewis · Thomas McGrath

Neural Geometry · atomic features · Neural Geometry Framework · Sparse Autoencoders (SAE) · Manifold Steering · Mechanistic Interpretability · Mountain Car Case Study · Neural Geometries

TL;DR

Neural networks trained on structured data develop internal *neural geometries* that mirror the geometric structure of the external world — circular manifolds for numerical sequences, spatial manifolds for positional concepts — and this manifold-level description is more faithful than the sparse linear decompositions produced by Sparse Autoencoders. The Mountain Car environment provides the load-bearing demonstration: car position is encoded as a 1D curved manifold, and linear interventions that ignore this curvature cross activation-space 'voids,' producing incoherent behavior, while interventions that follow the manifold curve yield smooth, coherent control. Goodfire's framing paper introduces *neural geometry* as a principled descriptive substrate and argues directly against SAE-based interpretability, claiming SAE features 'shatter' manifolds into many small, apparently-unrelated pieces that obscure overarching semantic structure. Authored by Geiger, Lubana, Fel, Merullo, Byun, Lewis, and McGrath, the paper positions itself as the conceptual anchor for a coordinated May 2026 trio alongside the Feucht geometric-calculator and Wurgaft manifold-steering papers. The implication is a reorientation of the interpretability program: rather than decomposing representations into ever-more-atomic features, the field should recover fewer, geometrically-accurate objects that track real-world structure — a shift with direct consequences for steering, control, and what counts as a successful mechanistic explanation.

What to take away

1. Car position in the Mountain Car environment is encoded as a 1D curved manifold in activation space, not as a linear direction.
2. Linear interventions on the Mountain Car position representation cross activation-space 'voids,' producing incoherent model behavior, while manifold-following interventions produce smooth control.
3. SAE features 'shatter' curved manifolds into many small, apparently-unrelated pieces, causing SAE-based interpretability to obscure overarching semantic structure rather than reveal it.
4. Optimization pressure on networks trained on geometrically structured data is sufficient to produce internal manifolds that mirror real-world geometry: the geometry arises from the training signal, not from an architectural inductive bias.
5. Numerical concepts are represented as approximately circular (1D closed) manifolds, consistent with the cyclic structure of modular arithmetic and calendar-like sequences.
6. The paper was published 2026-05-07 as the first of a coordinated three-paper release from Goodfire, framing the empirical results of the Feucht (geometric calculator) and Wurgaft (manifold steering) companion papers.
7. The authors include Atticus Geiger, Ekdeep Singh Lubana, Thomas Fel, Jack Merullo, Michael Jae Byun, Owen Lewis, and Thomas McGrath, spanning Goodfire and external collaborators.
8. An open question raised is whether manifold-level geometric descriptions can be made fully compositional — i.e., whether intersecting or combining manifolds preserves the interpretable structure needed for systematic control.
9. To replicate the Mountain Car intervention comparison, a researcher would train a policy network on the Mountain Car environment, identify the position-encoding layer via probing, fit a 1D manifold to those activations, and compare behavioral coherence between linear and manifold-constrained intervention paths.
10. The paper predicts that shifting interpretability methodology from atomic SAE features to manifold-level geometric objects will improve both the fidelity of mechanistic explanations and the reliability of activation-steering interventions in larger language models such as Llama-3.
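
The comparison described in item 9 can be sketched on synthetic data. This is a minimal illustration, not the paper's procedure: there is no trained policy here, the "activations" are a hypothetical noisy arc embedded in 16 dimensions, the manifold is fit with a simple polynomial regression from position to activation, and the position range [-1.2, 0.6] is assumed from the classic Mountain Car task. The maximum distance from each intervention path to the fitted curve serves as a stand-in for how far the path strays into a void.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: synthetic stand-in for probed activations. Position p is mapped
# onto a half-circle arc, then linearly embedded in a 16-dim activation space.
def synth_activations(p, embed):
    theta = (p + 1.2) / 1.8 * np.pi
    arc = np.stack([np.cos(theta), np.sin(theta)], axis=-1)
    return arc @ embed

embed = rng.normal(size=(2, 16))
positions = rng.uniform(-1.2, 0.6, size=500)
acts = synth_activations(positions, embed) + 0.01 * rng.normal(size=(500, 16))

# Fit a 1D manifold: least-squares polynomial map from position to activation.
def poly_features(p, degree=5):
    return np.stack([p**k for k in range(degree + 1)], axis=-1)

coef, *_ = np.linalg.lstsq(poly_features(positions), acts, rcond=None)

def curve(p):
    """Point(s) on the fitted manifold for position(s) p, shape (n, 16)."""
    return poly_features(np.atleast_1d(p)) @ coef

# Dense sample of the fitted curve, used to measure distance to the manifold.
grid = curve(np.linspace(-1.2, 0.6, 400))

def off_manifold_distance(path):
    d = np.linalg.norm(path[:, None, :] - grid[None, :, :], axis=-1)
    return d.min(axis=1)

# Two ways to move the representation from one position to another.
t = np.linspace(0.0, 1.0, 50)
p_start, p_end = -1.1, 0.5
a_start, a_end = curve(p_start)[0], curve(p_end)[0]
linear_path = (1 - t)[:, None] * a_start + t[:, None] * a_end   # straight chord
manifold_path = curve(p_start + t * (p_end - p_start))          # follows the curve

linear_gap = off_manifold_distance(linear_path).max()
manifold_gap = off_manifold_distance(manifold_path).max()
print(f"max off-manifold distance: linear={linear_gap:.3f}, manifold={manifold_gap:.3f}")
```

In a real replication, `acts` would come from the position-encoding layer of the trained policy, and the comparison would be on behavioral coherence after patching each path's activations back into the network rather than on geometric distance alone.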

Peer brief — for seminar discussion

Published May 7, 2026, this Goodfire research article by Geiger, Lubana, Fel, Merullo, Byun, Lewis, and McGrath serves as the conceptual manifesto for a coordinated three-paper release, with the Feucht geometric-calculator paper and Wurgaft manifold-steering paper providing empirical support. The core move is to argue that neural networks trained on structured data develop internal *neural geometries* (curved manifolds in activation space) that reflect the geometric structure of the domain, and that these manifold-level descriptions should replace sparse linear decompositions (particularly SAE features) as the primary unit of mechanistic interpretability.

The load-bearing empirical anchor is the Mountain Car environment: car position is encoded as a 1D curved manifold, and this matters operationally because linear interventions cross activation-space voids and produce incoherent behavior, while interventions constrained to follow the manifold produce smooth, coherent control. Numerical concepts are described as approximately circular manifolds, and the paper claims geometry emerges from optimization pressure on structured training data rather than from any architectural prior. The method introduced is *neural geometry* as a descriptive framework (identifying, fitting, and intervening along curved manifolds), in contrast to the alternative of SAE-based sparse decomposition, which the paper argues 'shatters' manifolds into many small, apparently-unrelated features that obscure semantic structure. The predictive claim is that adopting manifold-level descriptions will improve both interpretability fidelity and steering reliability in larger systems, including production-scale language models like Llama-3.

The most contestable element for a critical reader is the scope of the Mountain Car result: the environment is low-dimensional, the policy network is small, and the manifold is genuinely 1D, making it a nearly ideal case for the framework. Whether the same manifold-following logic scales to the high-dimensional, polysemantic, entangled representations found in large transformers, where manifolds are unlikely to be cleanly separable, is left as an open question rather than demonstrated. A skeptic would also push back on the SAE critique's framing: shattering could reflect genuine polysemy rather than a failure of the decomposition method, and it is not shown that manifold descriptions are more causally complete than SAE features, only that they are more geometrically coherent. The paper does not provide quantitative benchmarks comparing manifold-steering to SAE-based steering on a shared task, which would be the most direct way to adjudicate the claim.
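
To make the "approximately circular manifold" description concrete, here is a small sketch of how one might test for circularity. Everything in it is an assumption for illustration: the activations are a planted circle for a hypothetical 12-item cyclic concept in a 32-dimensional space, not data from any model, and PCA is just one simple way to look for the plane. The checks are whether two principal components capture most of the variance, whether the radius is nearly constant, and whether consecutive items are separated by a constant angular step.

```python
import numpy as np

rng = np.random.default_rng(1)
period, dim = 12, 32

# Assumption: hypothetical activations for a cyclic concept (e.g. month index
# 0..11), planted as a unit circle in a random 2D plane plus small noise.
k = np.arange(period)
angles = 2 * np.pi * k / period
plane, _ = np.linalg.qr(rng.normal(size=(dim, 2)))   # orthonormal basis, shape (dim, 2)
acts = np.stack([np.cos(angles), np.sin(angles)], axis=1) @ plane.T
acts += 0.02 * rng.normal(size=acts.shape)

# PCA via SVD of the centered activations; keep the top two components.
centered = acts - acts.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:2].T

radii = np.linalg.norm(coords, axis=1)
explained = (s[:2] ** 2).sum() / (s ** 2).sum()

# Recover each item's angle; abs() ignores the arbitrary orientation PCA picks.
recovered = np.unwrap(np.arctan2(coords[:, 1], coords[:, 0]))
steps = np.abs(np.diff(recovered))

print(f"variance explained by 2 PCs: {explained:.3f}")
print(f"radius coefficient of variation: {radii.std() / radii.mean():.3f}")
print(f"mean angular step: {steps.mean():.3f} (2*pi/12 = {2 * np.pi / period:.3f})")
```

On real activations these statistics would only be approximate, and a passing check shows the data is consistent with a circle, not that the model uses the circle causally; that requires interventions.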

Methods (1)

Sparse Autoencoders (SAE)
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.

Frameworks (1)

Neural Geometry Framework
Conceptual scheme introduced in this paper: neural networks develop internal geometric representations that mirror real-world geometry, providing the right level of description for interpretability and control.

Findings (1)

In the Mountain Car case study, car position is a 1D manifold; linear interventions cross voids causing incoherence; following the 1D curve produces smooth control.
Empirical demonstration that a semantically meaningful variable is encoded as a curved manifold, and that respecting its geometry is critical for effective intervention.

Claims (6)

Linear interventions across voids in activation space produce incoherent output, while following the manifold curve produces smooth control.
General principle derived from the Mountain Car experiment: curved manifold-following yields coherent manipulation, linear shortcuts fail.
SAE features tend to shatter manifolds into many small and apparently-unrelated pieces, obscuring the overarching semantic structure.
Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.
Manifold-level descriptions recover overarching semantic structure that SAE features miss.
Positive claim that geometric descriptions retain the conceptual coherence lost in atomized feature decompositions.
Geometry arises from optimization pressure on networks trained on structured data.
Mechanistic explanation: geometric structure emerges naturally from standard training on data with underlying structure.
Networks encode structured geometric concepts that reflect external reality.
Core claim of the paper: the right level of description for neural representations is geometric structure mirroring the world.
Curved manifolds often represent concepts better than linear directions.
Proposes that nonlinear geometric structure is superior to linear feature spaces for capturing semantic content.
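
The "shattering" claim can be illustrated with a toy example. The sketch below is an assumption-laden stand-in, not a trained SAE and not the paper's experiment: it uses a hand-built dictionary of evenly spaced unit directions with a ReLU encoder, applied to points on a unit circle that encode a single continuous variable (the angle). Sparsity is enforced by the bias, so each feature fires only on a short arc.

```python
import numpy as np

n_features = 16

# Points densely sampled from a unit circle: one continuous variable (the angle).
theta = np.linspace(0.0, 2 * np.pi, 720, endpoint=False)
points = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# Assumption: hand-built stand-in for a sparse dictionary (NOT a trained SAE).
# Evenly spaced unit directions, ReLU encoder, bias chosen to enforce sparsity.
centers = 2 * np.pi * np.arange(n_features) / n_features
directions = np.stack([np.cos(centers), np.sin(centers)], axis=1)
bias = np.cos(1.5 * np.pi / n_features)   # fires within 1.5*pi/n_features of its center
codes = np.maximum(points @ directions.T - bias, 0.0)

active = codes > 0
per_point = active.sum(axis=1)    # number of features active for each input
coverage = active.mean(axis=0)    # fraction of the circle each feature responds to

print(f"active features per point: min={per_point.min()}, max={per_point.max()}")
print(f"mean fraction of circle covered per feature: {coverage.mean():.3f}")
```

Each input is described by only one or two features, which is what sparsity asks for, but no single feature spans more than a small arc, so the single underlying variable is spread across 16 features that look unrelated when inspected one at a time. Whether trained SAEs on real model activations behave this way is the paper's empirical claim, which this toy does not test.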

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

Steering Along Manifolds to Control Neural Networks
in corpus
≈ 87%
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
in corpus
2026
≈ 87%
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Daniel Wurgaft, Can Rager, Matthew Kowal, Vasudev Shyam, Sheridan Feucht, Usha Bhalla, Tal Haklay, Eric Bigelow, Raphael Sarfati, Thomas McGrath, Owen Lewis, Jack Merullo, Noah Goodman, Thomas Fel, Atticus Geiger, Ekdeep Singh Lubana
2026
≈ 86%
Neural World Models for Computer Vision
Anthony Hu
2023
≈ 84%
Beyond Geometry: Comparing the Temporal Structure of Computation in Neural Circuits with Dynamical Similarity Analysis
Mitchell Ostrow, Adam Eisen, Leo Kozachkov, Ila Fiete
2023
≈ 83%
AI and World Models
Robert Worden
2026
≈ 82%
Conceptual Views of Neural Networks: A Framework for Neuro-Symbolic Analysis
Johannes Hirth and Tom Hanika
2026
≈ 82%
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
in corpus
2026
≈ 82%
Neural Manifolds as Crystallized Embeddings: A Synthesis of the Free Energy Principle, Generalized Synchronization, and Hebbian Plasticity
Vikas N. O'Reilly-Shah
2026
≈ 82%
VISTA: A Panoramic View of Neural Representations
Tom White
2024
≈ 82%
Addressing divergent representations from causal interventions on neural networks
in corpus
2025
≈ 82%
Deep Neuroevolution of Recurrent and Discrete World Models
Sebastian Risi, Kenneth O. Stanley
2019
≈ 81%
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
in corpus
2023
≈ 81%
Causal Probing for Internal Visual Representations in Multimodal Large Language Models
Zehao Deng, Tianjie Ju, Zheng Wu, Liangbo He, Jun Lan, Huijia Zhu, Weiqiang Wang, Zhuosheng Zhang
2026
≈ 81%
The Platonic Representation Hypothesis
in corpus
2024
≈ 81%
Interpreting Language Models Through Concept Descriptions: A Survey
Nils Feldhus, Laura Kopf
2026
≈ 81%
From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs
Jiyuan An, Liner Yang, Mengyan Wang, Luming Lu, Weihua An, Erhong Yang
2026
≈ 81%
Human Cognition in Machines: A Unified Perspective of World Models
Timothy Rupprecht, Pu Zhao, Amir Taherin, Arash Akbari, Arman Akbari, Yumei He, Sean Duffy, Juyi Lin, Yixiao Chen, Rahul Chowdhury, Enfu Nan, Yixin Shen, Yifan Cao, Haochen Zeng, Weiwei Chen, Geng Yuan, Jennifer Dy, Sarah Ostadabbas, Silvia Zhang, David Kaeli, Edmund Yeh, Yanzhi Wang
2026
≈ 81%
On the Road with 16 Neurons: Mental Imagery with Bio-inspired Deep Neural Networks
Alice Plebe and Mauro Da Lio
2026
≈ 81%
From Transformer to Biology: A Hierarchical Model for Attention in Complex Problem-Solving
Zhongqiao Lin, Yunwei Li, Tianming Yang
2025
≈ 81%
The Propagation Field: A Geometric Substrate Theory of Deep Learning
Xingrui Gu
2026
≈ 81%
Technological Approach to Mind Everywhere: An Experimentally-Grounded Framework for Understanding Diverse Bodies and Minds
in corpus
2022
≈ 81%
Zoom In: An Introduction to Circuits
in corpus
2020
≈ 80%
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
in corpus
2025
≈ 80%
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
in corpus
2023
≈ 79%
Learning without neurons in physical systems
in corpus
2022
≈ 79%

Cross-corpus bridges (2)

same_concept_as · Nomic cosine

External markdown files that talk about the same concept as this entity.

zen · MANIFOLD-ALIGNMENT · applied/kobun-shape-manifold.md · 0.762
zen · PROJECT-EIGENQUESTIONS · applied/kobun-shape-eigenquestions.md · 0.756