paper
active
2026
paper:geiger-goodfire-world-inside-neural-networks-2026

The World Inside Neural Networks

By Atticus Geiger · Ekdeep Singh Lubana · Thomas Fel · Jack Merullo · Michael Jae Byun · Owen Lewis · Thomas McGrath

TL;DR

Neural networks trained on structured data develop internal *neural geometries* that mirror the geometric structure of the external world — circular manifolds for numerical sequences, spatial manifolds for positional concepts — and this manifold-level description is more faithful than the sparse linear decompositions produced by Sparse Autoencoders. The Mountain Car environment provides the load-bearing demonstration: car position is encoded as a 1D curved manifold, and linear interventions that ignore this curvature cross activation-space 'voids,' producing incoherent behavior, while interventions that follow the manifold curve yield smooth, coherent control. Goodfire's framing paper introduces *neural geometry* as a principled descriptive substrate and argues directly against SAE-based interpretability, claiming SAE features 'shatter' manifolds into many small, apparently-unrelated pieces that obscure overarching semantic structure. Authored by Geiger, Lubana, Fel, Merullo, Byun, Lewis, and McGrath, the paper positions itself as the conceptual anchor for a coordinated May 2026 trio alongside the Feucht geometric-calculator and Wurgaft manifold-steering papers. The implication is a reorientation of the interpretability program: rather than decomposing representations into ever-more-atomic features, the field should recover fewer, geometrically-accurate objects that track real-world structure — a shift with direct consequences for steering, control, and what counts as a successful mechanistic explanation.
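As a toy illustration of what a circular manifold for a numerical sequence buys you, the sketch below places the integers mod n on a unit circle and shows that moving along the circle (a rotation) implements modular addition. Everything here is an assumption made for illustration: the 2D embedding, the choice n = 7, and the helper names `embed`, `rotate`, and `decode` are hypothetical and are not taken from the paper.

```python
import numpy as np

n = 7  # hypothetical cycle length, e.g. days of the week

def embed(k, n):
    """Place integer k on the unit circle at angle 2*pi*k/n."""
    angle = 2 * np.pi * k / n
    return np.array([np.cos(angle), np.sin(angle)])

def rotate(x, steps, n):
    """Move a point along the circle by `steps` positions."""
    angle = 2 * np.pi * steps / n
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    return rot @ x

def decode(x, n):
    """Read back the nearest integer position on the circle."""
    angle = np.arctan2(x[1], x[0]) % (2 * np.pi)
    return int(np.round(angle * n / (2 * np.pi))) % n

# Following the manifold by 3 steps from position 5 wraps around the cycle.
result = decode(rotate(embed(5, n), 3, n), n)
print(result == (5 + 3) % n)
```

The point of the sketch is only that a closed 1D curve makes wrap-around arithmetic a smooth movement along the curve, which a single linear direction cannot express.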

What to take away

  1. Car position in the Mountain Car environment is encoded as a 1D curved manifold in activation space, not as a linear direction.
  2. Linear interventions on the Mountain Car position representation cross activation-space 'voids,' producing incoherent model behavior, while manifold-following interventions produce smooth control.
  3. SAE features 'shatter' curved manifolds into many small, apparently-unrelated pieces, causing SAE-based interpretability to obscure overarching semantic structure rather than reveal it.
  4. Optimization pressure on networks trained on geometrically structured data is sufficient to produce internal manifolds that mirror real-world geometry; the geometry arises from the training signal, not architectural induction.
  5. Numerical concepts are represented as approximately circular (1D closed) manifolds, consistent with the cyclic structure of modular arithmetic and calendar-like sequences.
  6. The paper was published 2026-05-07 as the first of a coordinated three-paper release from Goodfire, framing the empirical results of the Feucht (geometric calculator) and Wurgaft (manifold steering) companion papers.
  7. The authors include Atticus Geiger, Ekdeep Singh Lubana, Thomas Fel, Jack Merullo, Michael Jae Byun, Owen Lewis, and Thomas McGrath, spanning Goodfire and external collaborators.
  8. An open question raised is whether manifold-level geometric descriptions can be made fully compositional, i.e., whether intersecting or combining manifolds preserves the interpretable structure needed for systematic control.
  9. To replicate the Mountain Car intervention comparison, a researcher would train a policy network on the Mountain Car environment, identify the position-encoding layer via probing, fit a 1D manifold to those activations, and compare behavioral coherence between linear and manifold-constrained intervention paths.
  10. The paper predicts that shifting interpretability methodology from atomic SAE features to manifold-level geometric objects will improve both the fidelity of mechanistic explanations and the reliability of activation-steering interventions in larger language models such as Llama-3.
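The last two steps of the replication recipe in the list above (fit a 1D manifold, then compare linear and manifold-constrained intervention paths) can be sketched on synthetic data. This is a minimal sketch under stated assumptions, not the paper's procedure: no policy network is trained, the "activations" are synthetic points on a noisy arc embedded in a hypothetical 16-dimensional space, the manifold fit is a simple bin-average curve, and off-manifold distance stands in for behavioral incoherence. The position range [-1.2, 0.6] is that of the classic Mountain Car task; all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for position-encoding activations (assumption, not real data).
D = 16                                             # hypothetical hidden width
positions = rng.uniform(-1.2, 0.6, 2000)           # Mountain Car position range
theta = (positions + 1.2) / 1.8 * np.pi            # position -> angle in [0, pi]
basis, _ = np.linalg.qr(rng.normal(size=(D, 2)))   # orthonormal 2D plane in R^D
acts = np.stack([np.cos(theta), np.sin(theta)], axis=1) @ basis.T
acts += 0.01 * rng.normal(size=acts.shape)         # small activation noise

def fit_manifold(acts, positions, n_knots=50):
    """Fit a piecewise-linear 1D curve by averaging activations in position bins."""
    edges = np.linspace(positions.min(), positions.max(), n_knots + 1)
    idx = np.clip(np.digitize(positions, edges) - 1, 0, n_knots - 1)
    knots = np.stack([acts[idx == k].mean(axis=0) for k in range(n_knots)])
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, knots

def manifold_point(p, centers, knots):
    """Interpolate each activation coordinate along the fitted curve at position p."""
    return np.array([np.interp(p, centers, knots[:, d]) for d in range(knots.shape[1])])

def off_manifold_distance(x, knots):
    """Distance from x to the nearest knot of the fitted curve."""
    return np.linalg.norm(knots - x, axis=1).min()

centers, knots = fit_manifold(acts, positions)
p_start, p_end = -1.0, 0.4
a_start = manifold_point(p_start, centers, knots)
a_end = manifold_point(p_end, centers, knots)

ts = np.linspace(0.0, 1.0, 21)
# Linear intervention: straight line between the two activation vectors.
linear_path = [(1 - t) * a_start + t * a_end for t in ts]
# Manifold-following intervention: interpolate the position, then map onto the curve.
manifold_path = [manifold_point((1 - t) * p_start + t * p_end, centers, knots) for t in ts]

linear_gap = max(off_manifold_distance(x, knots) for x in linear_path)
manifold_gap = max(off_manifold_distance(x, knots) for x in manifold_path)
print(f"max off-manifold distance, linear path:   {linear_gap:.3f}")
print(f"max off-manifold distance, manifold path: {manifold_gap:.3f}")
```

In this toy setting the straight-line path cuts through the interior of the arc, where no training activations lie, while the manifold-following path stays near observed activations. A real replication would replace the distance metric with a behavioral measure from the policy network.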

Peer brief — for seminar discussion

Published May 7, 2026, this Goodfire research article by Geiger, Lubana, Fel, Merullo, Byun, Lewis, and McGrath serves as the conceptual manifesto for a coordinated three-paper release, with the Feucht geometric-calculator paper and Wurgaft manifold-steering paper providing empirical support. The core move is to argue that neural networks trained on structured data develop internal *neural geometries* (curved manifolds in activation space) that reflect the geometric structure of the domain, and that these manifold-level descriptions should replace sparse linear decompositions (particularly SAE features) as the primary unit of mechanistic interpretability.

The load-bearing empirical anchor is the Mountain Car environment: car position is encoded as a 1D curved manifold, and this matters operationally because linear interventions cross activation-space voids and produce incoherent behavior, while interventions constrained to follow the manifold produce smooth, coherent control. Numerical concepts are described as approximately circular manifolds, and the paper claims geometry emerges from optimization pressure on structured training data rather than from any architectural prior. The method introduced is *neural geometry* as a descriptive framework (identifying, fitting, and intervening along curved manifolds), in contrast to the alternative of SAE-based sparse decomposition, which the paper argues 'shatters' manifolds into many small, apparently-unrelated features that obscure semantic structure. The predictive claim is that adopting manifold-level descriptions will improve both interpretability fidelity and steering reliability in larger systems, including production-scale language models like Llama-3.

The most contestable element for a critical reader is the scope of the Mountain Car result: the environment is low-dimensional, the policy network is small, and the manifold is genuinely 1D, making it a nearly ideal case for the framework. Whether the same manifold-following logic scales to the high-dimensional, polysemantic, entangled representations found in large transformers, where manifolds are unlikely to be cleanly separable, is left as an open question rather than demonstrated. A skeptic would also push back on the SAE critique's framing: shattering could reflect genuine polysemy rather than a failure of the decomposition method, and it is not shown that manifold descriptions are more causally complete than SAE features, only that they are more geometrically coherent. The paper does not provide quantitative benchmarks comparing manifold-steering to SAE-based steering on a shared task, which would be the most direct way to adjudicate the claim.
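To make the 'shattering' claim concrete for discussion, the toy sketch below codes points on a single circle with a hand-built dictionary of K directions and a top-1 sparsity rule, so each point activates only its best-matching atom. This is an assumption-laden stand-in, not a trained SAE and not the paper's experiment: the dictionary size, the evenly spaced atoms, and the top-1 rule are all hypothetical choices made to show the effect in its simplest form.

```python
import numpy as np

K = 12  # hypothetical dictionary size

# One circular manifold, sampled at 360 evenly spaced angles.
angles = np.linspace(0.0, 2 * np.pi, 360, endpoint=False)
points = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Hand-built dictionary: K unit-norm atoms evenly spaced around the circle.
atom_angles = 2 * np.pi * np.arange(K) / K
atoms = np.stack([np.cos(atom_angles), np.sin(atom_angles)], axis=1)

# Top-1 sparse code: each point activates only the atom with the largest dot product.
active = (points @ atoms.T).argmax(axis=1)

features_used = np.unique(active).size
arc_fraction = np.bincount(active, minlength=K) / len(points)
print(f"{features_used} features are needed to cover one circle")
print("fraction of the circle covered by each feature:", np.round(arc_fraction, 3))
```

Each atom fires on a short arc only, so a reader inspecting features one at a time would see K local pieces rather than one closed curve. Whether trained SAEs on real networks behave this way, and whether that reflects a method failure or genuine polysemy, is exactly the point the skeptic's objection above leaves open.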

Methods (1)

  • Sparse Autoencoders (SAE)
    Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.

Frameworks (1)

  • Neural Geometry Framework
    Conceptual scheme introduced in this paper: neural networks develop internal geometric representations that mirror real-world geometry, providing the right level of description for interpretability and control.

Cross-corpus bridges (2)

same_concept_as · Nomic cosine

External markdown files that talk about the same concept as this entity.

  • zen
    MANIFOLD-ALIGNMENT · applied/kobun-shape-manifold.md · 0.762
  • zen
    PROJECT-EIGENQUESTIONS · applied/kobun-shape-eigenquestions.md · 0.756