Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

By Daniel Wurgaft · Can Rager · Matthew Kowal · Vasudev Shyam · Sheridan Feucht · Usha Bhalla + 10 more
Goodfire, Harvard University + 2 more

DOI 10.48550/arxiv.2605.05115 arXiv 2605.05115 OpenAlex W7160544910

Manifold geometry for steering networks · Manifold geometry for neural network steering · Neural Geometry · in-context learning (ICL) · manifold learning · activation manifold fitting (M_h) · Language Model · behavior manifold fitting (M_y) · Manifold Steering · optimization of interventions to follow behavior manifold M_y · off-manifold regions · Representation Steering · sequential reasoning tasks · unnatural outputs · Video World Model

TL;DR

Manifold steering intervenes on model activations along paths constrained to a learned activation manifold M_h, rather than along straight lines in Euclidean activation space. The resulting behavioral trajectories track the corresponding behavior manifold M_y, whereas linear steering, which assumes a Euclidean geometry, cuts through off-manifold regions and generates unnatural outputs. The paper fits M_h to internal representations and M_y to output probability distributions, then tests the link M_h ↔ M_y in both directions via controlled interventions in language models and a video world model. In language models, reasoning tasks with cyclic and sequential geometries and in-context learning tasks with more complex graph geometries all show manifold steering keeping behavioral outputs on M_y where linear steering leaves it. In the video world model, the same holds for a task whose geometry corresponds to physical dynamics. The relationship is bidirectional: optimizing interventions in activation space to produce paths along M_y recovers activation trajectories that trace the curvature of M_h. The authors conclude that the core problem of steering should be recast from finding the right direction in a flat activation space to finding the right geometry, because representational geometry is not incidental to behavior but is the proper object for principled control via intervention on internals.
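The difference between the two kinds of path can be made concrete with a toy example. The sketch below is not the paper's procedure: it assumes, purely for illustration, that M_h is the unit circle in a two-dimensional activation space (a stand-in for a cyclic geometry). The names `on_circle`, `linear_path`, `manifold_path`, and `off_manifold_distance` are hypothetical.

```python
import math

def on_circle(theta):
    """Point on the assumed activation manifold M_h (the unit circle)."""
    return (math.cos(theta), math.sin(theta))

def linear_path(h_a, h_b, steps):
    """Euclidean steering: straight-line interpolation between two activations."""
    return [tuple((1 - i / steps) * a + (i / steps) * b for a, b in zip(h_a, h_b))
            for i in range(steps + 1)]

def manifold_path(theta_a, theta_b, steps):
    """Manifold steering: interpolate the manifold coordinate, then map to activations."""
    return [on_circle(theta_a + (theta_b - theta_a) * i / steps)
            for i in range(steps + 1)]

def off_manifold_distance(h):
    """Distance from an activation to the unit circle."""
    return abs(math.hypot(*h) - 1.0)

# Steer between two on-manifold activations that are 120 degrees apart.
theta_a, theta_b = 0.0, 2 * math.pi / 3
lin = linear_path(on_circle(theta_a), on_circle(theta_b), steps=10)
man = manifold_path(theta_a, theta_b, steps=10)

max_linear_gap = max(off_manifold_distance(h) for h in lin)
max_manifold_gap = max(off_manifold_distance(h) for h in man)
print(f"linear: {max_linear_gap:.3f}  manifold: {max_manifold_gap:.3f}")
```

Under these assumptions the midpoint of the straight-line path sits at half the manifold radius, while the path along the circle never leaves it. Real activation manifolds are fitted from data and have no closed-form parametrization, so this only illustrates the geometry of the argument.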

What to take away

1. Manifold steering — intervening along paths on the learned activation manifold M_h — produces behavioral outputs that remain on the behavior manifold M_y, whereas linear Euclidean steering consistently exits M_y and produces off-distribution outputs.
2. The correspondence between M_h and M_y is bidirectional: optimizing activation interventions to stay on M_y recovers trajectories that trace the curvature of M_h, not just the reverse direction.
3. Experiments span language models, tested on reasoning tasks with cyclic and sequential geometries and on in-context learning tasks with graph geometries, plus a video world model tested on a physical-dynamics task, demonstrating cross-modal generality of the M_h ↔ M_y link.
4. In-context learning tasks in language models exhibit more complex graph geometries in activation space; manifold steering respects this structure while linear steering does not.
5. The paper introduces the manifold steering method, which fits M_h to activation representations and M_y to output probability distributions, then intervenes along paths on M_h rather than adding a fixed linear vector.
6. Linear steering, the dominant paradigm in prior activation-editing work, implicitly assumes a Euclidean (flat) geometry; in the tasks the paper tests, this assumption sends interventions through off-manifold regions.
7. An open question the paper raises is whether the specific geometric structures identified (cyclic, sequential, graph, physical-dynamics) exhaust the relevant manifold families for real-world tasks, or whether more complex topologies will require new fitting procedures.
8. To replicate the methodology, researchers should fit a manifold to a model's intermediate-layer activations across a distribution of task inputs, fit a separate manifold to the model's output probability distributions over the same inputs, and then compare behavioral drift under manifold-constrained versus linear interventions.
9. The video world model experiments establish that the M_h ↔ M_y relationship extends beyond language, holding for geometry corresponding to physical dynamics in a visually grounded setting.
10. The paper argues this recasts the core problem of model steering from vector search — finding the right direction — to geometric search — finding the right manifold structure — with direct implications for interpretability-based control methods.
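The replication recipe in item 8 can be sketched end to end on a toy model. Everything here is an assumption made for illustration and not the paper's setup: M_h is the unit circle, the model head is a softmax over four hypothetical readout directions (`READOUT`, scaled by `BETA`), "fitting" a manifold means storing a dense sample of points, and behavioral drift is measured as total variation distance to the nearest sampled point of M_y.

```python
import math

BETA = 3.0  # assumed logit scale of the toy readout
# Hypothetical readout: four output classes with logit directions at 0, 90, 180, 270 degrees.
READOUT = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]

def on_circle(theta):
    return (math.cos(theta), math.sin(theta))

def behavior(h):
    """Toy model head: softmax over BETA * (readout direction . h)."""
    logits = [BETA * (u[0] * h[0] + u[1] * h[1]) for u in READOUT]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Step 1: "fit" M_h as a dense sample of on-manifold activations (1-degree grid).
m_h = [on_circle(2 * math.pi * j / 360) for j in range(360)]
# Step 2: "fit" M_y as the model's output distributions over the same inputs.
m_y = [behavior(h) for h in m_h]

def drift(h):
    """Distance from the steered output to the nearest sampled point of M_y."""
    y = behavior(h)
    return min(total_variation(y, ref) for ref in m_y)

# Step 3: steer from 0 to 120 degrees both ways and compare behavioral drift.
steps = 10
h_a, h_b = m_h[0], m_h[120]
linear = [((1 - i / steps) * h_a[0] + (i / steps) * h_b[0],
           (1 - i / steps) * h_a[1] + (i / steps) * h_b[1])
          for i in range(steps + 1)]
manifold = m_h[0:121:12]  # 11 points walked along the fitted manifold

linear_drift = max(drift(h) for h in linear)
manifold_drift = max(drift(h) for h in manifold)
print(f"max drift, linear: {linear_drift:.3f}  manifold: {manifold_drift:.3f}")
```

In this toy the manifold path has zero drift by construction, because it walks through the same points used to build M_y. That makes the comparison easy to read, but it is also the circularity a replication would need to guard against by evaluating on inputs not used for fitting.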

Peer brief — for seminar discussion

Wurgaft et al. (2026) take on a foundational question in mechanistic interpretability: does the geometric structure visible in neural activation spaces causally constrain behavior, or is it epiphenomenal? To test this, they develop manifold steering, a method that fits an activation manifold M_h to a model's intermediate representations and a behavior manifold M_y to its output probability distributions, then performs interventions along paths on M_h rather than along straight Euclidean lines. The key experimental move is measuring whether behavioral trajectories induced by activation interventions stay on M_y (natural) or cut through off-manifold regions (unnatural). Alternative approaches the paper could have used, but does not, are distributed interchange intervention (DII) and causal abstraction probing, which also test causal structure but without the manifold-fitting step. The load-bearing finding is a demonstrated bidirectional correspondence: manifold steering keeps outputs on M_y across cyclic-geometry tasks, sequential-geometry tasks, and graph-geometry in-context learning tasks in language models, and in a physical-dynamics task in a video world model, while linear Euclidean steering cuts through off-manifold regions. The bidirectionality is shown empirically by running the test in reverse: optimizing interventions in activation space to produce paths along M_y recovers activation trajectories that trace the curvature of M_h. This holds across four distinct geometric task families and two modalities (language and video). The implication the paper argues for is a reframing of the steering problem itself: not "find the right direction" but "find the right geometry." This has practical consequences for activation editing, representation engineering, and any method that assumes linearity when composing concept vectors.
The prediction embedded in the work is that geometry-respecting interventions will generalize to novel tasks precisely to the extent that M_h and M_y share curvature structure — a hypothesis that remains to be tested on open-ended generation or adversarial inputs. A critical reader should push back on the manifold-fitting procedure itself: M_h and M_y are estimated from finite samples of activations and output distributions over a task-specific input distribution, which means the quality of the fitted manifolds — and thus the apparent success of manifold steering — is contingent on whether the evaluation distribution matches the fitting distribution. If M_h is fit on in-distribution task trajectories, it is unsurprising that interventions constrained to M_h stay on M_y; the interesting test would be whether manifold steering generalizes when M_h is fit on one task geometry (e.g., cyclic) and interventions are evaluated on a structurally different geometry (e.g., graph). The paper's cross-modal results are suggestive, but within each task the fitting and evaluation regimes appear to be aligned, which risks overstating the causal claim.
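The concern about matching fitting and evaluation distributions can be illustrated with a coverage check. This is a minimal sketch under stated assumptions, not an analysis from the paper: the true manifold is taken to be the unit circle, the "fitted" manifold is a point cloud sampled from only its upper half, and `distance_to_fit` is a hypothetical helper.

```python
import math

def on_circle(theta):
    return (math.cos(theta), math.sin(theta))

def distance_to_fit(h, fitted_points):
    """Euclidean distance from h to the nearest point of the fitted cloud."""
    return min(math.dist(h, p) for p in fitted_points)

# Fit set: activations sampled only from the upper half of the circle, 1 degree apart.
fitted = [on_circle(math.radians(d)) for d in range(0, 180)]

# In-distribution evaluation: unseen angles that fall inside the fitted range.
in_dist = [on_circle(math.radians(d + 0.5)) for d in range(0, 179)]
# Out-of-distribution evaluation: angles from the lower half, never seen during fitting.
out_dist = [on_circle(math.radians(d)) for d in range(200, 340)]

in_gap = max(distance_to_fit(h, fitted) for h in in_dist)
out_gap = max(distance_to_fit(h, fitted) for h in out_dist)
print(f"in-distribution gap: {in_gap:.4f}  out-of-distribution gap: {out_gap:.4f}")
```

Points that lie on the true manifold but outside the sampled region look far off-manifold to the fitted estimate. An evaluation restricted to the fitted region would never expose this, which is the gap the critique points at.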

Methods (3)

activation manifold fitting (M_h)
Method to fit a manifold M_h to neural representations in activation space.
behavior manifold fitting (M_y)
Method to fit a manifold M_y to output probability distributions.
optimization of interventions to follow behavior manifold M_y
Method that optimizes activation interventions so that resulting behaviors trace M_y, recovering activation paths that follow M_h curvature.
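The third method, optimizing interventions so that behavior follows M_y, can be sketched on a toy model. The assumptions are illustrative and not taken from the paper: M_h is the unit circle, the model head is a softmax over four hypothetical readout directions (`READOUT`, scaled by `BETA`), and each intervention is found by plain gradient descent on cross-entropy to a target output. `recover_activation` is a hypothetical name.

```python
import math

BETA = 3.0  # assumed logit scale of the toy readout
# Hypothetical readout: four output classes with logit directions at 0, 90, 180, 270 degrees.
READOUT = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]

def behavior(h):
    """Toy model head: softmax over BETA * (readout direction . h)."""
    logits = [BETA * (u[0] * h[0] + u[1] * h[1]) for u in READOUT]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def recover_activation(y_target, h_init, lr=0.1, steps=2000):
    """Gradient descent on cross-entropy(y_target, behavior(h)) with respect to h."""
    h = list(h_init)
    for _ in range(steps):
        p = behavior(h)
        # d(cross-entropy)/dh = BETA * W^T (p - y_target), where W stacks READOUT rows.
        grad = [BETA * sum(u[d] * (pk - yk) for u, pk, yk in zip(READOUT, p, y_target))
                for d in range(2)]
        h = [h[d] - lr * grad[d] for d in range(2)]
    return tuple(h)

# Target behavioral path: outputs at evenly spaced points along M_y (0 to 120 degrees).
n = 10
thetas = [(2 * math.pi / 3) * i / n for i in range(n + 1)]
targets = [behavior((math.cos(t), math.sin(t))) for t in thetas]

# Initialize every intervention on the straight chord between the two endpoints.
h_a = (math.cos(thetas[0]), math.sin(thetas[0]))
h_b = (math.cos(thetas[-1]), math.sin(thetas[-1]))
chord = [((1 - i / n) * h_a[0] + (i / n) * h_b[0],
          (1 - i / n) * h_a[1] + (i / n) * h_b[1]) for i in range(n + 1)]

recovered = [recover_activation(y, h0) for y, h0 in zip(targets, chord)]
init_gap = max(abs(math.hypot(*h) - 1.0) for h in chord)
recovered_gap = max(abs(math.hypot(*h) - 1.0) for h in recovered)
print(f"off-manifold gap before: {init_gap:.3f}  after: {recovered_gap:.2e}")
```

The optimized activations move from the chord back onto the circle. Note that this toy readout is invertible by construction, so the recovery is guaranteed here; the paper's finding is that the same kind of recovery is observed in trained models, where no such guarantee exists.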

Frameworks (1)

manifold learning
Technique used to fit M_h and M_y from data; enables manifold steering.

Findings (5)

Interventions along activation manifold M_h yield behavioral trajectories following behavior manifold M_y, and vice versa — bidirectional relationship demonstrated across language models and video world models.
Central empirical result showing causal coupling between representation and behavior geometry across multiple substrates and modalities.
Manifold steering demonstrates bidirectional geometry-behavior link in a video world model on tasks with geometry corresponding to physical dynamics
Extension of manifold steering validation to video world models and physical dynamics tasks, demonstrating cross-modal generality
Optimizing interventions in activation space to produce paths along M_y recovers activation trajectories that trace the curvature of M_h.
Demonstrates bidirectional causal link: behavior manifold geometry can be recovered by optimizing in representation space.
Steering along M_h yields behavioral trajectories that follow M_y, producing more natural outputs than linear steering
Core empirical result demonstrating the superiority of manifold steering over linear steering
Cross-task and cross-modal validation of manifold steering
The paper demonstrates the bidirectional geometry-behavior relationship across multiple tasks and modalities (language models and video world models)

Claims (6)

Geometry in neural representation is not merely incidental, but is in fact the proper object for enabling principled control via intervention on internals.
Core interpretive assertion: geometric structure is causally load-bearing, not epiphenomenal.
There exists a bidirectional relationship between the geometry of neural representation and the geometry of model behavior
Central empirical claim of the paper, demonstrated across tasks and modalities
There is a bidirectional relationship between the geometry of representation and behavior across tasks and modalities.
Authors' interpretive claim that the shared geometry is general and robust.
The core problem of steering should be recast from finding the right direction to finding the right geometry
The paper's programmatic conclusion about how the field should reconceptualize neural network steering
Linear steering cuts through off-manifold regions and hence produces unnatural outputs.
Attribution of failure to Euclidean assumption.
Geometric structure of neural representations causally shapes model behavior
The paper's core causal assertion: geometry is not incidental but mechanistically linked to behavior

Hypotheses (2)

We hypothesize that interventions that respect the geometry of activation space will yield behaviors close to those the model exhibits naturally
The core testable hypothesis driving the experimental design
Neural representation geometry causally shapes behavior; interventions respecting that geometry will yield natural trajectories.
Central hypothesis tested via manifold steering experiments across language models and video world models.

Questions (4)

Does the geometric structure of activation space causally shape neural network behavior?
Central research question driving the work.
What is the right geometry for enabling principled steering of neural network behavior?
The reframed steering problem the paper introduces
Does the geometric structure of neural representations causally shape model behavior?
The motivating research question of the paper
Neural representations carry rich geometric structure; but does that structure causally shape behavior?
Opening question: does the rich geometric structure of neural representations have a causal role in behavior?

Original abstract

Neural representations carry rich geometric structure; but does that structure causally shape behavior? To address this question, we intervene along paths through activation space defined by different geometries, and measure the behavioral trajectories they induce. In particular, we test whether interventions that respect the geometry of activation space will yield behaviors close to those the model exhibits naturally. Concretely, we first fit an activation manifold $M_h$ to representations and a behavior manifold $M_y$ to output probability distributions. We then test the link $M_h \leftrightarrow M_y$ via interventions: we find that steering along $M_h$, which we term manifold steering, yields behavioral trajectories that follow $M_y$, while linear steering -- which assumes a Euclidean geometry -- cuts through off-manifold regions and hence produces unnatural outputs. Moreover, optimizing interventions in activation space to produce paths along $M_y$ recovers activation trajectories that trace the curvature of $M_h$. We demonstrate this bidirectional relationship between the geometry of representation and behavior across tasks and modalities. In language models, we use reasoning tasks with cyclic and sequential geometries as well as in-context learning tasks with more complex graph geometries. In a video world model, we use a task with geometry corresponding to physical dynamics. Overall, our work shows that geometry in neural representation is not merely incidental, but is in fact the proper object for enabling principled control via intervention on internals. This recasts the core problem of steering from finding the right direction to finding the right geometry.

