2026
paper:doi-10-48550-arxiv-2605-05115

Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

TL;DR

Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectories that track the corresponding behavior manifold M_y, while linear (Euclidean) steering cuts through off-manifold regions and generates unnatural outputs. The paper fits M_h to internal representations and M_y to output probability distributions, then tests their bidirectional correspondence via controlled interventions across language models and a video world model. In language models, tasks with cyclic geometries, sequential geometries, and complex graph geometries (in-context learning) all show that manifold-constrained interventions keep behavioral outputs on M_y, whereas linear steering deviates measurably. In a video world model, interventions shaped by physical-dynamics geometry similarly respect M_y. Crucially, the relationship is bidirectional: optimizing interventions in activation space to produce paths along M_y recovers activation trajectories that trace the curvature of M_h. This implies that the core problem of mechanistic steering should be recast not as finding the right direction in a flat Euclidean activation space, but as identifying the correct geometric structure — because representational geometry is not incidental to model behavior but is causally constitutive of it.
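
The reverse direction described above (optimizing activations so that outputs follow M_y, then checking whether the recovered activations trace M_h) can be illustrated on a toy problem. The following is a minimal sketch under loud assumptions: the unit circle stands in for M_h, a hypothetical six-class softmax readout `W` stands in for the model, and plain gradient descent stands in for whatever optimizer the paper uses. None of these names or choices come from the paper.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical readout: 6 classes whose preferred directions are evenly spaced on a circle.
phi = 2.0 * np.pi * np.arange(6) / 6
W = 3.0 * np.stack([np.cos(phi), np.sin(phi)], axis=1)   # shape (6, 2)

# Target behavior path: outputs produced by activations on the unit circle (toy M_y).
theta = np.linspace(0.0, 0.75 * np.pi, 11)
on_manifold_h = np.stack([np.cos(theta), np.sin(theta)], axis=1)
targets = softmax(on_manifold_h @ W.T)

# Optimize free activations, started at the origin (off the circle), to match the targets.
h = np.zeros_like(on_manifold_h)
for _ in range(2000):
    grad = (softmax(h @ W.T) - targets) @ W   # gradient of cross-entropy w.r.t. h
    h -= 0.05 * grad

# If the recovered activations trace the toy M_h, their radius should be close to 1.
recovered_radius = np.linalg.norm(h, axis=1)
```

In this toy the readout is invertible, so recovering the circle is expected by construction; the sketch only shows the shape of the test, not evidence for the paper's claim.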

What to take away

  1. Manifold steering (intervening along paths on the learned activation manifold M_h) produces behavioral outputs that remain on the behavior manifold M_y, whereas linear Euclidean steering consistently exits M_y and produces off-distribution outputs.
  2. The correspondence between M_h and M_y is bidirectional: optimizing activation interventions to stay on M_y recovers trajectories that trace the curvature of M_h, not just the reverse direction.
  3. Experiments span language models tested on reasoning tasks with cyclic, sequential, and graph geometries, plus a video world model tested on a physical-dynamics task, demonstrating cross-modal generality of the M_h ↔ M_y link.
  4. In-context learning tasks in language models exhibit graph geometries in activation space, and manifold-constrained interventions respect this structure while linear steering does not.
  5. The paper introduces the manifold steering method, which fits M_h to activation representations and M_y to output probability distributions, then uses geodesic or manifold-respecting paths for activation intervention rather than additive linear vectors.
  6. Linear steering, the dominant paradigm in prior activation-editing work, is framed as a special case that implicitly assumes Euclidean (flat) geometry, an assumption the paper shows is empirically violated across all tested tasks.
  7. An open question the paper raises is whether the specific geometric structures identified (cyclic, sequential, graph, physical-dynamics) exhaust the relevant manifold families for real-world tasks, or whether more complex topologies will require new fitting procedures.
  8. To replicate the methodology, researchers should fit a manifold to a model's intermediate-layer activations across a distribution of task inputs, fit a separate manifold to the model's output probability distributions over the same inputs, and then compare behavioral drift under manifold-constrained versus linear interventions.
  9. The video world model experiments establish that the M_h ↔ M_y relationship extends beyond language, holding for geometry corresponding to physical dynamics in a visually grounded setting.
  10. The paper argues this recasts the core problem of model steering from vector search (finding the right direction) to geometric search (finding the right manifold structure), with direct implications for interpretability-based control methods.
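
The contrast between linear and manifold-constrained paths in the list above can be sketched on a toy cyclic geometry. This is an illustration only: the circle of radius 1 is an assumed stand-in for M_h, and `linear_path`, `circle_path`, and `off_manifold_distance` are hypothetical helpers, not the paper's code.

```python
import numpy as np

def linear_path(h0, h1, n_steps):
    """Euclidean (straight-line) interpolation between two activations."""
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1.0 - t) * h0 + t * h1

def circle_path(theta0, theta1, radius, n_steps):
    """Path along a circle of the given radius: a toy stand-in for steering along M_h."""
    theta = np.linspace(theta0, theta1, n_steps)
    return radius * np.stack([np.cos(theta), np.sin(theta)], axis=1)

def off_manifold_distance(points, radius):
    """Distance of each point from the circle (0 means the point is on the manifold)."""
    return np.abs(np.linalg.norm(points, axis=1) - radius)

radius = 1.0
theta0, theta1 = 0.0, 0.75 * np.pi
h0 = radius * np.array([np.cos(theta0), np.sin(theta0)])
h1 = radius * np.array([np.cos(theta1), np.sin(theta1)])

lin = linear_path(h0, h1, 11)                    # cuts through the interior of the circle
man = circle_path(theta0, theta1, radius, 11)    # stays on the circle

max_linear_drift = off_manifold_distance(lin, radius).max()
max_manifold_drift = off_manifold_distance(man, radius).max()
```

Both paths share the same endpoints, so any difference in drift comes from the intermediate points; the straight chord leaves the circle while the arc does not.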

Peer brief — for seminar discussion

Wurgaft et al. (2026) take on a foundational question in mechanistic interpretability: does the geometric structure visible in neural activation spaces causally constrain behavior, or is it epiphenomenal? To test this, they develop manifold steering, a method that fits an activation manifold M_h to a model's intermediate representations and a behavior manifold M_y to its output probability distributions, then performs interventions along paths constrained to M_h rather than along Euclidean linear vectors. The key experimental move is measuring whether the behavioral trajectories induced by activation interventions stay on M_y (natural) or cut through off-manifold regions (unnatural). Alternative approaches the paper could have used, but does not, are distributed interchange intervention (DII) and causal abstraction probing, which also test causal structure but without the manifold-fitting step.

The load-bearing finding is a bidirectional correspondence. Manifold steering keeps outputs on M_y across cyclic-geometry tasks, sequential-geometry tasks, and graph-geometry in-context learning tasks in language models, and on a physical-dynamics task in a video world model, while linear Euclidean steering leaves M_y in every setting. The reverse direction is shown empirically as well: optimizing interventions in activation space to produce paths along M_y recovers activation trajectories that trace the curvature of M_h. The result covers four geometric task families (cyclic, sequential, graph, physical dynamics) and two modalities (language and video).

The implication the paper argues for is a reframing of the steering problem itself: not "find the right direction" but "find the right geometry." This has practical consequences for activation editing, representation engineering, and any method that assumes linearity when composing concept vectors. The prediction embedded in the work is that geometry-respecting interventions will generalize to novel tasks precisely to the extent that M_h and M_y share curvature structure, a hypothesis that remains to be tested on open-ended generation or adversarial inputs.

A critical reader should push back on the manifold-fitting procedure itself. M_h and M_y are estimated from finite samples of activations and output distributions over a task-specific input distribution, so the quality of the fitted manifolds, and thus the apparent success of manifold steering, depends on whether the evaluation distribution matches the fitting distribution. If M_h is fit on in-distribution task trajectories, it is unsurprising that interventions constrained to M_h stay on M_y. The more interesting test would be whether manifold steering generalizes when M_h is fit on one task geometry (e.g., cyclic) and interventions are evaluated on a structurally different geometry (e.g., graph). The paper's cross-modal results are suggestive, but within each task the fitting and evaluation regimes appear to be aligned, which risks overstating the causal claim.
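
The measurement this brief describes, whether an induced behavioral trajectory stays on M_y, can be sketched as a nearest-sample distance in output space. The sketch below is a toy under stated assumptions: `toy_readout` is a hypothetical softmax readout, M_y is approximated by outputs sampled from on-circle activations, and total-variation distance is an assumed choice of metric. Note that the reference samples and the evaluated path come from the same circle, which is exactly the aligned fitting and evaluation regime the critique flags.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def toy_readout(h, n_classes=6, sharpness=3.0):
    """Hypothetical readout: class k prefers the direction at angle 2*pi*k/n_classes."""
    phi = 2.0 * np.pi * np.arange(n_classes) / n_classes
    W = sharpness * np.stack([np.cos(phi), np.sin(phi)], axis=1)   # (K, 2)
    return softmax(h @ W.T)

def behavioral_drift(outputs, reference):
    """Total-variation distance from each output to its nearest reference sample."""
    diff = np.abs(outputs[:, None, :] - reference[None, :, :]).sum(axis=-1)
    return 0.5 * diff.min(axis=1)

# Reference samples of the toy behavior manifold: outputs at on-circle activations.
ref_theta = np.deg2rad(np.arange(360))
ref_h = np.stack([np.cos(ref_theta), np.sin(ref_theta)], axis=1)
M_y_samples = toy_readout(ref_h)

# Two activation paths with shared endpoints: along the circle vs. straight line.
theta = np.linspace(0.0, 0.75 * np.pi, 11)
manifold_h = np.stack([np.cos(theta), np.sin(theta)], axis=1)
t = np.linspace(0.0, 1.0, 11)[:, None]
linear_h = (1.0 - t) * manifold_h[0] + t * manifold_h[-1]

manifold_drift = behavioral_drift(toy_readout(manifold_h), M_y_samples)
linear_drift = behavioral_drift(toy_readout(linear_h), M_y_samples)
```

A held-out variant would build `M_y_samples` from one geometry and evaluate paths from another, which is the cross-geometry test suggested above.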

Frameworks (1)

  • manifold learning
    Technique used to fit M_h and M_y from data; enables manifold steering.
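
As an illustration of what fitting a manifold to activations can look like, here is a minimal sketch for the simplest cyclic case. It assumes the activations lie near a circle embedded in a higher-dimensional space and fits it with PCA plus an average radius; `fit_circle_manifold` and `project_to_manifold` are hypothetical helpers, the data is synthetic, and the paper's actual fitting procedure may differ.

```python
import numpy as np

def fit_circle_manifold(acts):
    """Fit a circle: mean-center, keep the top-2 PCA plane, average the in-plane radius."""
    center = acts.mean(axis=0)
    _, _, vt = np.linalg.svd(acts - center, full_matrices=False)
    basis = vt[:2]                          # (2, D), orthonormal rows
    coords = (acts - center) @ basis.T      # (N, 2) in-plane coordinates
    radius = np.linalg.norm(coords, axis=1).mean()
    return center, basis, radius

def project_to_manifold(h, center, basis, radius):
    """Map activations to the nearest point on the fitted circle."""
    coords = (h - center) @ basis.T
    angle = np.arctan2(coords[..., 1], coords[..., 0])
    on_circle = radius * np.stack([np.cos(angle), np.sin(angle)], axis=-1)
    return center + on_circle @ basis

# Synthetic "activations": a radius-2 circle in a random 2D plane of a 16-D space, plus noise.
rng = np.random.default_rng(0)
D, N = 16, 400
plane, _ = np.linalg.qr(rng.normal(size=(D, 2)))          # (D, 2), orthonormal columns
theta = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
offset = rng.normal(size=D)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
acts = offset + 2.0 * (circle @ plane.T) + 0.01 * rng.normal(size=(N, D))

center, basis, radius = fit_circle_manifold(acts)
projected = project_to_manifold(acts, center, basis, radius)
residual = np.linalg.norm(acts - projected, axis=1).max()
```

Sequential or graph geometries would need a different parametrization; the circle is used here only because it has a closed-form projection.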

Original abstract

Neural representations carry rich geometric structure; but does that structure causally shape behavior? To address this question, we intervene along paths through activation space defined by different geometries, and measure the behavioral trajectories they induce. In particular, we test whether interventions that respect the geometry of activation space will yield behaviors close to those the model exhibits naturally. Concretely, we first fit an activation manifold $M_h$ to representations and a behavior manifold $M_y$ to output probability distributions. We then test the link $M_h \leftrightarrow M_y$ via interventions: we find that steering along $M_h$, which we term manifold steering, yields behavioral trajectories that follow $M_y$, while linear steering -- which assumes a Euclidean geometry -- cuts through off-manifold regions and hence produces unnatural outputs. Moreover, optimizing interventions in activation space to produce paths along $M_y$ recovers activation trajectories that trace the curvature of $M_h$. We demonstrate this bidirectional relationship between the geometry of representation and behavior across tasks and modalities. In language models, we use reasoning tasks with cyclic and sequential geometries as well as in-context learning tasks with more complex graph geometries. In a video world model, we use a task with geometry corresponding to physical dynamics. Overall, our work shows that geometry in neural representation is not merely incidental, but is in fact the proper object for enabling principled control via intervention on internals. This recasts the core problem of steering from finding the right direction to finding the right geometry.
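
One way to write down the two kinds of intervention path contrasted in the abstract is given below. The notation is an assumption made for illustration and is not taken from the paper: $f$ maps an activation to an output distribution, $\phi$ is a hypothetical parametrization of $M_h$ with coordinates $u_0, u_1$ for the endpoints, and $D$ is an unspecified divergence between output distributions.

```latex
% Hypothetical notation: f, \phi, u_0, u_1, D, \delta are illustrative assumptions.
\begin{align}
  h_{\mathrm{lin}}(t) &= (1 - t)\, h_0 + t\, h_1, & t \in [0, 1], \\
  h_{\mathrm{man}}(t) &= \phi\big((1 - t)\, u_0 + t\, u_1\big), & \phi(u_0) = h_0,\ \phi(u_1) = h_1, \\
  \delta(t) &= \min_{y' \in M_y} D\big(f(h(t)),\, y'\big).
\end{align}
```

In these terms, the abstract's claim reads as: $\delta(t)$ stays small along $h_{\mathrm{man}}$ and grows at intermediate $t$ along $h_{\mathrm{lin}}$, even though both paths share the endpoints $h_0$ and $h_1$.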
