paper:wurgaft-goodfire-manifold-steering-2026
Steering Along Manifolds to Control Neural Networks
TL;DR
Representation steering interventions that follow the curved geometry of a concept's activation manifold outperform linear steering for cyclic and sequential concepts. Working primarily with Llama-3.1 8B, Wurgaft et al. introduce *manifold steering*, a geometry-aware intervention that moves activations along the learned surface a concept occupies rather than in a straight line through ambient activation space. The key empirical finding is that days-of-the-week representations in Llama-3.1 8B form a circle that appears with the same structure in both internal activation space and output token-probability distributions. Linear steering across this circle produces off-target probability mass and degraded specificity, while manifold steering cleanly transfers probability between adjacent days. The method generalizes beyond days to months, letters, ages, and a video world model, suggesting the phenomenon is substrate-general rather than task-specific. Paired with related work on geometric computation (the Feucht et al. geometric-calculator results, arXiv 2026), this paper argues that neural networks do not merely represent concepts in linear subspaces but actively compute on curved manifolds. On this view, any steering or control technique that ignores this geometry will incur systematic off-target effects, which makes manifold geometry a load-bearing design constraint for mechanistic interpretability and feature-based model control.
What to take away
- 1. Manifold steering applied to days-of-the-week in Llama-3.1 8B produces cleaner probability shifts between adjacent days than linear steering, which spreads probability mass onto non-target days.
- 2. The circular manifold structure for days-of-the-week appears with the same topology in both Llama-3.1 8B's internal residual-stream activations and its output token-probability distributions, confirming cross-space geometric consistency.
- 3. The core method introduced is manifold steering, which parameterizes interventions as geodesic paths along the concept manifold rather than straight-line displacements in the full activation space.
- 4. Linear steering across cyclic concept structures (e.g., Wednesday → Friday) produces off-target probability mass on non-target days, a failure mode that manifold steering largely eliminates.
- 5. The geometric structure and steering improvement generalize across at least four concept types: days of the week, months of the year, letters of the alphabet, and age representations.
- 6. Manifold steering also extends beyond language, to a video world model on a task whose geometry corresponds to physical dynamics, suggesting the geometry-respecting intervention principle is not specific to language model activations.
- 7. The paper raises the open question of whether non-cyclic but compositionally structured concepts (e.g., emotional or social-cognitive states) also occupy curved manifolds that would similarly require geometry-aware steering.
- 8. As a replicable methodology, the authors identify concept manifold geometry by projecting activations onto low-dimensional subspaces and fitting the topology (e.g., circular for periodic concepts) before designing the intervention path.
- 9. This work is the operational counterpart to the Feucht et al. 2026 geometric-calculator findings, which established that Llama-3.1 8B computes on geometric manifolds; together they form a discovery-then-control pipeline.
- 10. The authors argue that any feature steering approach — including safety-relevant interventions targeting specific behavioral features — will incur systematic errors if it assumes linear concept geometry in networks that encode concepts on curved surfaces.
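Takeaways 3 and 8 describe a two-step recipe: project activations to a low-dimensional subspace and fit the topology, then steer along the fitted surface. The sketch below illustrates that recipe on synthetic data only. It assumes seven noise-free points on a circle embedded in a 16-dimensional space, uses PCA plus an algebraic least-squares circle fit as a stand-in for whatever fitting procedure the authors use, and all function names (`fit_circle`, `plane_coords`, `linear_path`, `manifold_path`) are hypothetical.

```python
import numpy as np


def fit_circle(acts):
    """PCA-project activations to 2-D, then least-squares fit a circle there."""
    mean = acts.mean(axis=0)
    # Rows of vt are the principal directions of the centered activations.
    _, _, vt = np.linalg.svd(acts - mean, full_matrices=False)
    basis = vt[:2]                                    # shape (2, d)
    x, y = ((acts - mean) @ basis.T).T
    # Algebraic circle fit: x^2 + y^2 = 2*cx*x + 2*cy*y + c.
    design = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
    cx, cy, c = np.linalg.lstsq(design, x**2 + y**2, rcond=None)[0]
    center = np.array([cx, cy])
    radius = np.sqrt(c + cx**2 + cy**2)
    return mean, basis, center, radius


def plane_coords(h, fit):
    """Coordinates of h in the fitted plane, relative to the circle center."""
    mean, basis, center, _ = fit
    return (h - mean) @ basis.T - center


def linear_path(h_src, h_tgt, n_steps):
    """Straight-line interpolation through ambient activation space."""
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1 - t) * h_src + t * h_tgt


def manifold_path(h_src, h_tgt, fit, n_steps):
    """Interpolate the angle along the fitted circle (shortest arc)."""
    mean, basis, center, radius = fit
    x0, y0 = plane_coords(h_src, fit)
    x1, y1 = plane_coords(h_tgt, fit)
    a0, a1 = np.arctan2(y0, x0), np.arctan2(y1, x1)
    # Signed angular difference wrapped into [-pi, pi).
    delta = (a1 - a0 + np.pi) % (2 * np.pi) - np.pi
    angles = a0 + np.linspace(0.0, 1.0, n_steps) * delta
    on_circle = center + radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    # Keep the source's component orthogonal to the fitted plane unchanged.
    off_plane = (h_src - mean) - ((h_src - mean) @ basis.T) @ basis
    return mean + on_circle @ basis + off_plane


# Synthetic stand-in for per-day activations (assumption: ideal circle, no noise).
rng = np.random.default_rng(0)
dim, true_radius = 16, 3.0
plane, _ = np.linalg.qr(rng.normal(size=(dim, 2)))    # orthonormal columns
day_angles = 2 * np.pi * np.arange(7) / 7             # index 0 = Monday
acts = true_radius * np.stack([np.cos(day_angles), np.sin(day_angles)], axis=1) @ plane.T
acts = acts + rng.normal(size=dim)                    # constant offset for all days

fit = fit_circle(acts)
lin = linear_path(acts[2], acts[5], 9)                # Wednesday -> Saturday
man = manifold_path(acts[2], acts[5], fit, 9)
lin_r = np.linalg.norm(plane_coords(lin, fit), axis=1)
man_r = np.linalg.norm(plane_coords(man, fit), axis=1)
print("min radius on linear path:  ", lin_r.min())
print("min radius on manifold path:", man_r.min())
```

In this toy setting the manifold path keeps a constant distance from the circle center, while the linear path dips toward the center partway through. With real activations the fit would be noisy, and the choice of layer, subspace dimension, and topology would all need to be validated rather than assumed.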
Peer brief — for seminar discussion
Working with Llama-3.1 8B as the primary substrate, this paper investigates whether steering interventions in neural network activation space should respect the curved geometry of the manifold a concept occupies rather than taking straight-line paths through ambient space. The method introduced is manifold steering, which first identifies the topological structure of a concept's representation (e.g., a circle for days of the week) and then moves activations along geodesic paths on that surface. The alternative it implicitly competes against is standard linear or mean-difference steering, which adds a fixed direction vector regardless of where on the concept manifold the current activation lies.
The load-bearing finding is twofold. First, days-of-the-week representations in Llama-3.1 8B form a clean circular manifold that is structurally identical across two distinct measurement spaces: the internal residual-stream activations and the output token-probability distributions. Second, when a steering intervention must cross this circle (say, shifting model behavior from Wednesday to Saturday), linear steering scatters probability mass across non-target days, while manifold steering concentrates it on the intended target. The generalization sweep is important: the same geometry-respecting advantage holds for months, letters, ages, and a video world model, making this more than a days-of-week curiosity.
The implications the paper presses are strong. If concepts are encoded on curved surfaces and networks compute by moving along those surfaces (as the companion Feucht et al. 2026 geometric-calculator result establishes for Llama-3.1 8B), then linear steering is structurally misspecified for any cyclic or sequential concept. This reframes the recurring problem of off-target effects in feature steering not as a noise-floor issue but as a geometry mismatch.
For safety-relevant steering (say, shifting activations away from harmful behavioral features), the prediction is that linear interventions will have systematic collateral effects that scale with the curvature of the relevant concept manifold.
A critical reader would push back on scope: the cases highlighted here (days, months, letters, ages) are all periodic or ordinal concepts with obvious low-dimensional circular or linear geometry, which makes manifold identification tractable. The abstract does report in-context learning tasks with more complex graph geometries, but it is far less clear that richer, higher-dimensional concepts (sentiment, deception, care-like states) occupy manifolds with recoverable topology, and the paper does not demonstrate manifold steering on any such case. The claim that the method generalizes to safety-relevant interventions therefore outruns the evidence. A skeptic would also note that the advantage of manifold over linear steering is shown qualitatively (probability mass plots) in the primary days-of-week case; a quantitative ablation across steering magnitude and layer depth on a held-out concept class would be needed to establish how large the advantage is and when it collapses.
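The "geometry mismatch" framing can be made concrete with a short worked calculation. It rests on an idealization that is not taken from the paper: the seven days sit evenly spaced on a circle of radius $r$ centered at $c$, spanned by orthonormal directions $u$ and $v$. The symbols $h_{\text{lin}}$, $h_{\text{man}}$, and $\Delta$ are introduced here for illustration only.

```latex
\begin{align*}
h(\theta) &= c + r\,(\cos\theta\, u + \sin\theta\, v)
  && \text{idealized concept circle} \\
h_{\text{lin}}(t) &= (1-t)\,h(\theta_1) + t\,h(\theta_2)
  && \text{linear steering path} \\
h_{\text{man}}(t) &= h(\theta_1 + t\,\Delta), \quad \Delta = \theta_2 - \theta_1
  && \text{path along the circle} \\
\lVert h_{\text{man}}(t) - c \rVert &= r
  && \text{for all } t \in [0, 1] \\
\lVert h_{\text{lin}}(\tfrac{1}{2}) - c \rVert &= r\,\lvert\cos(\Delta/2)\rvert
  && \text{chord midpoint} \\
\Delta_{\text{Wed}\to\text{Sat}} &= 3 \cdot \tfrac{2\pi}{7} = \tfrac{6\pi}{7}
  \;\Rightarrow\; r\cos\!\left(\tfrac{3\pi}{7}\right) \approx 0.22\,r
\end{align*}
```

Under these assumptions, a straight line from Wednesday to Saturday passes through a point only about 22% as far from the center as any on-manifold activation, whereas a step between adjacent days ($\Delta = 2\pi/7$) stays at about $0.90\,r$. This is one way to see why the off-manifold excursion grows with the angular distance being crossed.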
Findings (2)
- Linear steering produces noisy off-target effects; manifold steering cleanly shifts probability mass between sequential concepts.
Core empirical claim comparing steering approaches on cyclic concepts.
- Days-of-Week Cyclic Structure
Key empirical result: days of the week appear as the same circular manifold in both Llama-3.1 8B internal activations and output token-probability distributions.
Claims (2)
- Conceptual geometry is consistent across representation space and behavior space.
Interpretive assertion: the same geometric structure (e.g. circular for days) appears identically in both internal activations and output probabilities.
- Networks compute on geometric manifolds and control should respect that geometry.
Strong interpretive assertion linking discovery and control: neural computation is fundamentally manifold-structured.
Original abstract
Neural representations carry rich geometric structure; but does that structure causally shape behavior? To address this question, we intervene along paths through activation space defined by different geometries, and measure the behavioral trajectories they induce. In particular, we test whether interventions that respect the geometry of activation space will yield behaviors close to those the model exhibits naturally. Concretely, we first fit an activation manifold $M_h$ to representations and a behavior manifold $M_y$ to output probability distributions. We then test the link $M_h \leftrightarrow M_y$ via interventions: we find that steering along $M_h$, which we term manifold steering, yields behavioral trajectories that follow $M_y$, while linear steering -- which assumes a Euclidean geometry -- cuts through off-manifold regions and hence produces unnatural outputs. Moreover, optimizing interventions in activation space to produce paths along $M_y$ recovers activation trajectories that trace the curvature of $M_h$. We demonstrate this bidirectional relationship between the geometry of representation and behavior across tasks and modalities. In language models, we use reasoning tasks with cyclic and sequential geometries as well as in-context learning tasks with more complex graph geometries. In a video world model, we use a task with geometry corresponding to physical dynamics. Overall, our work shows that geometry in neural representation is not merely incidental, but is in fact the proper object for enabling principled control via intervention on internals. This recasts the core problem of steering from finding the right direction to finding the right geometry.
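The abstract's claim that linear steering "cuts through off-manifold regions and hence produces unnatural outputs" can be illustrated with a toy behavioral readout. The sketch below is an assumption-laden illustration, not the paper's model or evaluation: it places seven days on a unit circle, uses a hypothetical linear readout (`READOUT`) followed by a softmax with an arbitrary `sharpness`, and compares the output distribution at the midpoint of a linear path with the midpoint of a path along the circle.

```python
import numpy as np

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
ANGLES = 2 * np.pi * np.arange(7) / 7
# Hypothetical readout: one unit vector per day, evenly spaced on a circle.
READOUT = np.stack([np.cos(ANGLES), np.sin(ANGLES)], axis=1)   # shape (7, 2)


def day_probs(h, sharpness=4.0):
    """Toy output distribution over days for a 2-D activation h."""
    logits = sharpness * (READOUT @ h)
    logits = logits - logits.max()        # numerical stability
    p = np.exp(logits)
    return p / p.sum()


def entropy(p):
    """Shannon entropy in nats."""
    return float(-(p * np.log(p + 1e-12)).sum())


src, tgt = 2, 5                           # Wednesday -> Saturday
h_src, h_tgt = READOUT[src], READOUT[tgt]

# Midpoint of the straight line between the two activations.
h_lin_mid = 0.5 * (h_src + h_tgt)
# Midpoint of the arc between them, staying on the unit circle.
mid_angle = 0.5 * (ANGLES[src] + ANGLES[tgt])
h_man_mid = np.array([np.cos(mid_angle), np.sin(mid_angle)])

p_lin, p_man = day_probs(h_lin_mid), day_probs(h_man_mid)
off_arc = [0, 1, 6]                       # Mon, Tue, Sun: not between Wed and Sat

for name, p in [("linear", p_lin), ("manifold", p_man)]:
    print(name, "off-arc mass:", p[off_arc].sum(), "entropy:", entropy(p))
```

In this toy, the linear midpoint has a much smaller norm than any on-circle activation, so its logits flatten and more probability leaks onto days outside the Wednesday-to-Saturday arc. Whether real models behave this way depends on how their readout treats off-manifold activations, which is the empirical question the paper addresses.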