Steering Along Manifolds to Control Neural Networks
TL;DR
Steering along curved manifolds in representation space outperforms linear steering for concepts with non-linear geometry, as demonstrated on Llama-3.1 8B across structured concepts including days of the week, months, letters, ages, and a synthetic in-context learning task. The paper introduces manifold steering, a method that fits a one-dimensional manifold (a path) to a model's internal activation space and steers along it rather than adding a scaled steering vector along a straight line. On the days-of-the-week task, manifold steering cleanly shifts probability mass through the sequence Monday→Tuesday→…→Sunday, tracking the circular behavior manifold in output token-distribution space (measured with Hellinger distance), whereas linear steering cuts across that manifold and produces off-target outputs that are not even days of the week. The paper also reports a bidirectional correspondence: paths derived from the representation manifold and paths derived independently by optimizing interventions to follow the behavior manifold trace out closely aligned curves in activation space, which the paper takes as evidence that internal representation geometry and external behavioral geometry are tightly coupled rather than incidentally related. Results also extend beyond language to an image-action model predicting car position on a hill. The paper argues that representation geometry is therefore a principled, causally grounded blueprint for controlling neural network behavior, not merely a post-hoc descriptive artifact.
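The contrast between the two steering styles can be illustrated with a toy sketch. This is not the paper's implementation: it assumes hypothetical 2-D "activation centroids" placed on a unit circle, and the helper names (`day_centroids`, `linear_steer`, `manifold_steer`, `nearest_day`) are invented for illustration. It shows only the geometric point that a straight line between two days cuts through the interior of the circle, while a path along the circle passes through the intermediate days.

```python
import numpy as np

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def day_centroids(radius=1.0):
    """Hypothetical 2-D stand-ins for per-day activation centroids on a circle."""
    angles = 2 * np.pi * np.arange(len(DAYS)) / len(DAYS)
    return radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)

def linear_steer(h_start, h_target, t):
    """Straight-line steering: add a scaled steering vector (h_target - h_start)."""
    return h_start + t * (h_target - h_start)

def manifold_steer(h_start, h_target, t):
    """Move along the circle from h_start toward h_target by interpolating the angle."""
    radius = np.linalg.norm(h_start)
    a0 = np.arctan2(h_start[1], h_start[0])
    a1 = np.arctan2(h_target[1], h_target[0])
    # Wrap the angular difference into [0, 2*pi) so the path runs forward through the cycle.
    delta = (a1 - a0) % (2 * np.pi)
    angle = a0 + t * delta
    return radius * np.array([np.cos(angle), np.sin(angle)])

def nearest_day(h, centroids):
    """Label a point by its closest day centroid."""
    return DAYS[int(np.argmin(np.linalg.norm(centroids - h, axis=1)))]

centroids = day_centroids()
monday, thursday = centroids[0], centroids[3]

mid_linear = linear_steer(monday, thursday, 0.5)
mid_manifold = manifold_steer(monday, thursday, 0.5)

# The linear midpoint falls far inside the circle; the manifold midpoint stays on it.
print("linear midpoint norm:  ", np.linalg.norm(mid_linear))
print("manifold midpoint norm:", np.linalg.norm(mid_manifold))

# Days visited when stepping along the circular path in thirds.
visited = [nearest_day(manifold_steer(monday, thursday, t), centroids)
           for t in (0.0, 1 / 3, 2 / 3, 1.0)]
print(visited)
```

In a real model the path would be fit in a high-dimensional activation space rather than assumed to be a perfect circle; the sketch only makes the "cutting across the manifold" failure mode concrete.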
What to take away
- 1. Llama-3.1 8B encodes the days of the week in a circular manifold in both its internal activation space and its output token-distribution space (measured via Hellinger distance), with nearby days receiving higher probability mass when any given day is the model's predicted output.
- 2. Manifold steering—fitting a 1-D path to activation space and moving along it—cleanly shifts Llama-3.1 8B's predicted day token through the full Monday-to-Sunday cycle, whereas linear steering produces off-target outputs that are not days of the week at all.
- 3. When behavior-manifold paths (derived by optimizing interventions to track output distributions) are mapped back into activation space, they align strikingly closely with representation-manifold paths derived purely from internal activations, establishing a bidirectional geometry correspondence.
- 4. The cyclic structure of days of the week emerges because 'Monday' co-occurs more frequently with 'Sunday' and 'Tuesday' than with other days in pretraining text, and this statistical regularity propagates into both representations and behavior.
- 5. The paper replicates the manifold-behavior correspondence across months, letters, ages, and a synthetic in-context learning task with predefined geometries in addition to days of the week, showing the result is not an artifact of a single concept.
- 6. Results extend across modalities: an image-action model predicting car position on a hill also exhibits manifold geometry alignment between representations and behavior, generalizing the framework beyond language models.
- 7. To fit manifolds to behavior space, the authors model clusters of output token distributions for each day using a path fitted in Hellinger-distance geometry, a methodology directly replicable for any discrete cyclic concept by substituting the relevant token set and a suitable divergence metric.
- 8. An open question the paper raises is whether the precise quantitative degree of alignment between representation and behavior manifolds varies systematically with concept complexity or model scale, as the current results are demonstrated on Llama-3.1 8B without ablations across model sizes.
- 9. The paper predicts that any concept whose pretraining data imposes a non-linear (e.g., cyclic, toroidal) statistical structure will produce analogously shaped manifolds in both activation space and behavior space, making representation geometry a general predictive tool for behavioral control.
- 10. The full paper (arXiv 2605.05115) extends analysis to a mountain-car image-action task where steering along the positional manifold controls predicted car motion, providing a non-linguistic validation of the geometry-steering framework.
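Several of the takeaways above measure behavior geometry with the Hellinger distance between output token distributions. As a minimal sketch, the distance itself is simple to compute. The synthetic distributions here are an assumption made for illustration (`toy_day_distribution` is a hypothetical helper, not data from the paper): each one peaks on a single day and decays with circular distance, mimicking the reported pattern that nearby days receive more probability mass.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (result lies in [0, 1])."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def toy_day_distribution(center, n_days=7, sharpness=2.0):
    """Hypothetical output distribution peaked on `center`, decaying with circular distance."""
    idx = np.arange(n_days)
    circ = np.minimum((idx - center) % n_days, (center - idx) % n_days)
    probs = np.exp(-sharpness * circ)
    return probs / probs.sum()

# One synthetic distribution per day, then the full pairwise distance matrix.
dists = np.stack([toy_day_distribution(d) for d in range(7)])
D = np.array([[hellinger(p, q) for q in dists] for p in dists])

# Index 0 = Monday, 1 = Tuesday, 3 = Thursday, 6 = Sunday.
print("Monday-Tuesday: ", D[0, 1])
print("Monday-Sunday:  ", D[0, 6])
print("Monday-Thursday:", D[0, 3])
```

Under these assumptions Monday is equally close to Tuesday and Sunday and farther from Thursday, which is the signature of a circular arrangement. Swapping in a different token set or divergence only requires changing the inputs to `hellinger` or replacing that function.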
Peer brief — for seminar discussion
This work investigates whether the geometric structure of a neural network's internal representations causally mirrors the geometry of its output behavior, using days of the week as the primary case study and extending to months, letters, ages, a synthetic in-context learning task, and a visual car-position prediction model. The core method introduced is manifold steering: rather than adding a fixed steering vector along a straight line in activation space (the standard approach), a 1-D path is fit to the activation-space clusters corresponding to each concept value, and interventions move a hidden representation along that curved path. The experimental substrate is Llama-3.1 8B, probed on arithmetic queries of the form 'What day comes N days after X?' Output token distributions are embedded in a Hellinger-distance geometry and also exhibit a circular arrangement across the seven days.
The load-bearing finding is a bidirectional correspondence: manifold steering along the representation path produces outputs that track the circular behavior manifold cleanly (shifting probability mass Monday→Tuesday→…→Sunday), while linear steering diverges off the behavior manifold and generates non-day tokens. More striking, when behavior-manifold paths are constructed independently, by optimizing activation-space interventions to match output distributions along the behavioral circle, they converge to nearly the same curve in activation space as the representation-derived path. This alignment holds across five or more concept domains and across at least two modalities (language and image-action). The implication the paper defends is that representation geometry is not merely a descriptive correlate of behavior but a causally grounded blueprint: knowing the shape of internal manifolds tells you how to steer outputs, and the shapes are predictable from the statistical structure of pretraining data. This connects to prior work by Engels et al. (2024) and Karkada et al. (2026) on circular and geometric structure in language model representations.
An alternative method that could have been used is activation patching or causal tracing (as in ROME/MEMIT-style editing), which also intervenes on representations but does not exploit geometric structure; comparing quantitative steering precision against that baseline would sharpen the claimed advantage of manifold steering. A critical reader would push back on the scope of the alignment claim: all demonstrated language cases involve concepts with clean, low-dimensional, human-imposed cyclic or ordinal structure (days, months, letters, ages). It is not clear that the tight representation-behavior geometry correspondence generalizes to higher-dimensional or non-ordinal semantic concepts, where pretraining co-occurrence structure is far messier. The paper raises but does not resolve whether the degree of alignment degrades with concept complexity or model scale, and it does not ablate across different Llama sizes or architectures, leaving the generality of the core claim underspecified.
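The query format quoted in the brief, 'What day comes N days after X?', is easy to enumerate with ground-truth answers. The sketch below is an assumption about how such a probe set might be built, not the paper's actual data pipeline; the function names and the offset range are hypothetical.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def next_day_prompt(start_day, n):
    """Format a query using the template quoted in the brief."""
    return f"What day comes {n} days after {start_day}?"

def next_day_answer(start_day, n):
    """Ground-truth answer via modular arithmetic over the seven-day cycle."""
    return DAYS[(DAYS.index(start_day) + n) % len(DAYS)]

def build_dataset(max_offset=7):
    """Enumerate (prompt, answer) pairs for every start day and offset 1..max_offset."""
    return [
        (next_day_prompt(day, n), next_day_answer(day, n))
        for day in DAYS
        for n in range(1, max_offset + 1)
    ]

dataset = build_dataset()
print(len(dataset))
print(dataset[0])
```

The modular arithmetic in `next_day_answer` is what makes the concept cyclic: an offset of seven returns the start day, so the correct answers wrap around rather than running off the end of a line.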
Methods (2)
- Next-Day Arithmetic Task
The evaluation task used to probe Llama's representation of days of the week: questions of the form 'What day comes N days after X?'
- Pullback Steering
The method of optimizing steering interventions in activation space to produce outputs that follow the behavior manifold, independent of the representation manifold.
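The idea behind pullback steering, optimizing an activation so that the output distribution matches a chosen target, can be sketched on a toy model. Everything here is an assumption for illustration: a random frozen linear readout `W` followed by a softmax stands in for the rest of the network, the target distribution is invented, and plain gradient descent on cross-entropy is a swapped-in optimizer rather than the paper's procedure.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax."""
    z = z - z.max()
    return z - np.log(np.sum(np.exp(z)))

def cross_entropy(target, h, W):
    """Cross-entropy between a target distribution and softmax(W @ h)."""
    return float(-np.sum(target * log_softmax(W @ h)))

def pullback_steer(h0, W, target, steps=2000):
    """Gradient descent on the activation h so that softmax(W @ h) moves toward target."""
    # Step size 1/L, where L = 0.5 * ||W||_2^2 bounds the curvature of the loss in h.
    lr = 1.0 / (0.5 * np.linalg.norm(W, 2) ** 2)
    h = h0.copy()
    for _ in range(steps):
        probs = np.exp(log_softmax(W @ h))
        grad = W.T @ (probs - target)  # gradient of the cross-entropy with respect to h
        h = h - lr * grad
    return h

rng = np.random.default_rng(0)
d_model, n_tokens = 16, 7
W = rng.normal(size=(n_tokens, d_model))  # hypothetical frozen readout, not a real model
h0 = rng.normal(size=d_model)             # hypothetical starting activation
target = np.array([0.1, 0.5, 0.2, 0.05, 0.05, 0.05, 0.05])  # invented target distribution

loss_before = cross_entropy(target, h0, W)
h_star = pullback_steer(h0, W, target)
loss_after = cross_entropy(target, h_star, W)
print(loss_before, loss_after)
```

Repeating this for a sequence of targets spaced along the behavior manifold would yield a sequence of activations, which is the kind of behavior-derived path the summary describes comparing against the representation-derived one.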
Frameworks (1)
- Geometry-Aware Steering Framework
The overarching theoretical framework proposed in the paper, asserting that steering interventions should be aligned with the geometric structure of the model's representation manifold.
Findings (9)
- Steering Llama-3.1 8B along the circular representation manifold produces outputs that follow the natural circle of the behavior manifold, cleanly shifting probability mass from Monday through successive days.
Core empirical result demonstrating that manifold steering produces on-target, behavior-aligned outputs.
- Linear steering on Llama-3.1 8B for the days-of-week task cuts across the behavior manifold, producing noisy off-target effects where predicted tokens are not even days of the week.
Empirical result demonstrating the failure mode of linear steering when concept geometry is cyclic.
- Analogous alignment between representation manifold and behavior manifold is found across months, letters, ages, and synthetic in-context learning tasks in language models.
Generalization finding from the full paper extending beyond days-of-week to other structured concepts.
- The representation-based path and the behavior-based path in Llama-3.1 8B activation space trace out similar curves, demonstrating bidirectional geometry alignment.
Key empirical result showing that optimizing for behavioral outputs and fitting representation geometry produce the same path in activation space.
- Manifold steering produces clean probability shifts along the natural behavior structure; linear steering cuts across the manifold and produces noisy, off-target effects.
Empirical demonstration on Llama-3.1 8B that steering along the representation manifold aligns outputs with the behavior manifold, whereas linear steering does not.
- Manifold geometry provides a practical steering blueprint in an image-action model predicting car position on a hill, extending results across modalities.
Cross-modality result from the full paper demonstrating that representation-behavior geometry alignment is not limited to language models.
- Llama-3.1 8B output token distributions for seven days of the week form seven clusters in a rough circle in behavior space (Hellinger distance geometry).
Empirical observation establishing that Llama's behavior for days-of-week tasks has circular structure.
- Llama-3.1 8B internal representations for the seven days of the week form seven clusters in a circle in activation space.
Empirical observation establishing that Llama's internal representations for days-of-week have circular geometric structure.
- Manifold geometry principles extend to months, letters, ages, and in-context learning tasks, as well as across modalities.
Evidence that the weekday cyclic structure is not anomalous but reflects a broader principle of concept geometry.
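The "seven clusters in a rough circle" observation in behavior space can be reproduced on synthetic data. The sketch below rests on one fact and several assumptions. The fact: Hellinger distance equals the Euclidean distance between sqrt(p)/sqrt(2) vectors, so PCA on those vectors gives a linear 2-D view of the Hellinger geometry. The assumptions: the distributions are synthetic (`toy_day_distribution` is a hypothetical helper), and PCA is an illustrative choice, not necessarily the embedding the paper uses.

```python
import numpy as np

def toy_day_distribution(center, n_days=7, sharpness=2.0):
    """Hypothetical output distribution peaked on `center`, decaying with circular distance."""
    idx = np.arange(n_days)
    circ = np.minimum((idx - center) % n_days, (center - idx) % n_days)
    probs = np.exp(-sharpness * circ)
    return probs / probs.sum()

def embed_hellinger_2d(dists):
    """Project sqrt-probability vectors onto their top two principal axes."""
    X = np.sqrt(dists) / np.sqrt(2.0)  # Euclidean distance here equals Hellinger distance
    X = X - X.mean(axis=0)             # center before PCA
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :2] * S[:2]

dists = np.stack([toy_day_distribution(d) for d in range(7)])
coords = embed_hellinger_2d(dists)

# All seven points sit at the same distance from the origin, i.e. on a circle.
radii = np.linalg.norm(coords, axis=1)
print(radii)

# The two nearest neighbours of Monday (index 0) in the embedding.
d_from_monday = np.linalg.norm(coords - coords[0], axis=1)
neighbours = {int(i) for i in np.argsort(d_from_monday)[1:3]}
print(neighbours)
```

With real model outputs the clusters would be noisy rather than perfectly symmetric, so the radii would only be approximately equal.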
Claims (8)
- The geometry of internal representations and the geometry of model behavior share a precise correspondence — representation geometry is a window into the inner world of neural networks.
The paper's deepest interpretive claim, asserting that representation structure and behavioral structure are not coincidentally aligned but deeply connected.
- The alignment between representation geometry and behavior geometry is not limited to days of the week but extends to months, letters, ages, and synthetic in-context learning tasks.
The paper's generalization claim, asserting that the days-of-week finding scales to other cyclic and structured concepts.
- There is a clear bidirectional relationship between the geometry of behavior and representation: steering along representation manifolds follows behavior manifolds, and vice versa.
The paper's finding that the alignment holds in both directions — from representation to behavior and from behavior back to representation space.
- The analogous circular structure between representation manifold and behavior manifold is downstream of training data and its implicit cyclic conceptual structure.
The paper's causal explanation for why representation and behavior geometry both appear circular for days of the week.
- Linear steering is often mismatched with a model's internal representation geometry, producing noisy, off-target effects.
The paper's critique of the standard linear steering baseline, supported by the days-of-week demo.
- Steering along manifolds provides better control than linear steering when the concept geometry is non-linear.
The central thesis of the paper, motivating the shift from linear to geometry-aware manifold steering.
- Geometric structure in neural network representations drives model behavior.
Interpretive assertion that representation geometry is not epiphenomenal but causally shapes what models do externally.
- Representation geometry and behavior geometry are bidirectionally aligned.
Core finding: the structure models use internally (representations) is precisely reflected in their external behavior (outputs).
Hypotheses (3)
- We hypothesize that representation geometry drives model behavior — the geometric structure of internal representations causally shapes what models do externally.
The causal hypothesis motivating the use of causality (intervention) as the lens connecting representation and behavior geometry.
- Manifold geometry provides a practical blueprint for steering model behavior across diverse tasks and modalities.
The generalizing predictive claim that manifold steering is a broadly applicable framework beyond the days-of-week case study.
- Language models recapitulate the cyclic structure of human concepts from pretraining data.
Explanation for why manifold geometry emerges: implicit structure in training data (co-occurrence patterns) shapes internal representations.
Questions (3)
- What if the concept being manipulated does not lie on a straight line in the model's representations?
The motivating question that opens the paper and leads to the development of manifold steering.
- How do interventions on representations causally steer behavior?
Core question motivating the shift from linear to geometry-aware steering; answered via manifold alignment analysis.
- How does representation geometry causally drive model behavior?
The central scientific question the paper addresses through the lens of interventional causality.
Original abstract
Intervening on a model's internal representations to steer behavior, known as representation steering, promises lightweight, adaptable, and granular control of neural networks. While the typical approach uses linear steering by adding a scaled steering vector to hidden representations, this paper argues that steering along manifolds—curved surfaces in representation space—provides better control. Using the cyclic concept geometry of days of the week as a case study, the authors demonstrate that geometry-aware steering reveals a deep connection between the geometry of neural network behavior and representation.
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior (in corpus, 2026, ≈ 93%)
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior. Can Rager, Matthew Kowal, Vasudev Shyam, Sheridan Feucht, Usha Bhalla, Tal Haklay, Eric Bigelow, Raphael Sarfati, Thomas McGrath, Owen Lewis, Jack Merullo, Noah Goodman, Thomas Fel, Atticus Geiger, Ekdeep Singh Lubana, Daniel Wurgaft (2026, ≈ 92%)
- The World Inside Neural Networks (in corpus, 2026, ≈ 87%)
- Curveball Steering: The Right Direction To Steer Isn't Always Linear. Hae Jin Song, Linlin Wu, Abir Harrasse, Jeff M. Phillips, Fazl Barez, Amirali Abdullah, Shivam Raval (2026, ≈ 85%)
- Mitigating Overthinking in Large Reasoning Models via Manifold Steering. Huanran Chen, Shouwei Ruan, Yichi Zhang, Xingxing Wei, Yinpeng Dong, Yao Huang (2025, ≈ 83%)
- Physics Steering: Causal Control of Cross-Domain Concepts in a Physics Foundation Model. Payel Mukhopadhyay, Michael McCabe, Alberto Bietti, Miles Cranmer, Rio Alexa Fear (2025, ≈ 82%)
- Steering Risk Preferences in Large Language Models by Aligning Behavioral and Neural Representations. Haijiang Yan, Thomas L. Griffiths, Jian-Qiao Zhu (2025, ≈ 82%)
- In-Distribution Steering: Balancing Control and Coherence in Language Model Generation. Benjamin Wong, Yann Choho, Annabelle Blangero, Milan Bhan, Arthur Vogels (2025, ≈ 81%)
- Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention. Ruixuan Deng, Junran Wang, Xinjie Shen, Chao Zhang, Zehao Jin (2026, ≈ 81%)
- Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs. Mostafa Elhoushi, Amr Alanwar, Amr Hegazy (2026, ≈ 81%)
- Representational Curvature Modulates Behavioral Uncertainty in Large Language Models. Evelina Fedorenko, Eghbal A. Hosseini, Jack King (2026, ≈ 81%)
- Are the Values of LLMs Structurally Aligned with Humans? A Causal Perspective. Junqi Wang, Yexin Li, Mengmeng Wang, Wenming Tu, Quansen Wang, Hengli Li, Tingjun Wu, Xue Feng, Fangwei Zhong, Zilong Zheng, Yipeng Kang (2025, ≈ 80%)
- HyperSteer: Activation Steering at Scale with Hypernetworks. Sidharth Baskaran, Zhengxuan Wu, Michael Sklar, Christopher Potts, Atticus Geiger, Jiuding Sun (2025, ≈ 80%)
- Improving Steering Vectors by Targeting Sparse Autoencoder Features. Sviatoslav Chalnev, Matthew Siu, Arthur Conmy (2024, ≈ 80%)
- Psychological Steering of Large Language Models (in corpus, 2026, ≈ 80%)
- The Platonic Representation Hypothesis (in corpus, 2024, ≈ 79%)
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? (in corpus, 2025, ≈ 79%)
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets (in corpus, 2023, ≈ 78%)
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs (in corpus, 2025, ≈ 78%)