Steering Along Manifolds to Control Neural Networks

TL;DR

Steering along curved manifolds in representation space outperforms linear steering for concepts with non-linear geometry, as demonstrated on Llama-3.1 8B across days of the week, months, letters, ages, and a synthetic in-context learning task. The paper introduces manifold steering, a method that fits a one-dimensional manifold (a path) to a model's internal activations and steers along it rather than adding a flat steering vector. On the days-of-the-week task, manifold steering cleanly shifts probability mass through the sequence Monday→Tuesday→…→Sunday, tracking the circular behavior manifold in output token-distribution space (measured with Hellinger distance), while linear steering cuts across that manifold and produces off-target outputs that are not even days of the week. Crucially, the paper shows a bidirectional correspondence: paths derived from the representation manifold and paths derived independently by optimizing for behavior-manifold outputs converge to near-identical curves in activation space, which indicates that internal representation geometry and external behavioral geometry are tightly coupled rather than incidentally related. Results also extend beyond language to an image-action model predicting car position on a hill. The paper argues that representation geometry is therefore a principled, causally grounded blueprint for controlling neural network behavior, not merely a post-hoc descriptive artifact.

What to take away

  1. Llama-3.1 8B encodes the days of the week in a circular manifold in both its internal activation space and its output token-distribution space (measured via Hellinger distance), with nearby days receiving higher probability mass when any given day is the model's predicted output.
  2. Manifold steering (fitting a 1-D path to activation space and moving along it) cleanly shifts Llama-3.1 8B's predicted day token through the full Monday-to-Sunday cycle, whereas linear steering produces off-target outputs that are not days of the week at all.
  3. When behavior-manifold paths (derived by optimizing interventions to track output distributions) are mapped back into activation space, they align strikingly closely with representation-manifold paths derived purely from internal activations, establishing a bidirectional geometry correspondence.
  4. The cyclic structure of days of the week emerges because 'Monday' co-occurs more frequently with 'Sunday' and 'Tuesday' than with other days in pretraining text, and this statistical regularity propagates into both representations and behavior.
  5. The paper replicates the manifold-behavior correspondence across months, letters, ages, and a synthetic in-context learning task with predefined geometries in addition to days of the week, showing the result is not an artifact of a single concept.
  6. Results extend across modalities: an image-action model predicting car position on a hill also exhibits manifold geometry alignment between representations and behavior, generalizing the framework beyond language models.
  7. To fit manifolds to behavior space, the authors model clusters of output token distributions for each day using a path fitted in Hellinger-distance geometry, a methodology directly replicable for any discrete cyclic concept by substituting the relevant token set and a suitable divergence metric.
  8. An open question the paper raises is whether the precise quantitative degree of alignment between representation and behavior manifolds varies systematically with concept complexity or model scale, as the current results are demonstrated on Llama-3.1 8B without ablations across model sizes.
  9. The paper predicts that any concept whose pretraining data imposes a non-linear (e.g., cyclic, toroidal) statistical structure will produce analogously shaped manifolds in both activation space and behavior space, making representation geometry a general predictive tool for behavioral control.
  10. The full paper (arXiv 2605.05115) extends analysis to a mountain-car image-action task where steering along the positional manifold controls predicted car motion, providing a non-linguistic validation of the geometry-steering framework.
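
Several of the takeaways above rely on the Hellinger distance between output token distributions. The following is a minimal sketch of that metric, not the authors' code. The function `toy_day_distribution` and its probability values are hypothetical, chosen only so that each day puts most mass on itself and some on its two neighbors.

```python
import math

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def toy_day_distribution(k, n=7, peak=0.6, near=0.15):
    """Hypothetical output distribution when day k is the predicted token.

    Assumption for illustration only: mass `peak` on day k, `near` on each
    cyclic neighbor, and the remainder spread evenly over the other days.
    """
    rest = (1.0 - peak - 2.0 * near) / (n - 3)
    dist = [rest] * n
    dist[k] = peak
    dist[(k - 1) % n] = near
    dist[(k + 1) % n] = near
    return dist


if __name__ == "__main__":
    dists = [toy_day_distribution(k) for k in range(len(DAYS))]
    for k, name in enumerate(DAYS):
        # Distance from Monday's distribution to every other day's distribution.
        print(f"Monday -> {name}: {hellinger(dists[0], dists[k]):.3f}")
```

Under these toy distributions, Monday is equally close to Tuesday and Sunday and farther from mid-week days, which is the circular arrangement the takeaways describe. With real model outputs, `p` and `q` would be the softmax probabilities restricted to the relevant token set.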

Peer brief — for seminar discussion

This work investigates whether the geometric structure of a neural network's internal representations causally mirrors the geometry of its output behavior, using days of the week as the primary case study and extending to months, letters, ages, a synthetic in-context learning task, and a visual car-position prediction model. The core method introduced is manifold steering: rather than adding a fixed steering vector along a straight line in activation space (the standard approach), a 1-D path is fit to the activation-space clusters corresponding to each concept value, and interventions move a hidden representation along that curved path. The experimental substrate is Llama-3.1 8B, probed on arithmetic queries of the form 'What day comes N days after X?' Output token distributions are embedded in a Hellinger-distance geometry and also exhibit a circular arrangement across the seven days.

The load-bearing finding is a bidirectional correspondence: manifold steering along the representation path produces outputs that track the circular behavior manifold cleanly (shifting probability mass Monday→Tuesday→…→Sunday), while linear steering diverges off the behavior manifold and generates non-day tokens. More striking, when behavior-manifold paths are constructed independently, by optimizing activation-space interventions to match output distributions along the behavioral circle, they converge to nearly the same curve in activation space as the representation-derived path. This alignment holds across at least five concept domains and at least two modalities (language and image-action). The implication the paper defends is that representation geometry is not merely a descriptive correlate of behavior but a causally grounded blueprint: knowing the shape of internal manifolds tells you how to steer outputs, and the shapes are predictable from the statistical structure of pretraining data. This connects to prior work by Engels et al. (2024) and Karkada et al. (2026) on circular and geometric structure in language model representations.

An alternative method that could have been used is activation patching or causal tracing (as in ROME/MEMIT-style editing), which also intervenes on representations but does not exploit geometric structure; comparing quantitative steering precision against that baseline would sharpen the claimed advantage of manifold steering. A critical reader would push back on the scope of the alignment claim: all demonstrated cases involve concepts with clean, low-dimensional, human-imposed cyclic or ordinal structure (days, months, letters, ages). It is not clear that the tight representation-behavior geometry correspondence generalizes to higher-dimensional or non-ordinal semantic concepts, where pretraining co-occurrence structure is far messier. The paper raises but does not resolve whether the degree of alignment degrades with concept complexity or model scale, and it does not ablate across different Llama sizes or architectures, leaving the generality of the core claim underspecified.
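
The contrast between linear and manifold steering described above can be illustrated with a toy 2-D example. This is a sketch under strong assumptions, not the paper's procedure: the seven day centroids are placed on an ideal circle, the "manifold" is a circle fitted by averaging (the paper fits a path in the model's high-dimensional activation space), and every function name here is hypothetical.

```python
import math


def make_centroids(n=7, radius=1.0):
    """Toy stand-in for per-day activation centroids, evenly spaced on a circle."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]


def fit_circle(points):
    """Crude circle fit: centre is the mean point, radius the mean distance to it."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    radius = sum(math.hypot(x - cx, y - cy) for x, y in points) / len(points)
    return (cx, cy), radius


def linear_steer(src, dst, t):
    """Linear steering: move a fraction t along the straight line src -> dst."""
    return tuple(a + t * (b - a) for a, b in zip(src, dst))


def manifold_steer(src, dst, t, center, radius):
    """Manifold steering: move a fraction t along the shorter arc of the fitted circle."""
    cx, cy = center
    a0 = math.atan2(src[1] - cy, src[0] - cx)
    a1 = math.atan2(dst[1] - cy, dst[0] - cx)
    delta = (a1 - a0 + math.pi) % (2 * math.pi) - math.pi  # signed shortest arc
    angle = a0 + t * delta
    return (cx + radius * math.cos(angle), cy + radius * math.sin(angle))


def off_manifold(point, center, radius):
    """Distance from a point to the fitted circle."""
    return abs(math.hypot(point[0] - center[0], point[1] - center[1]) - radius)


if __name__ == "__main__":
    cents = make_centroids()
    center, radius = fit_circle(cents)
    monday, thursday = cents[0], cents[3]
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        lin = off_manifold(linear_steer(monday, thursday, t), center, radius)
        man = off_manifold(manifold_steer(monday, thursday, t, center, radius), center, radius)
        print(f"t={t:.2f}  linear off-manifold={lin:.3f}  manifold off-manifold={man:.3f}")
```

In this toy setting the straight-line path from Monday to Thursday passes through the interior of the circle, far from any day centroid, while the arc stays on the fitted curve at every step. That is the geometric intuition behind linear steering leaving the behavior manifold; whether a real model's intermediate states decode to non-day tokens is an empirical result reported by the paper, not something this sketch demonstrates.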

Methods (2)

  • Next-Day Arithmetic Task
    The evaluation task used to probe Llama's representation of days of the week: questions of the form 'What day comes N days after X?'
  • Pullback Steering
    The method of optimizing steering interventions in activation space to produce outputs that follow the behavior manifold, independent of the representation manifold.
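
For readers who want to reproduce the Next-Day Arithmetic Task listed above, the following sketch generates prompts and their ground-truth answers. Only the question form 'What day comes N days after X?' comes from the source. The exact wording, the range of N, and the helper names `make_example` and `build_dataset` are assumptions for illustration.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def make_example(start, n):
    """Return (prompt, answer) for one query; the answer wraps around the week."""
    if start not in DAYS:
        raise ValueError(f"unknown day: {start}")
    # Template follows the question form quoted in the source; for n == 1 a real
    # prompt might say "1 day", which this sketch does not special-case.
    prompt = f"What day comes {n} days after {start}?"
    answer = DAYS[(DAYS.index(start) + n) % len(DAYS)]
    return prompt, answer


def build_dataset(max_offset=6):
    """All (prompt, answer) pairs for every start day and offsets 1..max_offset."""
    return [make_example(start, n)
            for start in DAYS
            for n in range(1, max_offset + 1)]


if __name__ == "__main__":
    for prompt, answer in build_dataset(max_offset=2)[:4]:
        print(prompt, "->", answer)
```

The modular index is what makes the task cyclic: offsets that cross Sunday wrap back to Monday, so correct answers require the model to treat the week as a loop rather than a line.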

Frameworks (1)

  • Geometry-Aware Steering Framework
    The overarching theoretical framework proposed in the paper, asserting that steering interventions should be aligned with the geometric structure of the model's representation manifold.

Original abstract

Intervening on a model's internal representations to steer behavior, known as representation steering, promises lightweight, adaptable, and granular control of neural networks. While the typical approach uses linear steering by adding a scaled steering vector to hidden representations, this paper argues that steering along manifolds—curved surfaces in representation space—provides better control. Using the cyclic concept geometry of days of the week as a case study, the authors demonstrate that geometry-aware steering reveals a deep connection between the geometry of neural network behavior and representation.
