claim

active

claim:linear-interventions-across-voids-in-activation-space-produce-incoherent-output-while-following-the-manifold-curve-produces-smooth-control

Linear interventions across voids in activation space produce incoherent output, while following the manifold curve produces smooth control.

General principle derived from the Mountain Car experiment: following the curved manifold yields coherent manipulation, whereas linear shortcuts across it fail.
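The geometric intuition behind this claim can be illustrated with a toy sketch. It is not the source paper's method or the Mountain Car setup: it assumes a unit-circle arc standing in for a 1D manifold in a 2D activation space, and every function name here (on_arc, off_manifold_distance, linear_path, manifold_path) is hypothetical. The sketch compares how far a straight-line path and a manifold-following path stray from the manifold between the same two endpoints.

```python
import math

def on_arc(theta):
    """Point on the unit circle, a toy 1D manifold in a 2D activation space (assumption)."""
    return (math.cos(theta), math.sin(theta))

def off_manifold_distance(point):
    """Distance from a point to the unit circle; 0.0 means the point is on the manifold."""
    return abs(math.hypot(point[0], point[1]) - 1.0)

def linear_path(theta_start, theta_end, steps):
    """Straight-line interpolation between two on-manifold points (the chord)."""
    a, b = on_arc(theta_start), on_arc(theta_end)
    return [
        (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
        for t in (i / (steps - 1) for i in range(steps))
    ]

def manifold_path(theta_start, theta_end, steps):
    """Interpolate the manifold coordinate theta, so every point stays on the arc."""
    return [
        on_arc(theta_start + (i / (steps - 1)) * (theta_end - theta_start))
        for i in range(steps)
    ]

# Two endpoints far apart along the arc; the chord between them cuts
# through the empty interior of the circle (the "void" in this toy).
theta_start, theta_end = 0.0, 0.9 * math.pi
linear_gap = max(off_manifold_distance(p) for p in linear_path(theta_start, theta_end, 11))
manifold_gap = max(off_manifold_distance(p) for p in manifold_path(theta_start, theta_end, 11))

print(f"max off-manifold distance, linear path:   {linear_gap:.3f}")
print(f"max off-manifold distance, manifold path: {manifold_gap:.3f}")
```

In this toy, the straight-line path passes through points far from any on-manifold state, while the manifold-following path never leaves the arc. Whether off-manifold points produce incoherent output in a real model is the empirical claim made by the source, not something this sketch demonstrates.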

Source paper

extracted_from

The World Inside Neural Networks

(2026) · Geiger, Atticus · Lubana, Ekdeep Singh · Fel, Thomas · Merullo, Jack +3

Neighborhood — ranked by edge-count

Findings (1)

finding

In the Mountain Car case study, car position is encoded as a 1D manifold; linear interventions cross voids and cause incoherence, whereas following the 1D curve produces smooth control.
supports
Empirical demonstration that a semantically meaningful variable is encoded as a curved manifold, and that respecting its geometry is critical for effective intervention.

Concepts (6)

concept

The Void
cites
The property that the most profound centers have at their heart a void like water, infinite in depth, surrounded by and contrasted with the clutter around it; the calm emptiness needed by every center to give it the basis of its strength
Activation space
cites
Representation space on which linear probes operate to attribute harmful behaviors to training data.
manifold
cites
A smooth, potentially curved surface in activation space along which activations vary according to a coherent semantic dimension.
incoherence
cites
Nonsensical or unphysical model outputs that result when interventions cross voids in activation space.
linear intervention
cites
Manipulation of activations along a straight line; shown to fail when it crosses voids, in contrast to manifold-following interventions.
smooth control
cites
Coherent, predictable changes in model behavior achieved by navigating along the learned manifold rather than using straight-line interventions.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Optimizing interventions in activation space to produce paths along M_y recovers activation trajectories that trace the curvature of M_h.
finding · 0.823
Demonstrates bidirectional causal link: behavior manifold geometry can be recovered by optimizing in representation space.
Interventions along activation manifold M_h yield behavioral trajectories following behavior manifold M_y, and vice versa; this bidirectional relationship is demonstrated across language models and video world models.
finding · 0.802
Central empirical result showing causal coupling between representation and behavior geometry across multiple substrates and modalities.
We hypothesize that interventions that respect the geometry of activation space will yield behaviors close to those the model exhibits naturally.
hypothesis · 0.792
The core testable hypothesis driving the experimental design.
Linear steering produces noisy off-target effects; manifold steering cleanly shifts probability mass between sequential concepts.
finding · 0.786
Core empirical claim comparing steering approaches on cyclic concepts.
Activation manifolds and behavior manifolds are approximately isometric across cyclic and sequential concepts.
claim · 0.772
Representation geometry causally shapes behavior; activation and behavior manifolds are approximately isometric.
claim · 0.771
Manifold steering produces clean probability shifts along the natural behavior structure; linear steering cuts across the manifold and produces noisy off-target effects.
finding · 0.768
Empirical demonstration on Llama-3.1-8B that steering along representation manifold aligns outputs with behavior manifold, whereas linear steering does not.
The assumption that the Assistant persona corresponds to a linear direction in activation space is likely flawed; some information may be represented nonlinearly or encoded in weights rather than activations.
claim · 0.766
Limitation acknowledgment about the adequacy of the linear representation assumption.