claim
active
claim:linear-interventions-across-voids-in-activation-space-produce-incoherent-output-while-following-the-manifold-curve-produces-smooth-control
Linear interventions across voids in activation space produce incoherent output, while following the manifold curve produces smooth control.
General principle derived from the Mountain Car experiment: interventions that follow the curved manifold yield coherent manipulation, whereas linear shortcuts across it fail.
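The geometric intuition behind this principle can be sketched with a toy example. This is not the Mountain Car setup or the source paper's method. It only assumes, for illustration, that a scalar variable is encoded on a unit circle in a 2-D "activation space". All names (`encode`, `linear_path`, `manifold_path`, `distance_from_manifold`) are hypothetical.

```python
import math

def encode(theta):
    """Toy 'activation' for a variable theta: a point on the unit circle (assumed manifold)."""
    return (math.cos(theta), math.sin(theta))

def linear_path(a, b, steps):
    """Straight-line interpolation between two activations (the linear intervention)."""
    return [tuple((1 - i / steps) * x + (i / steps) * y for x, y in zip(a, b))
            for i in range(steps + 1)]

def manifold_path(theta_a, theta_b, steps):
    """Interpolate the variable itself, then re-encode, so every point stays on the circle."""
    return [encode(theta_a + (theta_b - theta_a) * i / steps)
            for i in range(steps + 1)]

def distance_from_manifold(point):
    """Distance of a point from the unit circle."""
    return abs(math.hypot(*point) - 1.0)

start, end = 0.0, math.pi  # two opposite points on the circle
lin = linear_path(encode(start), encode(end), 10)
man = manifold_path(start, end, 10)

# Largest deviation from the manifold along each path.
max_lin = max(distance_from_manifold(p) for p in lin)
max_man = max(distance_from_manifold(p) for p in man)

print(f"max off-manifold distance, linear path:   {max_lin:.3f}")
print(f"max off-manifold distance, manifold path: {max_man:.3f}")
```

In this toy, the midpoint of the straight line lies at the circle's centre, a region where no encoded value lives (the "void"), while the manifold-following path never leaves the circle. Whether a real model produces incoherent output at such off-manifold points is the empirical claim above, not something this sketch demonstrates.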
Source paper
extracted_from · (2026) · Geiger, Atticus · Lubana, Ekdeep Singh · Fel, Thomas · Merullo, Jack +3
Neighborhood — ranked by edge-count
Findings (1)
finding
- Empirical demonstration that a semantically meaningful variable is encoded as a curved manifold, and that respecting its geometry is critical for effective intervention.
Concepts (6)
concept
- The Void · cites · The property that the most profound centers have at their heart a void like water, infinite in depth, surrounded by and contrasted with the clutter around it; the calm emptiness needed by every center to give it the basis of its strength.
- Activation space · cites · Representation space on which linear probes operate to attribute harmful behaviors to training data.
- manifold · cites · A smooth, potentially curved surface in activation space along which activations vary according to a coherent semantic dimension.
- incoherence · cites · Nonsensical or unphysical model outputs that result when interventions cross voids in activation space.
- linear intervention · cites · Manipulation of activations along a straight line; shown to fail when it crosses voids, in contrast to manifold-following interventions.
- smooth control · cites · Coherent, predictable changes in model behavior achieved by navigating along the learned manifold rather than using straight-line interventions.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Demonstrates bidirectional causal link: behavior manifold geometry can be recovered by optimizing in representation space.
- Central empirical result showing causal coupling between representation and behavior geometry across multiple substrates and modalities.
- The core testable hypothesis driving the experimental design
- Core empirical claim comparing steering approaches on cyclic concepts.
- Empirical demonstration on Llama-3.1-8B that steering along representation manifold aligns outputs with behavior manifold, whereas linear steering does not.
- Limitation acknowledgment about the adequacy of the linear representation assumption