claim
active
claim:linear-interventions-across-voids-in-activation-space-produce-incoherent-output-while-following-the-manifold-curve-produces-smooth-control

Linear interventions across voids in activation space produce incoherent output, while following the manifold curve produces smooth control.

General principle derived from the Mountain Car experiment: curved manifold-following yields coherent manipulation, linear shortcuts fail.

Source paper

extracted_from
The World Inside Neural Networks
(2026) · Geiger, Atticus · Lubana, Ekdeep Singh · Fel, Thomas · Merullo, Jack +3

Neighborhood — ranked by edge-count

Findings (1)

finding

Concepts (6)

concept
  • The property that the most profound centers have at their heart a void like water, infinite in depth, surrounded by and contrasted with the clutter around it; the calm emptiness needed by every center to give it the basis of its strength
  • Representation space on which linear probes operate to attribute harmful behaviors to training data.
  • A smooth, potentially curved surface in activation space along which activations vary according to a coherent semantic dimension.
  • Nonsensical or unphysical model outputs that result when interventions cross voids in activation space.
  • Manipulation of activations along a straight line; shown to fail when it crosses voids, in contrast to manifold-following interventions.
  • Coherent, predictable changes in model behavior achieved by navigating along the learned manifold rather than using straight-line interventions.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.