finding
active
finding:in-the-mountain-car-case-study-car-position-is-a-1d-manifold-linear-interventions-cross-voids-causing-incoherence-following-the-1d-curve-produces-smooth-control

In the Mountain Car case study, car position is encoded as a 1D manifold. Linear interventions cross voids in activation space and cause incoherent behavior, whereas interventions that follow the 1D curve produce smooth control.

Empirical demonstration that a semantically meaningful variable is encoded as a curved manifold, and that respecting its geometry is critical for effective intervention.
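
The contrast between the two kinds of intervention can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the source paper: the half-circle "manifold", the 2D "activation space", and the helper names `embed`, `linear_path`, `manifold_path`, and `off_manifold_distance` are all hypothetical. It only shows why a straight line between two on-manifold points can leave a curved manifold, while moving along the curve's own parameter cannot.

```python
import math

def embed(position):
    """Hypothetical encoder: map a scalar position in [0, 1] onto a
    half-circle arc of radius 1 in a 2D toy activation space."""
    angle = math.pi * position
    return (math.cos(angle), math.sin(angle))

def linear_path(p_start, p_end, steps):
    """Straight-line intervention: interpolate between the two embedded
    endpoints directly in activation space."""
    a, b = embed(p_start), embed(p_end)
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append(((1 - t) * a[0] + t * b[0],
                     (1 - t) * a[1] + t * b[1]))
    return path

def manifold_path(p_start, p_end, steps):
    """Manifold-following intervention: interpolate the underlying
    position, then embed, so every point stays on the curve."""
    return [embed(p_start + (p_end - p_start) * i / (steps - 1))
            for i in range(steps)]

def off_manifold_distance(point):
    """Distance from a point to the unit circle. For points with a
    non-negative second coordinate this equals the distance to the arc."""
    return abs(math.hypot(point[0], point[1]) - 1.0)

if __name__ == "__main__":
    for name, path in [("linear", linear_path(0.0, 1.0, 5)),
                       ("manifold", manifold_path(0.0, 1.0, 5))]:
        worst = max(off_manifold_distance(p) for p in path)
        print(f"{name}: max distance from manifold = {worst:.3f}")
```

In this toy setup the straight line between the two ends of the arc passes through the origin at its midpoint, a full unit away from the curve, which plays the role of a void. The manifold-following path never leaves the arc. Real activation spaces are high-dimensional and the learned curve is not known in closed form, so this is only a geometric intuition, not a description of the method used in the case study.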

Source paper

extracted_from
The World Inside Neural Networks
(2026) · Geiger, Atticus · Lubana, Ekdeep Singh · Fel, Thomas · Merullo, Jack +3

Neighborhood — ranked by edge-count

Claims (3)

claim

Communities (2)

community

Concepts (8)

concept
  • Benchmark demonstrating that car position is encoded as a 1D manifold; linear interventions fail by crossing voids, while following the geometric curve produces smooth control.
  • The property that the most profound centers have at their heart a void like water, infinite in depth, surrounded by and contrasted with the clutter around it; the calm emptiness needed by every center to give it the basis of its strength.
  • Representation space on which linear probes operate to attribute harmful behaviors to training data.
  • A single continuous curve in activation space encoding a single variable, such as car position in the Mountain Car case.
  • Nonsensical or unphysical model outputs that result when interventions cross voids in activation space.
  • Manipulation of activations along a straight line; shown to fail when it crosses voids, in contrast to manifold-following interventions.
  • Coherent, predictable changes in model behavior achieved by navigating along the learned manifold rather than using straight-line interventions.
  • The scalar variable representing the car's location in the Mountain Car reinforcement learning problem, found to be encoded as a 1D manifold.
