question

active

question:what-if-the-concept-being-manipulated-does-not-lie-on-a-straight-line-in-the-model-s-representations

What if the concept being manipulated does not lie on a straight line in the model's representations?

The motivating question that opens the paper and leads to the development of manifold steering.

Source paper

extracted_from

Steering Along Manifolds to Control Neural Networks

Neighborhood — ranked by edge-count

Papers (1)

paper

Steering Along Manifolds to Control Neural Networks
introduces

Claims (1)

claim

Steering along manifolds provides better control than linear steering when the concept geometry is non-linear.
gates
The central thesis of the paper, motivating the shift from linear to geometry-aware manifold steering.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

We hypothesize that explicitly instructing the model to evaluate the correctness of the given statement may change the geometry of truth directions.
hypothesis · 0.799
Motivating hypothesis for Section 5's investigation of prompt template effects.
If steering in a purported concept direction does not shift self-report in the expected direction, probe quality becomes suspect, especially when conventional probe metrics alone looked acceptable.
quote · 0.789
Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics.
When it is not okay, how can we prevent divergent representations from occurring?
question · 0.786
Third core research question motivating the CL loss approach in Section 5.
Representation engineering successfully quantifies deception via high-accuracy steering vectors, establishing it as a measurable property of model representations.
claim · 0.786
Key interpretive claim that deception has a tractable geometric signature in activation space.
We hypothesize that representation geometry drives model behavior: the geometric structure of internal representations causally shapes what models do externally.
hypothesis · 0.786
The causal hypothesis motivating the use of causality (intervention) as the lens connecting representation and behavior geometry.
Truth may be linearly separable in the model's representation space, but the structure is richer than a single linear axis.
claim · 0.786
Interpretive synthesis of DIM and cone intervention successes.
There is a bidirectional relationship between the geometry of representation and behavior across tasks and modalities.
claim · 0.781
Author’s interpretive claim that the shared geometry is general and robust.
Given a language model M and a statement s, does M believe s to be true?
question · 0.780
The core motivating question of the paper, framed by Christiano et al. (2021).