concept
active
concept:latent-direction-of-reflection

Latent Direction of Reflection

The paper's central construct: a vector in LLM activation space encoding the transition between reflection levels.

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Methods (1)

method
  • Causal intervention technique: edit the NLA explanation, reconstruct via the AR, and use the difference as a steering vector to manipulate model behavior.

Concepts (3)

concept
  • A direction in the model's representation space that governs self-reflection behavior, computed as the difference between the mean embedding of reflection examples and the mean embedding of non-reflection examples.
  • steering vectors
    associated_with
    A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce evaluation awareness.
  • Prior finding that LLM refusal is mediated by a single latent direction, analogous to this paper's reflection direction.
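
The difference-of-means computation and the steering step described in the concept entries above can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's implementation: the function names (difference_of_means, steer), the embedding width, the unit-norm normalization, and the strength alpha are all assumptions made for the example, and real activations would come from a chosen layer of the model.

```python
import numpy as np

def difference_of_means(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """Unit-norm direction pointing from the mean of `neg` to the mean of `pos`.

    Both inputs have shape (n_examples, d). Normalizing is an assumption of
    this sketch; it makes `alpha` below directly interpretable as a distance.
    """
    direction = pos.mean(axis=0) - neg.mean(axis=0)
    return direction / np.linalg.norm(direction)

def steer(activations: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add alpha * direction to every activation row (additive steering)."""
    return activations + alpha * direction

# Synthetic stand-ins for embeddings (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
d = 16
planted = np.zeros(d)
planted[0] = 1.0  # the "reflection" axis planted in the toy data

non_reflection = rng.normal(size=(200, d))
reflection = rng.normal(size=(200, d)) + 3.0 * planted

direction = difference_of_means(reflection, non_reflection)
steered = steer(non_reflection, direction, alpha=3.0)

# Mean projection onto the direction before and after steering.
before = float((non_reflection @ direction).mean())
after = float((steered @ direction).mean())
print(f"mean projection before: {before:.3f}, after: {after:.3f}")
```

Because the direction is normalized, steering with strength alpha shifts every projection onto it by exactly alpha. A negative alpha would push activations away from the reflection side instead.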

Artifacts (1)

artifact

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.