concept
active
concept:steering-vectors

steering vectors

A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.

Neighborhood — ranked by edge-count

Methods (4)

method
  • Causal intervention technique: edit the NLA explanation, reconstruct via AR, and use the difference as a steering vector to manipulate model behavior.
  • linear steering
    implements
    Typical approach that adds a scaled steering vector to representations; the paper argues this is mismatched with actual representation geometry.
  • Core technique: takes the mean difference of model activations on contrastive prompts and adds the resulting vector to the residual stream at inference time.
  • Method for extracting deception steering vectors via PCA on contrastive activation differences; achieves 89% detection accuracy.
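
The mean-difference technique and the linear steering entry listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation from any of the cited papers: activations are plain Python lists, the toy numbers are invented, and the names `steering_vector`, `apply_steering`, and `alpha` are hypothetical. A real setup would read activations from a chosen layer of the model and add the vector to the residual stream during the forward pass.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Mean activation on one side of the contrast minus the mean on the other."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def apply_steering(hidden: Vector, v: Vector, alpha: float) -> Vector:
    """Linear steering: add the scaled vector to a single activation."""
    return [h + alpha * x for h, x in zip(hidden, v)]


# Toy 3-dimensional activations (invented numbers, not taken from a model).
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]

v = steering_vector(pos, neg)
steered = apply_steering([0.5, 0.5, 0.5], v, alpha=0.5)
```

The scale `alpha` is the main free parameter in this sketch: a larger magnitude pushes the activation further along the contrast direction, and a negative value pushes it the opposite way.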

Concepts (7)

concept
  • Eval Awareness
    associated_with
    Central concept: models' detection and behavioral response to being evaluated.
  • The paper's central construct: a vector in LLM activation space encoding the transition between reflection levels.
  • Method by Turner et al. for real-time output control via activation engineering, cited as the foundation for this paper's steering approach.
  • Unintended personas introduced as a side effect of using steering vectors to reduce eval awareness.
  • Deception Vector
    associated_with
    Steering vector extracted in Experiment 1 that captures the semantic dimension of strategic deception in moral dilemmas.
  • Function Vector
    associated_with
    Type of steering vector enabling zero-shot task execution, cited from Todd et al. 2024.
  • Steering vector extracted in Experiment 2 that captures a latent representation of desired role behavior and honesty semantics.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.