steering vectors

A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.

Neighborhood — ranked by edge-count

paper

method

Activation Steering
implements
Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
linear steering
implements
Typical approach that adds a scaled steering vector to representations; the paper argues this is mismatched with actual representation geometry.
Contrastive Activation Steering
associated_with
Core technique: takes mean difference of model activations on contrastive prompts and adds the resulting vector to the residual stream at inference time.
Linear Artificial Tomography (LAT)
introduces
Method for extracting deception steering vectors via PCA on contrastive activation differences; achieves 89% detection accuracy
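
The contrastive and linear steering entries above describe a two-step recipe: take the mean difference of activations on contrastive prompts, then add a scaled copy of that vector to a representation. A minimal sketch follows, using plain Python lists in place of real model activations. The function names, the toy activation values, and the scale `alpha` are illustrative assumptions and do not come from any of the cited papers.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / n for i in range(dim)]


def contrastive_steering_vector(
    positive: Sequence[Sequence[float]],
    negative: Sequence[Sequence[float]],
) -> Vector:
    """Mean activation on 'positive' prompts minus mean on 'negative' prompts."""
    mean_pos = mean_vector(positive)
    mean_neg = mean_vector(negative)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def apply_linear_steering(hidden: Sequence[float], steer: Sequence[float], alpha: float) -> Vector:
    """Linear steering: add the steering vector, scaled by alpha, to a hidden state."""
    return [h + alpha * s for h, s in zip(hidden, steer)]


# Toy activations (hypothetical): two prompts per side, three dimensions each.
positive_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
negative_acts = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

steer = contrastive_steering_vector(positive_acts, negative_acts)
steered = apply_linear_steering([0.5, 0.5, 0.5], steer, alpha=2.0)

print(steer)    # [2.0, 0.0, 0.0]
print(steered)  # [4.5, 0.5, 0.5]
```

In a real model the hidden state would be the residual stream at a chosen layer and the addition would happen at inference time; choosing the layer and the scale is an empirical matter that this sketch does not address.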

concept

Eval Awareness
associated_with
Central concept: models' detection and behavioral response to being evaluated.
Latent Direction of Reflection
associated_with
The paper's central construct: a vector in LLM activation space encoding the transition between reflection levels.
ActAdd (Activation Addition)
implements
Method by Turner et al. for real-time output control via activation engineering, cited as foundation for this paper's steering approach
alternative user personas
associated_with
Unintended personas introduced as a side effect of using steering vectors to reduce eval awareness.
Deception Vector
associated_with
Steering vector extracted in Experiment 1 that captures the semantic dimension of strategic deception in moral dilemmas.
Function Vector
associated_with
Type of steering vector enabling zero-shot task execution, cited from Todd et al. 2024
Composite Steering Vector (v_role-truth)
extends
Steering vector extracted in Experiment 2 that captures the latent representation of desired role behavior and honesty semantics.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sense Vectors · concept · 0.813
Vectors acquired during pretraining in Backpack LMs that have a multiplication effect on model generation
Contrastive Steering Vector Construction · method · 0.808
Method for computing steering vectors as mean activation differences between reflection levels at a given layer.
Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets · finding · 0.805
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
direction-based steering · concept · 0.805
Paradigm of finding the right direction in activation space (e.g., linear steering).
Model Steering · concept · 0.800
Using interventions to guide model generation behavior, e.g., adding sentiment vectors at inference time
Some steering vectors produce more salient perturbations than others, perhaps based on shared semantic or qualitative factors · claim · 0.799
Observation from 100% accuracy on specific concept-layer-strength combinations suggesting concept-specific detectability
Steering vector constructed from all 16 contrastive pairs outperforms most single-pair vectors; best 4-pair vector outperforms full 16-pair vector · finding · 0.794
Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
geometry-based steering · concept · 0.787
Paradigm of finding the right geometry (manifold) for principled control.
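
The multi-pair finding listed above attributes the gain from averaging to noise reduction. The toy sketch below illustrates that intuition only: each per-pair difference vector is a shared direction plus pair-specific noise, and the averaged vector aligns more closely with the shared direction than any single pair does. The `reference` direction and the noise values are hypothetical; in practice no such reference is known and quality would be judged by downstream steering performance. The sketch also does not model the subset-selection result, where the best 4-pair vector beats the full 16-pair vector.

```python
import math
from typing import List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / n for i in range(dim)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical shared direction that every contrastive pair partly reflects.
reference = [1.0, 0.0]

# Per-pair difference vectors: the shared direction plus pair-specific noise
# in the second coordinate (values chosen so the noise cancels on average).
pair_diffs = [[1.0, 0.5], [1.0, -0.5], [1.0, 0.25], [1.0, -0.25]]

averaged = mean_vector(pair_diffs)
single_scores = [cosine(d, reference) for d in pair_diffs]
averaged_score = cosine(averaged, reference)

print(averaged)  # [1.0, 0.0]
print(averaged_score > max(single_scores))  # True
```

The noise here cancels exactly by construction; with real activations the cancellation is only partial, which is one plausible reason a carefully chosen subset of pairs could do better than the full set.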