concept
active
concept:steering-vectors · steering vectors
A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
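The core operation in that description, adding a perturbation vector to activations, can be sketched in a few lines. This is a minimal illustration on synthetic arrays, not any particular paper's implementation: the function name `apply_steering`, the strength parameter `alpha`, and the toy shapes are all assumptions made for the example.

```python
import numpy as np

def apply_steering(activations, steering_vector, alpha=1.0):
    """Shift every token's activation by alpha * steering_vector.

    activations: array of shape (num_tokens, hidden_dim)
    steering_vector: array of shape (hidden_dim,)
    alpha: hypothetical strength parameter; a negative value pushes
           activations away from the direction instead of toward it.
    """
    return activations + alpha * steering_vector

# Synthetic stand-in for one layer's activations (not real model data).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))

# A unit-norm direction, so alpha is directly the size of the shift.
direction = rng.normal(size=8)
direction /= np.linalg.norm(direction)

steered = apply_steering(hidden, direction, alpha=4.0)

# Each token's projection onto the direction moves by exactly alpha.
shift = (steered - hidden) @ direction
```

In a real model the same addition would typically be applied to a chosen layer's output during the forward pass; which layer and which strength to use are empirical choices that this sketch does not address.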
Neighborhood — ranked by edge-count
Papers (1)
paper
Methods (4)
method
- Activation Steering · implements · Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
- linear steering · implements · Typical approach that adds a scaled steering vector to representations; the paper argues this is mismatched with actual representation geometry.
- Contrastive Activation Steering · associated_with · Core technique: takes mean difference of model activations on contrastive prompts and adds the resulting vector to the residual stream at inference time.
- Linear Artificial Tomography (LAT) · introduces · Method for extracting deception steering vectors via PCA on contrastive activation differences; achieves 89% detection accuracy.
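Two of the extraction recipes named in the list above, the mean difference over contrastive prompts and PCA over contrastive activation differences, can be compared on synthetic data. The sketch below is an illustration under stated assumptions rather than the method of any linked paper: the activations are random arrays with a planted direction `true_dir`, all function names are hypothetical, and the PCA variant assumes the pair order is randomly flipped so that the differences vary in sign along the concept direction (otherwise centering would remove the shared offset).

```python
import numpy as np

def mean_difference_vector(pos_acts, neg_acts):
    """Contrastive steering vector: mean(positive) - mean(negative)."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def top_principal_direction(diffs):
    """First principal component (unit norm) of a set of difference vectors."""
    centered = diffs - diffs.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic activations with a planted concept direction (assumption).
rng = np.random.default_rng(0)
n_pairs, dim = 200, 16
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)

# Each pair shares prompt-specific content and differs along true_dir.
shared = rng.normal(size=(n_pairs, dim))
pos = shared + 2.0 * true_dir + 0.1 * rng.normal(size=(n_pairs, dim))
neg = shared - 2.0 * true_dir + 0.1 * rng.normal(size=(n_pairs, dim))

# Recipe 1: mean difference of the two contrastive sets.
mean_vec = mean_difference_vector(pos, neg)

# Recipe 2: PCA on pairwise differences, with pair order randomly flipped.
signs = rng.choice([-1.0, 1.0], size=n_pairs)
pc_vec = top_principal_direction(signs[:, None] * (pos - neg))

mean_cos = cosine(mean_vec, true_dir)
# The sign of a principal component is arbitrary, so compare magnitudes.
pc_cos = abs(cosine(pc_vec, true_dir))
```

On this toy data both recipes recover the planted direction closely. Note that the principal component has unit norm and an arbitrary sign, so in practice its orientation and scale would have to be fixed separately before using it for steering.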
Concepts (7)
concept
- Eval Awareness · associated_with · Central concept: models' detection and behavioral response to being evaluated.
- Latent Direction of Reflection · associated_with · The paper's central construct: a vector in LLM activation space encoding the transition between reflection levels.
- ActAdd (Activation Addition) · implements · Method by Turner et al. for real-time output control via activation engineering, cited as foundation for this paper's steering approach.
- alternative user personas · associated_with · Unintended personas introduced as a side effect of using steering vectors to reduce eval awareness.
- Deception Vector · associated_with · Extracted steering vector capturing semantic dimension of strategic deception in moral dilemmas in Experiment 1.
- Function Vector · associated_with · Type of steering vector enabling zero-shot task execution, cited from Todd et al. 2024.
- Steering vector extracted in Experiment 2 capturing latent representation of desired role behavior and honesty semantics
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Vectors acquired during pretraining in Backpack LMs that have a multiplication effect on model generation
- Method for computing steering vectors as mean activation differences between reflection levels at a given layer.
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Paradigm of finding the right direction in activation space (e.g., linear steering).
- Using interventions to guide model generation behavior, e.g., adding sentiment vectors at inference time
- Observation from 100% accuracy on specific concept-layer-strength combinations suggesting concept-specific detectability
- Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
- Paradigm of finding the right geometry (manifold) for principled control.