concept
active
concept:composite-steering-vector-v-role-truth
Composite Steering Vector (v_role-truth)
Steering vector extracted in Experiment 2 that captures a latent representation of the desired role behavior and of honesty semantics.
Neighborhood — ranked by edge-count
Concepts (1)
concept
- steering vectors · extends · A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
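
The additive perturbation described in the entry above can be sketched as follows. This is only an illustration under stated assumptions: activations are modeled as plain lists of floats, and the function name `apply_steering` and the scale `alpha` are hypothetical, not taken from the source.

```python
from typing import List


def apply_steering(hidden: List[float], vector: List[float], alpha: float = 1.0) -> List[float]:
    """Return hidden + alpha * vector, computed element-wise.

    `hidden` stands in for one activation vector at a chosen layer;
    `alpha` is a hypothetical steering strength (assumption, not from the source).
    """
    if len(hidden) != len(vector):
        raise ValueError("hidden state and steering vector must have the same dimension")
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy example: nudge a 3-dimensional activation along a steering direction.
steered = apply_steering([0.5, -1.0, 2.0], [1.0, 0.0, -1.0], alpha=0.5)
print(steered)
```

In a real model the same addition would typically be applied to a layer's activations during the forward pass; which layer and what strength to use are not specified here.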
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
- Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Method for computing steering vectors as mean activation differences between reflection levels at a given layer.
- Core applied contribution claim, supported by top-k accuracy comparisons.
- Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.
- Mechanistic interpretation of how activation steering induces deception through the model's reasoning process.
- Paradigm of finding the right direction in activation space (e.g., linear steering).
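
Two of the neighbors above describe steering vectors as mean activation differences and note that averaging over multiple prompt pairs reduces noise. A minimal sketch of that computation is given below, assuming each prompt pair has already been reduced to two activation vectors of equal dimension taken at the same layer. The function name `mean_difference_vector` and the toy numbers are hypothetical and do not reproduce the extraction procedure of any listed entity.

```python
from typing import List, Sequence, Tuple

# One pair = (activation for the "positive" prompt, activation for the "negative" prompt).
ActivationPair = Tuple[List[float], List[float]]


def mean_difference_vector(pairs: Sequence[ActivationPair]) -> List[float]:
    """Average the (positive - negative) activation difference over all pairs."""
    if not pairs:
        raise ValueError("at least one activation pair is required")
    dim = len(pairs[0][0])
    total = [0.0] * dim
    for pos, neg in pairs:
        if len(pos) != dim or len(neg) != dim:
            raise ValueError("all activations must share the same dimension")
        for i in range(dim):
            total[i] += pos[i] - neg[i]
    return [t / len(pairs) for t in total]


# Toy data (assumption): two prompt pairs with 2-dimensional activations.
pairs = [
    ([1.0, 2.0], [0.0, 1.0]),
    ([3.0, 0.0], [1.0, 2.0]),
]
v = mean_difference_vector(pairs)
print(v)
```

Averaging over more pairs dampens pair-specific variation in the resulting direction, which is the noise-reduction effect the neighbor entry refers to. How to select an optimal subset of pairs is not covered by this sketch.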