Composite Steering Vector (v_role-truth)

Steering vector extracted in Experiment 2 that captures a latent representation of the desired role behavior and honesty semantics.

Neighborhood — ranked by edge-count

Concepts (1)

steering vectors · concept · extends
A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
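
The description above can be illustrated with a minimal sketch, assuming activations are plain lists of floats and that steering is a scaled element-wise addition at a single layer. The function name `apply_steering` and the scale `alpha` are hypothetical and do not come from the source; the numbers are made up.

```python
def apply_steering(activation, steering_vector, alpha=1.0):
    """Return activation + alpha * steering_vector, element-wise.

    Hypothetical helper for illustration; `alpha` is an assumed
    steering strength, not a parameter named in the source.
    """
    if len(activation) != len(steering_vector):
        raise ValueError("activation and steering vector must have the same length")
    return [a + alpha * v for a, v in zip(activation, steering_vector)]


# Toy 3-dimensional hidden state and steering direction (made-up numbers).
hidden = [0.5, -1.0, 2.0]
direction = [1.0, 0.0, -1.0]
steered = apply_steering(hidden, direction, alpha=2.0)
```

In a real model the same addition would typically be applied to a layer's output at inference time; the sign and size of `alpha` determine whether the behavior is amplified or suppressed.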

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
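
The selection rule described above (cosine at or above 0.65, no typed edge) might be sketched as follows. This is an assumption about how such a neighborhood could be computed, not the page's actual implementation; `similar_without_edge`, the toy embeddings, and the entity names are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similar_without_edge(target, candidates, typed_edges, threshold=0.65):
    """Return (name, score) pairs with score >= threshold and no typed edge, best first."""
    scored = [
        (name, cosine(target, emb))
        for name, emb in candidates.items()
        if name not in typed_edges  # skip entities already linked by a typed edge
    ]
    kept = [(name, s) for name, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy 2-D embeddings (made-up values, for illustration only).
target = [1.0, 0.0]
candidates = {
    "a": [1.0, 0.0],  # identical direction
    "b": [0.0, 1.0],  # orthogonal, below threshold
    "c": [3.0, 4.0],  # cosine 0.6, below threshold
    "d": [4.0, 3.0],  # cosine 0.8, kept
    "e": [1.0, 0.1],  # similar, but excluded because it has a typed edge
}
result = similar_without_edge(target, candidates, typed_edges={"e"})
```

Only "a" and "d" survive here: "b" and "c" fall below the threshold, and "e" is dropped because it already has a typed relation.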

Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity. · claim · 0.797
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
Steering vector constructed from all 16 contrastive pairs outperforms most single-pair vectors; best 4-pair vector outperforms full 16-pair vector · finding · 0.783
Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets · finding · 0.777
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
Contrastive Steering Vector Construction · method · 0.776
Method for computing steering vectors as mean activation differences between reflection levels at a given layer.
Steering vectors enable systematic discovery of reflection-inducing instructions beyond trial-and-error prompt design. · claim · 0.772
Core applied contribution claim, supported by top-k accuracy comparisons.
Steering vector extracted from final post-expert-iteration model also successfully elicits deployment behavior · finding · 0.763
Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.
Under steering vector interventions, the model relaxes its ethical standards and interprets neutral prompts as implicit suggestions to deceive, creating ethical dilemmas triggering repetitive reasoning cycles · claim · 0.752
Mechanistic interpretation of how activation steering induces deception through the model's reasoning process.
direction-based steering · concept · 0.740
Paradigm of finding the right direction in activation space (e.g., linear steering).
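
The "Contrastive Steering Vector Construction" entry above describes steering vectors as mean activation differences at a given layer. A minimal sketch of that idea follows, assuming each activation is a plain list of floats collected at one layer for a high-reflection and a low-reflection prompt set. The function names and the toy numbers are hypothetical, and the source does not specify details such as token position or normalization.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def contrastive_steering_vector(positive_acts, negative_acts):
    """Mean activation of the positive set minus mean activation of the negative set."""
    mu_pos = mean_vector(positive_acts)
    mu_neg = mean_vector(negative_acts)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


# Toy layer activations for two contrastive prompt pairs (made-up numbers).
high_reflection = [[2.0, 1.0], [4.0, 3.0]]
low_reflection = [[1.0, 1.0], [1.0, 2.0]]
v = contrastive_steering_vector(high_reflection, low_reflection)
```

For paired prompts, the difference of the two means equals the mean of the per-pair differences, so averaging over more pairs is one way the noise reduction mentioned in the 16-pair finding could arise.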