finding
active
finding:steering-vectors-discover-effective-triggers-such-as-however-and-otherwise-consistent-with-prior-reported-reflection-datasets
Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets
Validates that steering vectors capture reflection semantics: the trigger tokens they surface match those reported in related work.
Source paper
extracted_from · (2025) · Chang, Fu-Chieh · Lee, Yu-Ting · Wu, Pei-Yuan
Neighborhood — ranked by edge-count
Claims (1)
claim
- Steering vectors enable systematic discovery of reflection-inducing instructions beyond trial-and-error prompt design. · associated_with · Core applied contribution claim, supported by top-k accuracy comparisons.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
- Observation from 100% accuracy on specific concept-layer-strength combinations suggesting concept-specific detectability
- Mechanistic interpretation of how activation steering induces deception through the model's reasoning process
- Steering vectors used to reduce eval awareness can inadvertently introduce alternative user personas · finding · 0.819 · A side effect observed when applying activation steering: the model's response persona changed unexpectedly.
- Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
- Open question arising from the 100% accuracy on specific concept-layer-strength combinations
- Steering vectors from µ(0→2) slightly outperform µ(1→2) for instruction discovery across datasets and models · finding · 0.811 · Shows that contrasting No Reflection with Triggered Reflection provides a stronger signal than Intrinsic vs Triggered.
- Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.
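
The similarity list above is described as entities with cosine ≥ 0.65 and no typed edge to this one. The following is a minimal sketch of how such a filter could work, not the system's actual implementation. It assumes each entity has an embedding vector and that typed edges are stored as a set of id pairs. The function name `related_by_similarity`, the entity ids, and the toy vectors are all hypothetical; only the 0.65 threshold and the "no typed edge" condition come from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs, highest score first, for entities that
    meet the cosine threshold but share no typed edge with the target."""
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip anything already linked by a typed relation, in either direction.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data, for illustration only.
embeddings = {
    "finding:a": [1.0, 0.0],
    "finding:b": [0.8, 0.6],  # similar to finding:a, no typed edge
    "finding:c": [0.0, 1.0],  # orthogonal to finding:a, below threshold
    "claim:d": [1.0, 0.1],    # similar, but already linked by a typed edge
}
typed_edges = {("finding:a", "claim:d")}

result = related_by_similarity("finding:a", embeddings, typed_edges)
```

In this toy setup only `finding:b` is returned: `claim:d` is excluded because it already has a typed relation, and `finding:c` falls below the threshold.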