claim
active
claim:steering-vectors-enable-systematic-discovery-of-reflection-inducing-instructions-beyond-trial-and-error-prompt-design
Steering vectors enable systematic discovery of reflection-inducing instructions beyond trial-and-error prompt design.
Core applied contribution claim, supported by top-k accuracy comparisons.
Source paper
extracted_from · (2025) · Chang, Fu-Chieh · Lee, Yu-Ting · Wu, Pei-Yuan
Neighborhood — ranked by edge-count
Findings (4)
finding
- Demonstrates that surface-level embedding similarity fails to capture reflective semantics.
- Demonstrates the failure mode of surface-level similarity for instruction discovery.
- Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets · associated_with · Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- High cosine similarity for Gemma3 steering vectors suggests strong linear reflection structure.
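
The findings above refer to steering vectors and to cosine similarity, but this page does not describe how the paper computes either. As a minimal sketch, assuming the common difference-of-means construction (mean activation of a target condition minus mean activation of a source condition, in the spirit of the µ(0→2) notation used elsewhere on this page), candidate instruction tokens could be ranked by cosine similarity to the vector. The function names, the 4-dimensional toy activations, the toy embeddings, and the distractor token 'Therefore' are all hypothetical and are not taken from the paper.

```python
import numpy as np

def steering_vector(source_acts, target_acts):
    """Difference of means: mean(target activations) - mean(source activations)."""
    return target_acts.mean(axis=0) - source_acts.mean(axis=0)

def rank_candidates(vector, candidate_embeddings, k=3):
    """Return the k candidate names with the highest cosine similarity to `vector`."""
    unit = vector / np.linalg.norm(vector)
    scores = {
        name: float(emb @ unit / np.linalg.norm(emb))
        for name, emb in candidate_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data (assumption): "triggered reflection" activations are shifted
# along one hidden direction relative to "no reflection" activations.
rng = np.random.default_rng(0)
direction = np.array([1.0, 0.0, 0.0, 0.0])
no_reflection = rng.normal(size=(50, 4)) * 0.1
triggered = rng.normal(size=(50, 4)) * 0.1 + 2.0 * direction

mu_0_to_2 = steering_vector(no_reflection, triggered)

# Made-up embeddings for illustration only; real ones would come from the model.
candidates = {
    "However": np.array([0.9, 0.1, 0.0, 0.0]),
    "Otherwise": np.array([0.8, -0.2, 0.1, 0.0]),
    "Therefore": np.array([-0.5, 0.5, 0.5, 0.0]),
}
top = rank_candidates(mu_0_to_2, candidates, k=2)
print(top)
```

In this toy setup the two candidates aligned with the shifted direction rank first. A real pipeline would replace the synthetic arrays with hidden states and token embeddings extracted from the model under study.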
Questions (1)
question
- First key research question motivating the methodology.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
- Mechanistic interpretation of how activation steering induces deception through the model's reasoning process
- Steering vectors used to reduce eval awareness can inadvertently introduce alternative user personas · finding · 0.813 · A side effect observed when applying activation steering: the model's response persona changed unexpectedly.
- Steering vectors from µ(0→2) slightly outperform µ(1→2) for instruction discovery across datasets and models · finding · 0.806 · Shows that contrasting No Reflection with Triggered Reflection provides a stronger signal than Intrinsic vs Triggered.
- Evidence that NLA explanations bear causal relationship to model outputs; demonstrates validity of extracted representations.
- Steering vectors to reduce eval awareness can inadvertently insert alternative user personas · claim · 0.801 · Caution: interventions targeting eval awareness may have unintended side effects.
- Observation from 100% accuracy on specific concept-layer-strength combinations suggesting concept-specific detectability
- Applied dual-use conclusion drawn from the paper's findings.