claim

active

claim:steering-vectors-enable-systematic-discovery-of-reflection-inducing-instructions-beyond-trial-and-error-prompt-design

Steering vectors enable systematic discovery of reflection-inducing instructions beyond trial-and-error prompt design.

Core applied contribution claim, supported by top-k accuracy comparisons.

Source paper

extracted_from

Unveiling the Latent Directions of Reflection in Large Language Models

(2025) · Chang, Fu-Chieh · Lee, Yu-Ting · Wu, Pei-Yuan

Neighborhood — ranked by edge-count

Findings (4)

finding

Steering vector-based instruction discovery outperforms input embedding similarity baseline for reflection-inducing instruction selection
supports
Demonstrates that surface-level embedding similarity fails to capture reflective semantics.
Input embedding similarity baseline selects semantically related but non-reflective tokens (e.g., Await, ConfigureAwait, Unchecked) that fail to improve accuracy
supports
Demonstrates the failure mode of surface-level similarity for instruction discovery.
Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets
associated_with
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
Top-5 instructions by µ(1→2) at ℓ=12 achieve average cosine similarity 0.9893 and average accuracy 0.5645 on gsm8k_adv for Gemma3-4B-IT
supports
High cosine similarity for Gemma3 steering vectors suggests strong linear reflection structure.
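
The top-k selection referred to in these findings can be illustrated with a minimal sketch. It assumes each candidate instruction already has a vector in the same space as the steering vector; how the paper obtains those vectors is not stated on this page. The function name `top_k_by_cosine`, the labels `instr_a` to `instr_d`, and the toy 3-dimensional vectors are hypothetical and are not data from the paper.

```python
import numpy as np

def top_k_by_cosine(steering_vector, candidate_vectors, names, k=5):
    """Rank candidates by cosine similarity to a steering vector; return the top k."""
    # Normalize so that the dot product equals cosine similarity.
    s = steering_vector / np.linalg.norm(steering_vector)
    c = candidate_vectors / np.linalg.norm(candidate_vectors, axis=1, keepdims=True)
    sims = c @ s
    # Sort in descending order of similarity and keep the first k indices.
    order = np.argsort(-sims)[:k]
    return [(names[i], float(sims[i])) for i in order]

# Toy, hand-made vectors (assumption: purely illustrative, not model activations).
steering = np.array([1.0, 0.0, 0.0])
names = ["instr_a", "instr_b", "instr_c", "instr_d"]
candidates = np.array([
    [2.0, 0.0, 0.0],   # same direction as the steering vector
    [1.0, 1.0, 0.0],   # 45 degrees away
    [0.0, 1.0, 0.0],   # orthogonal
    [-1.0, 0.0, 0.0],  # opposite direction
])

ranked = top_k_by_cosine(steering, candidates, names, k=2)
for name, sim in ranked:
    print(f"{name}: {sim:.4f}")
```

In practice the candidate vectors would come from the model under study (for example, at a chosen layer ℓ), and the selected instructions would then be evaluated for accuracy on a held-out benchmark rather than trusted on similarity alone.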

Questions (1)

question

How can we systematically identify effective reflection trigger instructions, rather than relying on trial-and-error?
gates
First key research question motivating the methodology.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity. · claim · 0.875
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
Under steering vector interventions, the model relaxes its ethical standards and interprets neutral prompts as implicit suggestions to deceive, creating ethical dilemmas that trigger repetitive reasoning cycles. · claim · 0.818
Mechanistic interpretation of how activation steering induces deception through the model's reasoning process.
Steering vectors used to reduce eval awareness can inadvertently introduce alternative user personas · finding · 0.813
A side effect observed when applying activation steering: the model's response persona changed unexpectedly.
Steering vectors from µ(0→2) slightly outperform µ(1→2) for instruction discovery across datasets and models · finding · 0.806
Shows that contrasting No Reflection with Triggered Reflection provides a stronger signal than Intrinsic vs Triggered.
NLA-derived steering vectors from edited explanations can causally shift planning representations, changing rhyme completion from 'rabbit' to 'mouse' at ~50% success rate. · finding · 0.803
Evidence that NLA explanations bear causal relationship to model outputs; demonstrates validity of extracted representations.
Steering vectors to reduce eval awareness can inadvertently insert alternative user personas · claim · 0.801
Caution: interventions targeting eval awareness may have unintended side effects.
Some steering vectors produce more salient perturbations than others, perhaps based on shared semantic or qualitative factors · claim · 0.796
Observation based on 100% accuracy for specific concept-layer-strength combinations, suggesting concept-specific detectability.
Activation steering of reflection has dual-use implications: it can enhance reflection as a defense mechanism, but malicious actors may also use it to inhibit reflection to facilitate jailbreaks. · claim · 0.796
Applied dual-use conclusion drawn from the paper's findings.