finding

active

finding:steering-vectors-discover-effective-triggers-such-as-however-and-otherwise-consistent-with-prior-reported-reflection-datasets

Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with previously reported reflection datasets

Validates that steering vectors capture reflection semantics: the triggers they surface match tokens reported in related work.

Source paper

extracted_from

Unveiling the Latent Directions of Reflection in Large Language Models

(2025) · Chang, Fu-Chieh · Lee, Yu-Ting · Wu, Pei-Yuan

Neighborhood — ranked by edge-count

Claims (1)

claim

Steering vectors enable systematic discovery of reflection-inducing instructions beyond trial-and-error prompt design.
associated_with
Core applied contribution claim, supported by top-k accuracy comparisons.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity. · claim · 0.842
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
Some steering vectors produce more salient perturbations than others, perhaps based on shared semantic or qualitative factors · claim · 0.838
Observation drawn from 100% accuracy on specific concept-layer-strength combinations, which suggests concept-specific detectability.
Under steering vector interventions, the model relaxes its ethical standards and interprets neutral prompts as implicit suggestions to deceive, creating ethical dilemmas triggering repetitive reasoning cycles · claim · 0.825
Mechanistic interpretation of how activation steering induces deception through the model's reasoning process
Steering vectors used to reduce eval awareness can inadvertently introduce alternative user personas · finding · 0.819
A side effect observed when applying activation steering: the model's response persona changed unexpectedly.
Steering vector constructed from all 16 contrastive pairs outperforms most single-pair vectors; best 4-pair vector outperforms full 16-pair vector · finding · 0.819
Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
What shared semantic or qualitative factor explains why some steering vectors produce more salient and detectable perturbations than others? · question · 0.818
Open question arising from the 100% accuracy on specific concept-layer-strength combinations
Steering vectors from µ(0→2) slightly outperform µ(1→2) for instruction discovery across datasets and models · finding · 0.811
Shows that contrasting No Reflection with Triggered Reflection provides a stronger signal than Intrinsic vs Triggered.
Steering vector extracted from final post-expert-iteration model also successfully elicits deployment behavior · finding · 0.806
Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.