finding

active

finding:steering-vectors-from-0-2-slightly-outperform-1-2-for-instruction-discovery-across-datasets-and-models

Steering vectors from µ(0→2) slightly outperform µ(1→2) for instruction discovery across datasets and models

Shows that contrasting No Reflection with Triggered Reflection, µ(0→2), provides a stronger signal than contrasting Intrinsic Reflection with Triggered Reflection, µ(1→2).

Source paper

extracted_from

Unveiling the Latent Directions of Reflection in Large Language Models

(2025) · Chang, Fu-Chieh · Lee, Yu-Ting · Wu, Pei-Yuan

Neighborhood — ranked by edge-count

Claims (1)

claim

Contrasting No Reflection with Triggered Reflection (µ(0→2)) provides a stronger reflection signal than contrasting Intrinsic with Triggered Reflection (µ(1→2)).
supports
Empirical interpretation of which reference baseline yields more useful steering vectors.
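
To make the two contrasts concrete, here is a minimal sketch. It assumes µ(a→b) denotes a difference-of-means steering vector (mean activation at reflection level b minus mean activation at level a); this record does not spell out the paper's exact construction. The function name `steering_vector`, the array names, the dimensionality, and the synthetic activations are all hypothetical and only illustrate the bookkeeping. They are not data or results from the paper.

```python
import numpy as np


def steering_vector(source_acts: np.ndarray, target_acts: np.ndarray) -> np.ndarray:
    """Difference of means: mean(target) - mean(source), one row per example.

    Assumption: this is a common way to build steering vectors; the paper's
    exact construction may differ.
    """
    return target_acts.mean(axis=0) - source_acts.mean(axis=0)


# Synthetic stand-ins for hidden activations (NOT real model data).
rng = np.random.default_rng(0)
dim = 8
direction = np.zeros(dim)
direction[0] = 1.0  # hypothetical "reflection" direction

level0 = rng.normal(size=(200, dim))                    # No Reflection
level1 = rng.normal(size=(200, dim)) + 1.0 * direction  # Intrinsic Reflection
level2 = rng.normal(size=(200, dim)) + 3.0 * direction  # Triggered Reflection

mu_0_2 = steering_vector(level0, level2)  # contrast: No Reflection -> Triggered
mu_1_2 = steering_vector(level1, level2)  # contrast: Intrinsic -> Triggered

print("norm of mu(0->2):", np.linalg.norm(mu_0_2))
print("norm of mu(1->2):", np.linalg.norm(mu_1_2))
```

In this toy setup the level-0 baseline sits further from level 2 than the level-1 baseline does, so the µ(0→2) contrast has a larger component along the planted direction. That is a property of how the synthetic data was built and should not be read as evidence for the finding itself.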

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Steering vector-based instruction discovery outperforms input embedding similarity baseline for reflection-inducing instruction selection · finding · 0.830
Demonstrates that surface-level embedding similarity fails to capture reflective semantics.
Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets · finding · 0.811
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
Steering vectors enable systematic discovery of reflection-inducing instructions beyond trial-and-error prompt design. · claim · 0.806
Core applied contribution claim, supported by top-k accuracy comparisons.
Steering vector extracted from final post-expert-iteration model also successfully elicits deployment behavior · finding · 0.791
Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.
Steering vector constructed from all 16 contrastive pairs outperforms most single-pair vectors; best 4-pair vector outperforms full 16-pair vector · finding · 0.785
Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity. · claim · 0.780
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
24 Gaussian noise vectors (matched mean/variance/ℓ2 norm) do not decrease type hint rate nearly as much as deployment steering vector · finding · 0.778
Robustness check ruling out that any perturbation would decrease type hint rate due to brittleness.
Some steering vectors produce more salient perturbations than others, perhaps based on shared semantic or qualitative factors · claim · 0.773
Observation from 100% accuracy on specific concept-layer-strength combinations, suggesting concept-specific detectability.