claim

active

claim:steering-vectors-capture-latent-dimensions-of-reflective-behavior-more-faithfully-than-surface-level-embedding-similarity

Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity.

Supported by the instruction discovery experiments, which compare steering-vector-based selection against an input embedding similarity baseline.

Source paper

extracted_from

Unveiling the Latent Directions of Reflection in Large Language Models

(2025) · Chang, Fu-Chieh · Lee, Yu-Ting · Wu, Pei-Yuan

Neighborhood — ranked by edge-count

Findings (2)

finding

Steering vector-based instruction discovery outperforms input embedding similarity baseline for reflection-inducing instruction selection
supports
Demonstrates that surface-level embedding similarity fails to capture reflective semantics.
Input embedding similarity baseline selects semantically related but non-reflective tokens (e.g., Await, ConfigureAwait, Unchecked) that fail to improve accuracy
supports
Demonstrates the failure mode of surface-level similarity for instruction discovery.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Steering vectors enable systematic discovery of reflection-inducing instructions beyond trial-and-error prompt design. · claim · 0.875
Core applied contribution claim, supported by top-k accuracy comparisons.
Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with previously reported reflection datasets. · finding · 0.842
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
Linear steering is often mismatched with a model's internal representation geometry, producing noisy, off-target effects. · claim · 0.820
The paper's critique of the standard linear steering baseline, supported by the days-of-week demo.
Some steering vectors produce more salient perturbations than others, perhaps based on shared semantic or qualitative factors. · claim · 0.818
Observation based on 100% accuracy for specific concept-layer-strength combinations, suggesting concept-specific detectability.
Under steering vector interventions, the model relaxes its ethical standards and interprets neutral prompts as implicit suggestions to deceive, creating ethical dilemmas that trigger repetitive reasoning cycles. · claim · 0.810
Mechanistic interpretation of how activation steering induces deception through the model's reasoning process.
Steering vector constructed from all 16 contrastive pairs outperforms most single-pair vectors; best 4-pair vector outperforms full 16-pair vector. · finding · 0.809
Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
Accuracy does not vary linearly with latent reflection directions; instead it follows a more non-linear mapping that requires deeper theoretical treatment. · claim · 0.806
Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.
Activation steering effectively biases latent representations but does not fully replicate the mechanisms triggered by explicit instruction. · claim · 0.806
Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
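
The "cosine ≥ 0.65 · no typed edge" filter behind the list above can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that the IDs of entities already linked by a typed edge are known. The function names, the dictionary layout, and the toy vectors are hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(query_vec, candidates, typed_edge_ids, threshold=0.65):
    """Return (entity_id, score) pairs, highest score first.

    candidates: dict mapping entity_id -> embedding vector (assumed layout).
    typed_edge_ids: set of entity_ids already connected by a typed edge;
        these are skipped, since they belong in the typed sections instead.
    threshold: minimum cosine similarity to keep (0.65 mirrors the page header).
    """
    hits = []
    for entity_id, vec in candidates.items():
        if entity_id in typed_edge_ids:
            continue
        score = cosine(query_vec, vec)
        if score >= threshold:
            hits.append((entity_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data, for illustration only.
toy_candidates = {
    "a": [1.0, 0.0],   # same direction as the query
    "b": [1.0, 1.0],   # 45 degrees away, cosine about 0.707
    "c": [0.0, 1.0],   # orthogonal, falls below the threshold
    "d": [2.0, 0.1],   # very similar, but already has a typed edge
}
toy_result = related_by_similarity([1.0, 0.0], toy_candidates, typed_edge_ids={"d"})
```

Excluding typed-edge entities before thresholding is what makes the remaining hits useful as candidates for new edges or as possible duplicates, rather than restating relations the graph already records.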