claim
active
claim:steering-vectors-capture-latent-dimensions-of-reflective-behavior-more-faithfully-than-surface-level-embedding-similarity
Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity.
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
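The contrast in this claim can be illustrated with a minimal sketch. It assumes a steering vector is built as a difference of mean activations between reflective and plain prompts, and that the embedding baseline ranks candidates by cosine similarity to a reference embedding. Both assumptions, along with every function name, candidate label, and number below, are hypothetical and chosen by hand; they are not the paper's method, data, or results.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(reflective_acts, plain_acts):
    """Assumed construction: mean reflective activation minus mean plain activation."""
    pos = mean_vector(reflective_acts)
    neg = mean_vector(plain_acts)
    return [p - q for p, q in zip(pos, neg)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def rank(candidates, score):
    """Sort candidate names by descending score of their vectors."""
    return sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)


# Toy activations (hypothetical): the third coordinate stands in for "reflection".
reflective_acts = [[1.0, 0.0, 2.0], [1.0, 0.2, 1.8]]
plain_acts = [[1.0, 0.0, 0.0], [1.0, 0.2, 0.2]]
direction = steering_vector(reflective_acts, plain_acts)

# Toy candidates (hypothetical), each with an activation and a separate text embedding.
candidate_acts = {"recheck": [0.9, 0.1, 1.5], "be brief": [1.0, 0.0, 0.1]}
candidate_embs = {"recheck": [0.2, 0.9], "be brief": [0.9, 0.3]}
reference_emb = [1.0, 0.2]

# Steering score: projection of the activation onto the steering direction.
by_steering = rank(candidate_acts, lambda v: dot(v, direction))
# Embedding baseline: cosine similarity to the reference embedding.
by_embedding = rank(candidate_embs, lambda v: cosine(v, reference_emb))
```

The toy numbers are picked so that the two rankings disagree, which only shows that the two scores measure different things; it says nothing about which one is more faithful in practice.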
Source paper
extracted_from · (2025) · Chang, Fu-Chieh · Lee, Yu-Ting · Wu, Pei-Yuan
Neighborhood — ranked by edge-count
Findings (2)
finding
- Demonstrates that surface-level embedding similarity fails to capture reflective semantics.
- Demonstrates the failure mode of surface-level similarity for instruction discovery.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Core applied contribution claim, supported by top-k accuracy comparisons.
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- The paper's critique of the standard linear steering baseline, supported by the days-of-week demo.
- Observation from 100% accuracy on specific concept-layer-strength combinations, suggesting concept-specific detectability.
- Mechanistic interpretation of how activation steering induces deception through the model's reasoning process.
- Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
- Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.
- Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
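
The "related by similarity" rule for this list (cosine ≥ 0.65, no typed edge) could be sketched roughly as follows. This is an illustrative guess at the selection logic, not the site's actual implementation; the function name, the data layout, and the toy vectors are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_candidates(entity_id, embeddings, typed_edges, threshold=0.65):
    """Entities with cosine >= threshold to entity_id and no typed edge to it.

    embeddings:  dict mapping entity id -> embedding vector (hypothetical layout)
    typed_edges: set of frozenset({id_a, id_b}) pairs that already share a typed relation
    """
    target = embeddings[entity_id]
    related = []
    for other_id, vector in embeddings.items():
        if other_id == entity_id:
            continue  # skip the entity itself
        if frozenset((entity_id, other_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target, vector)
        if score >= threshold:
            related.append((other_id, score))
    # Most similar first.
    return sorted(related, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical ids and vectors).
embeddings = {
    "claim": [1.0, 0.0],
    "a": [1.0, 0.1],   # close to the claim, no typed edge
    "b": [0.0, 1.0],   # orthogonal, below the threshold
    "c": [1.0, 0.0],   # identical direction, but already has a typed edge
}
typed_edges = {frozenset(("claim", "c"))}
neighbors = similarity_candidates("claim", embeddings, typed_edges)
```

In this toy setup only "a" survives both filters, which matches the intent described above: near neighbors that are not yet connected by a typed edge.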