finding
active
finding:steering-vectors-from-0-2-slightly-outperform-1-2-for-instruction-discovery-across-datasets-and-models
Steering vectors from µ(0→2) slightly outperform µ(1→2) for instruction discovery across datasets and models
Shows that contrasting No Reflection with Triggered Reflection, µ(0→2), gives a slightly stronger signal for instruction discovery than contrasting Intrinsic Reflection with Triggered Reflection, µ(1→2).
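To make the notation concrete, here is a minimal sketch of how two such contrast vectors could be formed. It assumes µ(a→b) denotes a difference of mean activations between reflection levels a and b, which this entry does not confirm. The helper names and the toy 2-dimensional activations are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def mean_vector(rows):
    """Average a list of equal-length activation vectors, column by column."""
    return [fmean(col) for col in zip(*rows)]


def steering_vector(source_acts, target_acts):
    """Difference of means: mean(target) - mean(source). Assumed definition."""
    src = mean_vector(source_acts)
    tgt = mean_vector(target_acts)
    return [t - s for s, t in zip(src, tgt)]


# Toy activations, invented for illustration (not data from the paper).
acts_none = [[0.0, 0.0], [2.0, 0.0]]        # level 0: No Reflection
acts_intrinsic = [[1.0, 1.0], [1.0, 3.0]]   # level 1: Intrinsic Reflection
acts_triggered = [[3.0, 4.0], [5.0, 4.0]]   # level 2: Triggered Reflection

mu_0_to_2 = steering_vector(acts_none, acts_triggered)
mu_1_to_2 = steering_vector(acts_intrinsic, acts_triggered)
```

Under this assumed definition the two vectors share the same target mean and differ only in the baseline that is subtracted, which is the choice the finding compares.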
Source paper
extracted_from (2025) · Chang, Fu-Chieh · Lee, Yu-Ting · Wu, Pei-Yuan
Neighborhood — ranked by edge-count
Claims (1)
claim
- Empirical interpretation of which reference baseline yields more useful steering vectors.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Demonstrates that surface-level embedding similarity fails to capture reflective semantics.
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Core applied contribution claim, supported by top-k accuracy comparisons.
- Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.
- Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
- Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
- Robustness check ruling out that any perturbation would decrease type hint rate due to brittleness.
- Observation from 100% accuracy on specific concept-layer-strength combinations, suggesting concept-specific detectability.