finding
active
finding:four-best-contrastive-prompt-pairs-outperform-full-16-pair-average-steering-vector-for-type-hint-suppression
Four best contrastive prompt pairs outperform full 16-pair average steering vector for type hint suppression
Optimization result for steering vector construction.
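As a rough illustration of what this finding compares, the sketch below builds a steering vector as the mean of per-pair activation differences, once over all pairs and once over a chosen subset. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the toy two-dimensional activations, and the subset indices are all hypothetical, and the source does not say how the best four pairs were selected or at which layer activations were read.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def pair_difference(pos: Sequence[float], neg: Sequence[float]) -> Vector:
    """Activation difference for one contrastive prompt pair (positive minus negative)."""
    return [p - n for p, n in zip(pos, neg)]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector to average")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def steering_vector(
    pos_acts: Sequence[Sequence[float]],
    neg_acts: Sequence[Sequence[float]],
    pair_indices: Optional[Sequence[int]] = None,
) -> Vector:
    """Average the per-pair differences over all pairs, or over a chosen subset."""
    if pair_indices is None:
        pair_indices = range(len(pos_acts))
    diffs = [pair_difference(pos_acts[i], neg_acts[i]) for i in pair_indices]
    return mean_vector(diffs)


# Toy activations (hypothetical): 4 contrastive pairs, hidden size 2.
pos_acts = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
neg_acts = [[0.0, 0.0], [1.0, 2.0], [0.0, 0.0], [2.0, 0.0]]

# Vector from every pair, analogous to the full 16-pair average.
full_vector = steering_vector(pos_acts, neg_acts)

# Vector from a hand-picked subset, analogous to the four best pairs.
subset_vector = steering_vector(pos_acts, neg_acts, pair_indices=[0, 1])
```

In practice the subset would be chosen by scoring each candidate vector on the downstream metric (here, type hint rate under steering), which this sketch leaves out.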
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
- Demonstrates distributed steering is more effective and less accuracy-damaging than concentrated steering.
- Robustness check ruling out that any perturbation would decrease type hint rate due to brittleness.
- Demonstrates steering is not equivalent to prompting with the contrastive prompts.
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Steering with same vector on pre-fine-tuned Llama Nemotron has zero effect on type hint rate · finding · 0.760 · Falsifies hypothesis that steering vector encodes information about type hints; steering can only affect behavior via deployment-belief modification.
- Demonstrates that alignment faking setup functions as an effective jailbreak
- Steering vectors from µ(0→2) slightly outperform µ(1→2) for instruction discovery across datasets and models · finding · 0.753 · Shows that contrasting No Reflection with Triggered Reflection provides a stronger signal than Intrinsic vs Triggered.