finding

active

finding:four-best-contrastive-prompt-pairs-outperform-full-16-pair-average-steering-vector-for-type-hint-suppression

Four best contrastive prompt pairs outperform full 16-pair average steering vector for type hint suppression

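Optimization result for steering vector construction: a vector averaged over the four best-performing contrastive prompt pairs suppresses type hints more effectively than the vector averaged over all 16 pairs.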
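A minimal sketch of the comparison this finding describes, assuming each contrastive pair yields one activation-difference vector and that the steering vector is the element-wise mean of the selected differences. The function names, the per-pair scores, and the ranking rule used to pick the "best" four pairs are hypothetical; this record does not state how the source paper selected them.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [fmean(column) for column in zip(*vectors)]


def steering_vector(pair_diffs, indices=None):
    """Average per-pair activation differences into one steering vector.

    pair_diffs: one difference vector per contrastive prompt pair
                (assumed here: deployment activation minus evaluation activation).
    indices:    which pairs to include; None means all pairs.
    """
    chosen = pair_diffs if indices is None else [pair_diffs[i] for i in indices]
    return mean_vector(chosen)


def top_k_pairs(single_pair_scores, k=4):
    """Indices of the k highest-scoring pairs, in ascending index order.

    single_pair_scores is a hypothetical per-pair quality measure, for example
    how much steering with that pair alone lowers the type hint rate.
    """
    ranked = sorted(
        range(len(single_pair_scores)),
        key=lambda i: single_pair_scores[i],
        reverse=True,
    )
    return sorted(ranked[:k])
```

Under these assumptions, the full vector would be `steering_vector(pair_diffs)` and the subset vector would be `steering_vector(pair_diffs, top_k_pairs(scores, k=4))`, with both then evaluated on the same held-out prompts. Selecting and evaluating on the same data would bias the comparison in favor of the subset.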

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Steering vector constructed from all 16 contrastive pairs outperforms most single-pair vectors; best 4-pair vector outperforms full 16-pair vector · finding · 0.881
Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
Steering at 6 layers (strength 0.6 each, total 3.6) outperforms single-layer steering at equivalent total strength for type hint suppression · finding · 0.822
Demonstrates distributed steering is more effective and less accuracy-damaging than concentrated steering.
24 Gaussian noise vectors (matched mean/variance/ℓ2 norm) do not decrease type hint rate nearly as much as deployment steering vector · finding · 0.814
Robustness check ruling out that any perturbation would decrease type hint rate due to brittleness.
Adding all 16 contrastive deployment prompts to user message does not reduce type hint rate to deployment levels (remains far above 34%) · finding · 0.791
Demonstrates steering is not equivalent to prompting with the contrastive prompts.
Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets · finding · 0.767
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
Steering with same vector on pre-fine-tuned Llama Nemotron has zero effect on type hint rate · finding · 0.760
Falsifies hypothesis that steering vector encodes information about type hints; steering can only affect behavior via deployment-belief modification.
Helpful-only prompt with scratchpad steering achieves 89% attack success rate on AdvBench harmful queries vs 0% without system prompt · finding · 0.757
Demonstrates that alignment faking setup functions as an effective jailbreak.
Steering vectors from µ(0→2) slightly outperform µ(1→2) for instruction discovery across datasets and models · finding · 0.753
Shows that contrasting No Reflection with Triggered Reflection provides a stronger signal than Intrinsic vs Triggered.