finding
active
finding:steering-at-6-layers-strength-0-6-each-total-3-6-outperforms-single-layer-steering-at-equivalent-total-strength-for-type-hint-suppression
Steering at 6 layers (strength 0.6 each, total 3.6) outperforms single-layer steering at equivalent total strength for type hint suppression
Demonstrates that steering distributed across several layers is more effective, and damages accuracy less, than steering concentrated at a single layer.
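To make the "equivalent total strength" comparison concrete, here is a minimal sketch on a toy residual stream. It is not the paper's setup: the toy layers, the model width, the layer indices, and the helper `run_with_steering` are all hypothetical. Only the strengths (0.6 at each of 6 layers versus 3.6 at one layer) come from the finding.

```python
import numpy as np

def run_with_steering(x, layers, vector, strengths):
    """Run a toy residual stream, adding strengths[i] * vector after layer i."""
    h = x.copy()
    for i, layer in enumerate(layers):
        h = h + layer(h)                         # residual update from the toy layer
        h = h + strengths.get(i, 0.0) * vector   # steering at layer i, if configured
    return h

rng = np.random.default_rng(0)
d_model, n_layers = 16, 12                       # hypothetical toy sizes
weights = [rng.normal(scale=0.05, size=(d_model, d_model)) for _ in range(n_layers)]
layers = [lambda h, W=W: np.tanh(h @ W) for W in weights]

vector = rng.normal(size=d_model)
vector /= np.linalg.norm(vector)                 # unit-norm stand-in for a steering vector

steered_layers = [3, 4, 5, 6, 7, 8]              # hypothetical choice of 6 layers
distributed = {i: 0.6 for i in steered_layers}   # 0.6 at each of 6 layers
concentrated = {5: 3.6}                          # the same total at one layer

total_distributed = sum(distributed.values())
total_concentrated = sum(concentrated.values())

x = rng.normal(size=d_model)
h_dist = run_with_steering(x, layers, vector, distributed)
h_conc = run_with_steering(x, layers, vector, concentrated)
```

Both configurations inject the same total strength, but because the layers are nonlinear the final activations differ. The sketch only illustrates what is being compared; it says nothing about which configuration suppresses type hints better.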
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Claims (1)
claim
- Practical finding for optimizing steering setup.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Four best contrastive prompt pairs outperform full 16-pair average steering vector for type hint suppression · finding · 0.822 · Optimization result for steering vector construction.
- Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
- Validates using chain-of-thought belief monitoring as proxy for behavioral steering efficacy.
- Mechanism claim supported by transcript analysis and the fact that the steering vector was extracted from a model that never writes type hints.
- Argues against the single-layer analysis approach of prior work.
- Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
- Robustness check ruling out that any perturbation would decrease type hint rate due to brittleness.
- Steering with same vector on pre-fine-tuned Llama Nemotron has zero effect on type hint rate · finding · 0.767 · Falsifies hypothesis that steering vector encodes information about type hints; steering can only affect behavior via deployment-belief modification.
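One neighbor above notes that averaging multiple prompt pairs reduces noise. The sketch below illustrates that statistical point on synthetic data only. The shared direction, the noise scale, and the dimensions are assumptions for illustration, not values from the paper; only the count of 16 pairs comes from the entries above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_pairs, noise_scale = 64, 16, 0.5      # hypothetical sizes and noise level

# Hypothetical "true" steering direction shared by every contrastive pair
true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)

# Each pair's activation difference = shared direction + pair-specific noise
pair_diffs = true_direction + noise_scale * rng.normal(size=(n_pairs, d_model))

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Alignment of each single-pair vector with the shared direction
single_pair_cosines = [cosine(d, true_direction) for d in pair_diffs]

# Averaging the 16 differences shrinks the independent noise component
averaged_vector = pair_diffs.mean(axis=0)
averaged_cosine = cosine(averaged_vector, true_direction)
```

With independent noise, the averaged vector aligns more closely with the shared direction than a typical single-pair vector does. The sketch does not cover how a "best four" subset would be chosen, since the selection criterion is not stated here.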