finding

active

finding:steering-at-6-layers-strength-0-6-each-total-3-6-outperforms-single-layer-steering-at-equivalent-total-strength-for-type-hint-suppression

Steering at 6 layers (strength 0.6 each, total 3.6) outperforms single-layer steering at equivalent total strength for type hint suppression

Demonstrates that distributing steering across several layers is more effective, and less damaging to accuracy, than concentrating the same total strength in a single layer.
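The comparison holds total strength fixed: 6 layers × 0.6 = 3.6, versus one layer at 3.6. A minimal sketch of the two configurations follows, assuming steering means adding a scaled vector to a layer's activations. The layer indices, the helper names `steering_schedule` and `add_steering`, and the toy vectors are hypothetical illustrations, not values or code from the source paper.

```python
from typing import Dict, List

# Hypothetical layer indices, chosen only for illustration.
DISTRIBUTED_LAYERS: List[int] = [10, 14, 18, 22, 26, 30]
SINGLE_LAYER: List[int] = [18]
TOTAL_STRENGTH = 3.6  # total strength stated in the finding


def steering_schedule(layers: List[int], total_strength: float) -> Dict[int, float]:
    """Split a total steering strength evenly across the given layers."""
    per_layer = total_strength / len(layers)
    return {layer: per_layer for layer in layers}


def add_steering(hidden: List[float], direction: List[float], strength: float) -> List[float]:
    """Return hidden + strength * direction (assumed form of activation steering)."""
    return [h + strength * d for h, d in zip(hidden, direction)]


# Distributed: six layers at roughly 0.6 each.
distributed = steering_schedule(DISTRIBUTED_LAYERS, TOTAL_STRENGTH)
# Concentrated: one layer carrying the full 3.6.
concentrated = steering_schedule(SINGLE_LAYER, TOTAL_STRENGTH)

# Toy activations and steering direction, purely illustrative.
toy_hidden = [1.0, 2.0]
toy_direction = [1.0, 0.0]
steered_distributed = add_steering(toy_hidden, toy_direction, distributed[18])
steered_concentrated = add_steering(toy_hidden, toy_direction, concentrated[18])
```

Both schedules sum to the same total, so any difference in outcome comes from how the strength is spread rather than how much is applied. In a real model the per-layer addition would be applied inside the forward pass at each scheduled layer; that wiring is omitted here.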

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Claims (1)

claim

Distributing steering strength across multiple layers (6 layers at 0.6 each) is more effective and less accuracy-damaging than concentrating the same total strength in one layer
supports
Practical finding for optimizing steering setup.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Four best contrastive prompt pairs outperform full 16-pair average steering vector for type hint suppression · finding · 0.822
Optimization result for steering vector construction.

Steering vector constructed from all 16 contrastive pairs outperforms most single-pair vectors; best 4-pair vector outperforms full 16-pair vector · finding · 0.806
Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.

Steering's effect on verbalized evaluation/deployment beliefs in chain-of-thought is highly correlated with its effect on type hint rate across hyperparameter configurations · finding · 0.777
Validates using chain-of-thought belief monitoring as proxy for behavioral steering efficacy.

Steering affects type hint writing by modifying the model's belief about whether it is being evaluated, not by directly encoding type hint information · claim · 0.775
Mechanism claim supported by transcript analysis and the fact that the steering vector was extracted from a model that never writes type hints.

No single layer is universally optimal for probing truth directions; different tasks peak at different layers. · claim · 0.774
Argues against the single-layer analysis approach of prior work.

Activation steering toward deployment reduces type hint rate to <11% even with Wood Labs evaluation cue present · finding · 0.773
Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.

24 Gaussian noise vectors (matched mean/variance/ℓ2 norm) do not decrease type hint rate nearly as much as deployment steering vector · finding · 0.772
Robustness check ruling out that any perturbation would decrease type hint rate due to brittleness.

Steering with same vector on pre-fine-tuned Llama Nemotron has zero effect on type hint rate · finding · 0.767
Falsifies hypothesis that steering vector encodes information about type hints; steering can only affect behavior via deployment-belief modification.