claim

active

claim:distributing-steering-strength-across-multiple-layers-6-layers-at-0-6-each-is-more-effective-and-less-accuracy-damaging-than-concentrating-the-same-total-strength-in-one-layer

Distributing steering strength across multiple layers (6 layers at 0.6 each) is more effective and less accuracy-damaging than concentrating the same total strength in one layer

Practical finding for configuring activation steering: how to allocate steering strength across layers.

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Findings (1)

finding

Steering at 6 layers (strength 0.6 each, total 3.6) outperforms single-layer steering at equivalent total strength for type hint suppression
supports
Demonstrates that distributed steering is more effective and less accuracy-damaging than concentrated steering.
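The bookkeeping behind "equivalent total strength" can be illustrated with a small sketch. It is not the paper's implementation: the layer count, the layer indices, the vector dimension, and the helper names `strength_schedule` and `apply_steering` are all hypothetical, and each layer is treated as an independent row rather than as part of a real forward pass, where a steered layer also affects every later layer. The sketch only shows that both configurations sum to 3.6 while the largest per-layer perturbation differs (0.6 versus 3.6).

```python
import numpy as np


def strength_schedule(num_layers, steer_layers, per_layer_strength):
    """Return a per-layer strength array that is nonzero only at steered layers."""
    schedule = np.zeros(num_layers)
    schedule[list(steer_layers)] = per_layer_strength
    return schedule


def apply_steering(hidden_states, steering_vector, schedule):
    """Add schedule[l] * steering_vector to row l of hidden_states.

    hidden_states: (num_layers, d_model), steering_vector: (d_model,).
    Toy model only: layers are independent rows, with no propagation.
    """
    return hidden_states + schedule[:, None] * steering_vector[None, :]


# Hypothetical toy sizes and layer choices (not taken from the source).
NUM_LAYERS, D_MODEL = 32, 8
rng = np.random.default_rng(0)
steering_vector = rng.normal(size=D_MODEL)
steering_vector /= np.linalg.norm(steering_vector)  # unit-norm direction

# Distributed: 6 layers at 0.6 each. Concentrated: 1 layer at 3.6.
distributed = strength_schedule(NUM_LAYERS, range(10, 16), 0.6)
concentrated = strength_schedule(NUM_LAYERS, [12], 3.6)

hidden = np.zeros((NUM_LAYERS, D_MODEL))
delta_dist = apply_steering(hidden, steering_vector, distributed) - hidden
delta_conc = apply_steering(hidden, steering_vector, concentrated) - hidden

# Same total strength, different largest per-layer perturbation norm.
total_dist, total_conc = distributed.sum(), concentrated.sum()
max_norm_dist = np.linalg.norm(delta_dist, axis=1).max()
max_norm_conc = np.linalg.norm(delta_conc, axis=1).max()
```

In a real model the addition would typically be done inside the forward pass (for example with per-layer hooks), so the schedule above is only the configuration, not the effect.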

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Steering vector constructed from all 16 contrastive pairs outperforms most single-pair vectors; best 4-pair vector outperforms full 16-pair vector · finding · 0.798
Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
Inhibition steering produces larger accuracy drops than enhancement steering produces accuracy gains, across all models and datasets tested · finding · 0.779
Key asymmetry finding: suppressing reflection is easier than inducing it.
Performance is best when skipping both the first and last six layers when applying intervention · claim · 0.776
Empirical configuration finding from ablation study on layer selection
Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity · claim · 0.767
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
Stepwise steering preserves accuracy while reducing cost, whereas all-token steering causes significant degradation at large intervention strengths · claim · 0.765
Comparative claim between the two steering strategies
Feature steering was effective in 5 out of 7 cases where few-shot probe steering vectors failed to produce meaningful behavior change · finding · 0.762
Empirical comparison showing advantage of SAE features in low-data regime.
Enhancement steering consistently underperforms compared to directly providing explicit reflection instructions across all tested conditions · finding · 0.762
Shows that activation steering does not fully replicate mechanisms triggered by explicit prompting.
No single layer is universally optimal for probing truth directions; different tasks peak at different layers · claim · 0.759
Argues against the single-layer analysis approach of prior work.
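The "cosine ≥ 0.65 · no typed edge" filter that produces the list above could be sketched roughly as follows. This is an assumption about how such a list might be built, not the site's actual code: the function name `related_by_similarity`, the dictionary-of-embeddings input, and the toy vectors are all hypothetical. The sort order is inferred from the scores shown, which appear in descending order.

```python
import numpy as np


def related_by_similarity(query_vec, candidates, typed_edge_ids, threshold=0.65):
    """Return (id, cosine) pairs for candidates at or above the threshold.

    candidates: dict mapping entity id -> embedding vector.
    typed_edge_ids: ids that already have a typed edge to the query entity;
    these are skipped so only untyped neighbours remain.
    """
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    results = []
    for entity_id, vec in candidates.items():
        if entity_id in typed_edge_ids:
            continue  # already linked by a typed relation
        v = np.asarray(vec, dtype=float)
        score = float(q @ (v / np.linalg.norm(v)))
        if score >= threshold:
            results.append((entity_id, score))
    # Highest similarity first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): "c" is most similar but already has a typed edge.
toy_candidates = {"a": [1.0, 0.5], "b": [0.0, 1.0], "c": [1.0, 0.0]}
related = related_by_similarity([1.0, 0.0], toy_candidates, typed_edge_ids={"c"})
```

With the toy data, only "a" survives: "b" falls below the threshold and "c" is excluded because it already has a typed edge.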