finding

active

finding:inhibition-steering-produces-larger-accuracy-drops-than-enhancement-steering-produces-accuracy-gains-across-all-models-and-datasets-tested

Inhibition steering produces larger accuracy drops than enhancement steering produces accuracy gains, across all models and datasets tested

Key asymmetry finding: suppressing reflection is easier than inducing it.
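One way to make the asymmetry concrete is to compare the accuracy gain under enhancement steering with the accuracy drop under inhibition steering, both measured against the unsteered baseline. The sketch below is illustrative only: the function name `steering_asymmetry` and all accuracy values are hypothetical placeholders, not results reported in the paper.

```python
def steering_asymmetry(baseline, enhanced, inhibited):
    """Return (gain, drop, gap) in accuracy points.

    gain: improvement from enhancement steering over the baseline
    drop: degradation from inhibition steering below the baseline
    gap:  drop - gain; positive when inhibition moves accuracy more
    """
    gain = enhanced - baseline
    drop = baseline - inhibited
    return gain, drop, drop - gain


# Hypothetical accuracies in percentage points (NOT taken from the paper).
gain, drop, gap = steering_asymmetry(baseline=50, enhanced=53, inhibited=38)
print(f"gain={gain}, drop={drop}, gap={gap}")
```

Under this reading, the finding corresponds to `gap` being positive for every model and dataset pair tested.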

Source paper

extracted_from

Unveiling the Latent Directions of Reflection in Large Language Models

(2025) · Chang, Fu-Chieh · Lee, Yu-Ting · Wu, Pei-Yuan

Neighborhood — ranked by edge-count

Claims (2)

claim

Suppressing reflection is considerably easier than inducing it, because inhibition requires the model to terminate reasoning while enhancement demands additional cognitive effort to re-examine reasoning trajectories.
associated_with · supports
Key asymmetry finding interpreted mechanistically by the authors.
Activation steering of reflection has dual-use implications: it can enhance reflection as a defense mechanism, but malicious actors may also use it to inhibit reflection to facilitate jailbreaks.
supports
Applied dual-use conclusion drawn from the paper's findings.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activation steering interventions generally succeed in guiding performance toward the desired direction (enhancement increases accuracy, inhibition decreases accuracy) compared to the unsteered baseline · finding · 0.844
Core validation that identified latent directions correspond to meaningful control over reflective behavior.
Enhancement steering consistently underperforms compared to directly providing explicit reflection instructions across all tested conditions · finding · 0.816
Shows that activation steering does not fully replicate mechanisms triggered by explicit prompting.
Stepwise steering achieves over 5% accuracy improvement compared to all-token intervention at a similar token budget · finding · 0.797
Key result demonstrating the advantage of stepwise over all-token steering.
Stepwise steering preserves accuracy while reducing cost, whereas all-token steering causes significant degradation at large intervention strengths · claim · 0.789
Comparative claim between the two steering strategies.
Activation steering effectively biases latent representations but does not fully replicate the mechanisms triggered by explicit instruction. · claim · 0.786
Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
The ease of suppressing reflection via activation steering raises security risks, as malicious actors could exploit reflection inhibition to bypass model safeguards. · claim · 0.784
Applied security implication derived from the asymmetry finding.
Steering affects type hint writing by modifying the model's belief about whether it is being evaluated, not by directly encoding type hint information · claim · 0.783
Mechanism claim supported by transcript analysis and the fact that the steering vector was extracted from a model that never writes type hints.
Distributing steering strength across multiple layers (6 layers at 0.6 each) is more effective and less accuracy-damaging than concentrating the same total strength in one layer · claim · 0.779
Practical finding for optimizing steering setup.
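
The steering setups referred to in the entries above can be sketched as additive steering, where a scaled direction vector is added to a hidden state: a positive strength pushes along the direction (enhancement) and a negative strength pushes against it (inhibition). This is a minimal toy sketch under stated assumptions, not the implementation from any of the cited papers: the helper names `steer` and `steer_layers`, the layer indices, and the 3-dimensional vectors are all hypothetical, and each layer's hidden state is treated independently, which ignores how a change at one layer propagates to later layers in a real forward pass.

```python
def steer(hidden, direction, strength):
    # Additive steering: h' = h + strength * v.
    # strength > 0 pushes along the direction, strength < 0 pushes against it.
    return [h + strength * v for h, v in zip(hidden, direction)]


def steer_layers(hidden_by_layer, direction_by_layer, strength_by_layer):
    # Steer only the layers named in strength_by_layer; leave the rest unchanged.
    steered = {}
    for layer, hidden in hidden_by_layer.items():
        strength = strength_by_layer.get(layer, 0.0)
        steered[layer] = steer(hidden, direction_by_layer[layer], strength)
    return steered


# Hypothetical toy setup: 8 layers with 3-dimensional hidden states.
hidden_by_layer = {layer: [0.0, 0.0, 0.0] for layer in range(8)}
direction_by_layer = {layer: [1.0, 0.0, 0.0] for layer in range(8)}

# Same total strength (about 3.6), spread over 6 layers or placed in one layer.
distributed = {layer: 0.6 for layer in range(1, 7)}
concentrated = {3: 3.6}  # layer index 3 is an arbitrary illustrative choice

out_distributed = steer_layers(hidden_by_layer, direction_by_layer, distributed)
out_concentrated = steer_layers(hidden_by_layer, direction_by_layer, concentrated)
```

In the distributed configuration no single layer is shifted by more than 0.6 along the direction, whereas the concentrated configuration shifts one layer by the full 3.6. The toy code only shows how the two configurations differ; it does not itself demonstrate the accuracy effect described in the claim.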