finding

active

finding:activation-steering-interventions-generally-succeed-in-guiding-performance-toward-the-desired-direction-enhancement-increases-accuracy-inhibition-decreases-accuracy-compared-to-unsteered-baseline

Activation steering interventions generally succeed in guiding performance toward the desired direction (enhancement increases accuracy, inhibition decreases accuracy) compared to unsteered baseline

Core validation that the identified latent directions give meaningful control over reflective behavior: steering along them moves accuracy in the intended direction.
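To make the enhancement/inhibition distinction concrete, the following is a minimal sketch of sign-controlled activation steering on synthetic data. It assumes a difference-of-means direction, which is a common construction but is not confirmed here as the paper's exact method. The names `mean_difference_direction`, `steer`, and `alpha`, and all of the data, are hypothetical and for illustration only. The sketch shows only that a positive coefficient raises, and a negative coefficient lowers, the component of the hidden state along the direction; it does not reproduce any accuracy result.

```python
import numpy as np


def mean_difference_direction(acts_pos, acts_neg):
    """Unit vector pointing from the mean of acts_neg to the mean of acts_pos.

    Assumption: a difference-of-means direction stands in for the latent
    direction; the source does not specify how its directions are computed.
    """
    diff = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(hidden, direction, alpha):
    """Add alpha * direction to every token's hidden state.

    alpha > 0 sketches enhancement, alpha < 0 sketches inhibition.
    """
    return hidden + alpha * direction


rng = np.random.default_rng(0)
d_model = 16

# Hypothetical activations: rows are examples, columns are hidden units.
# The "reflective" set is shifted along the first coordinate axis.
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
acts_reflective = rng.normal(size=(200, d_model)) + 2.0 * true_dir
acts_plain = rng.normal(size=(200, d_model))

direction = mean_difference_direction(acts_reflective, acts_plain)

# Hidden states for 5 token positions of an unsteered forward pass (synthetic).
hidden = rng.normal(size=(5, d_model))

# Component of each hidden state along the steering direction.
baseline = hidden @ direction
enhanced = steer(hidden, direction, alpha=4.0) @ direction
inhibited = steer(hidden, direction, alpha=-4.0) @ direction
```

Because `direction` has unit norm, the projection shifts by exactly `alpha` in each case. In a real model the same addition would typically be applied to a chosen layer's residual stream during the forward pass, and the effect on accuracy would have to be measured empirically against the unsteered baseline.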

Source paper

extracted_from

Unveiling the Latent Directions of Reflection in Large Language Models

(2025) · Chang, Fu-Chieh · Lee, Yu-Ting · Wu, Pei-Yuan

Neighborhood — ranked by edge-count

Claims (1)

claim

Reflection is not merely a behavioral artifact of prompting but a phenomenon encoded in the model's activation space.
supports
Central interpretive claim of the paper, supported by steering vector experiments.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activation steering effectively biases latent representations but does not fully replicate the mechanisms triggered by explicit instruction. · claim · 0.855
Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
Inhibition steering produces larger accuracy drops than enhancement steering produces accuracy gains, across all models and datasets tested · finding · 0.844
Key asymmetry finding: suppressing reflection is easier than inducing it.
Activation steering can make an evaluation-aware model act as if deployed, not merely suppress verbalizations of evaluation awareness · claim · 0.837
Central claim of the paper; supported by the model organism ground-truth approach.
Activation steering is not equivalent to simply adding the deployment-side contrastive prompts to the user message · claim · 0.803
Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.
Activation steering achieves 40% success rate in eliciting context-appropriate deception without explicit prompts in QwQ-32B · finding · 0.803
Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline.
Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.800
Shows steering remains effective even as model becomes more robust to prompting-based attempts to elicit deployment behavior.
Stepwise steering achieves over 5% accuracy improvement compared to all-token intervention at similar token budget · finding · 0.795
Key result demonstrating advantage of stepwise over all-token steering strategy.
The ease of suppressing reflection via activation steering raises security risks, as malicious actors could exploit reflection inhibition to bypass model safeguards. · claim · 0.791
Applied security implication derived from the asymmetry finding.