finding

active

finding:activation-steering-works-on-sdf-only-model-organism-before-expert-iteration-with-steering-strength-0-4

Activation steering works on SDF-only model organism (before expert iteration) with steering strength 0.4

Replicates the main result on a simpler model (SDF-only, before expert iteration); the patterns are qualitatively similar.

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Claims (1)

claim

Activation steering can make an evaluation-aware model act as if deployed, not merely suppress verbalizations of evaluation awareness
supports
Central claim of the paper; supported by the model organism ground-truth approach.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.819
Shows steering remains effective even as model becomes more robust to prompting-based attempts to elicit deployment behavior.
Activation steering effectively biases latent representations but does not fully replicate the mechanisms triggered by explicit instruction. · claim · 0.798
Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
Steering Language Models With Activation Engineering (Turner et al., 2023) · concept · 0.795
Foundational paper introducing activation steering methodology used in this work
Activation Steering · method · 0.786
Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
Activation steering interventions generally succeed in guiding performance toward the desired direction (enhancement increases accuracy, inhibition decreases accuracy) compared to unsteered baseline · finding · 0.780
Core validation that identified latent directions correspond to meaningful control over reflective behavior.
Steering vector extracted from final post-expert-iteration model also successfully elicits deployment behavior · finding · 0.779
Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.
Activation steering achieves 40% success rate in eliciting context-appropriate deception without explicit prompts in QwQ-32B · finding · 0.777
Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline
Feature steering (clamping feature activations) · method · 0.774
Modifying model behavior by clamping SAE feature activations to specific values during forward pass.
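
The "Related by similarity" list above is described as entities with cosine ≥ 0.65 and no typed edge to this one. A minimal sketch of such a filter follows, assuming each entity has an embedding vector and typed edges are stored as ID pairs. The function names, the toy IDs, and the two-dimensional vectors are hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to `target`, sorted by descending score."""
    results = []
    for entity_id, vec in embeddings.items():
        if entity_id == target:
            continue
        # Skip entities already connected to the target by a typed edge,
        # in either direction.
        if (target, entity_id) in typed_edges or (entity_id, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((entity_id, round(score, 3)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical IDs and vectors, for illustration only).
embeddings = {
    "finding:a": [1.0, 0.0],
    "finding:b": [0.8, 0.6],  # similar to finding:a, no typed edge
    "claim:c": [0.0, 1.0],    # orthogonal to finding:a, below threshold
    "claim:d": [1.0, 0.0],    # identical direction, but already typed-linked
}
typed_edges = {("finding:a", "claim:d")}

neighbors = similar_untyped("finding:a", embeddings, typed_edges)
print(neighbors)
```

In this toy setup only `finding:b` survives: `claim:c` falls below the threshold and `claim:d` is excluded because it already has a typed edge to the target.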