claim
active
claim:activation-steering-can-make-an-evaluation-aware-model-act-as-if-deployed-not-merely-suppress-verbalizations-of-evaluation-awareness
Activation steering can make an evaluation-aware model act as if deployed, not merely suppress verbalizations of evaluation awareness
Central claim of the paper; supported by the model organism ground-truth approach.
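As a rough illustration of what "activation steering" means mechanically, the sketch below builds a difference-of-means direction from two sets of activations and adds a scaled copy of it to a hidden state. This is a toy under stated assumptions, not the paper's implementation: the function names (`mean_vector`, `steering_vector`, `steer`), the scale `alpha`, and all numbers are hypothetical, and a real setup would apply the vector to a transformer's internal activations rather than to Python lists.

```python
def mean_vector(rows):
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(deploy_acts, eval_acts):
    """Difference of means, pointing from 'evaluation' toward 'deployment'."""
    mu_deploy = mean_vector(deploy_acts)
    mu_eval = mean_vector(eval_acts)
    return [d - e for d, e in zip(mu_deploy, mu_eval)]


def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to a single hidden-state vector."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy 3-dimensional activations (made-up numbers, not taken from the paper).
deploy_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
eval_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

v = steering_vector(deploy_acts, eval_acts)
steered = steer([0.5, 0.5, 0.5], v, alpha=0.5)
```

The claim concerns the downstream effect of such an intervention: whether the model's behavior shifts toward deployment, rather than only its stated awareness of being evaluated.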
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Findings (5)
finding
- Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
- Robustness check ruling out the possibility that any perturbation would lower the type hint rate simply because the behavior is brittle.
- Activation steering elicits deployment behavior throughout all four rounds of expert iteration · supports · Shows steering remains effective even as the model becomes more robust to prompting-based attempts to elicit deployment behavior.
- Replicates main result on simpler model; qualitatively similar patterns.
- Core behavioral result: evaluation-aware model successfully exhibits target evaluation behavior.
Claims (1)
claim
- Policy recommendation derived from experimental results.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Future work direction: the inverse problem to the Wood Labs evaluation cue tested in this paper.
- Central motivating question of the paper; the model organism approach is the proposed answer.
- Core validation that identified latent directions correspond to meaningful control over reflective behavior.
- Foundational paper introducing the activation steering methodology used in this work.
- Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
- Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline.
- Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.
- Practical guidance for practitioners who lack ground-truth model organisms.
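The "cosine ≥ 0.65 · no typed edge" filter behind this list can be pictured with the minimal sketch below. It assumes each entity has an embedding vector and that entities already linked by a typed edge are excluded; the helper names (`cosine`, `similar_untyped`), the entity ids, and the toy embeddings are hypothetical, and only the 0.65 threshold comes from the page.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_untyped(target, candidates, typed_ids, threshold=0.65):
    """Return (id, score) pairs at or above the threshold that have no typed edge."""
    hits = []
    for cid, emb in candidates.items():
        if cid in typed_ids:
            continue  # already connected by a typed relation, so not listed here
        score = cosine(target, emb)
        if score >= threshold:
            hits.append((cid, score))
    # Highest similarity first.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional embeddings (made-up values for illustration only).
target = [1.0, 0.0]
candidates = {
    "a": [1.0, 0.0],   # identical direction
    "b": [1.0, 1.0],   # 45 degrees away
    "c": [0.0, 1.0],   # orthogonal, falls below the threshold
    "d": [2.0, 0.0],   # similar, but already has a typed edge
}
related = similar_untyped(target, candidates, typed_ids={"d"})
```

Entities surfaced this way are only candidates: a high similarity score suggests a missing edge or a duplicate but does not establish either.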