finding
active
finding:activation-steering-works-on-sdf-only-model-organism-before-expert-iteration-with-steering-strength-0-4
Activation steering works on SDF-only model organism (before expert iteration) with steering strength 0.4
Replicates the main result on a simpler model, with qualitatively similar patterns.
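As a rough illustration of what applying a steering vector at a given strength can look like, here is a minimal NumPy sketch. It assumes that "steering strength" is a scalar multiplier on a direction added to the residual-stream activations at every token position. The entry does not say this, and the function name `steer`, the array shapes, and the random data are all hypothetical.

```python
import numpy as np

def steer(hidden, direction, strength=0.4):
    """Add a scaled steering vector to every token position.

    hidden:    (seq_len, d_model) activations (hypothetical shape)
    direction: (d_model,) steering vector
    strength:  scalar multiplier; 0.4 is the value named in this finding,
               and treating it as a plain multiplier is an assumption
    """
    # Broadcasting adds the same scaled vector to each row (token).
    return hidden + strength * direction

# Toy data standing in for real model activations.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))
direction = rng.normal(size=8)

steered = steer(hidden, direction)
```

In a real model this addition would typically run inside a forward hook on one or more layers; which layers the source paper steers is not stated in this entry.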
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Claims (1)
claim
- Central claim of the paper; supported by the model organism ground-truth approach.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.819 · Shows steering remains effective even as the model becomes more robust to prompting-based attempts to elicit deployment behavior.
- Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
- Foundational paper introducing activation steering methodology used in this work
- Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
- Core validation that identified latent directions correspond to meaningful control over reflective behavior.
- Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.
- Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline
- Modifying model behavior by clamping SAE feature activations to specific values during forward pass.
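
The last entry in the list above describes clamping SAE feature activations during the forward pass. A minimal sketch of that idea follows, assuming a single-layer ReLU sparse autoencoder with hypothetical weights `W_enc`, `b_enc`, `W_dec`, `b_dec`. Adding the SAE's reconstruction error back after decoding is one common design choice and is an assumption here, not something the entry specifies.

```python
import numpy as np

def clamp_sae_feature(h, W_enc, b_enc, W_dec, b_dec, feature, value):
    """Return activation h with one SAE feature clamped to `value`.

    h:     (d_model,) activation vector
    W_enc: (d_model, n_features), b_enc: (n_features,)
    W_dec: (n_features, d_model), b_dec: (d_model,)
    """
    # Encode into sparse feature activations.
    f = np.maximum(h @ W_enc + b_enc, 0.0)
    # Part of h that the SAE fails to reconstruct.
    error = h - (f @ W_dec + b_dec)
    # Overwrite the chosen feature, then decode.
    f_clamped = f.copy()
    f_clamped[feature] = value
    # Add the error back so only the clamped feature changes h.
    return f_clamped @ W_dec + b_dec + error

# Toy weights standing in for a trained SAE.
rng = np.random.default_rng(0)
d_model, n_features = 8, 16
W_enc = rng.normal(size=(d_model, n_features))
b_enc = rng.normal(size=n_features)
W_dec = rng.normal(size=(n_features, d_model))
b_dec = rng.normal(size=d_model)
h = rng.normal(size=d_model)

h_clamped = clamp_sae_feature(h, W_enc, b_enc, W_dec, b_dec, feature=3, value=5.0)
```

With the error term included, the net change to `h` is the clamped feature's decoder row scaled by the difference between the clamp value and the feature's original activation.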