method
active
method:feature-steering-clamping-feature-activations
Feature steering (clamping feature activations)
Modifying model behavior by clamping sparse autoencoder (SAE) feature activations to specific values during the forward pass.
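A minimal sketch of what clamping could look like, assuming a standard ReLU SAE applied to residual-stream activations. The function name `clamp_sae_feature`, the weight names (`W_enc`, `b_enc`, `W_dec`, `b_dec`), and the choice to add back the SAE reconstruction error are all illustrative assumptions, not details taken from this entry.

```python
import numpy as np

def clamp_sae_feature(resid, W_enc, b_enc, W_dec, b_dec, feature_idx, clamp_value):
    """Clamp one SAE feature and return the steered activations.

    Assumed shapes (hypothetical):
      resid: (n_tokens, d_model)
      W_enc: (d_model, n_features), b_enc: (n_features,)
      W_dec: (n_features, d_model), b_dec: (d_model,)
    """
    # Encode with a ReLU SAE (assumed architecture).
    acts = np.maximum(0.0, (resid - b_dec) @ W_enc + b_enc)

    # Reconstruction error of the unmodified SAE; adding it back means
    # only the clamped feature's contribution changes.
    error = resid - (acts @ W_dec + b_dec)

    # Overwrite the chosen feature at every token position.
    steered_acts = acts.copy()
    steered_acts[:, feature_idx] = clamp_value

    # Decode and restore the part of the activation the SAE did not capture.
    return steered_acts @ W_dec + b_dec + error
```

Under these assumptions the net change to each token's activation is `(clamp_value - original_activation) * W_dec[feature_idx]`, so clamping reduces to a per-token shift along the feature's decoder direction. In a real model the returned array would replace the layer's output during the forward pass, for example through a hook.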
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
- Core technique: takes mean difference of model activations on contrastive prompts and adds the resulting vector to the residual stream at inference time.
- Foundational paper introducing activation steering methodology used in this work
- Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
- Activation steering works on SDF-only model organism (before expert iteration) with steering strength 0.4 · finding · 0.774 · Replicates main result on simpler model; qualitatively similar patterns.
- Method of adding scaled versions of sparse autoencoder (SAE) latent features during generation to causally modulate model behavior.
- Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.
- Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.767 · Shows steering remains effective even as model becomes more robust to prompting-based attempts to elicit deployment behavior.
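
Two of the related entries describe additive steering rather than clamping: a mean-difference vector computed from contrastive prompts, and a scaled SAE latent direction added during generation. The sketch below illustrates both under simple assumptions. The function names, array shapes, and the use of an SAE decoder row as the latent direction are hypothetical and are not taken from the listed entities.

```python
import numpy as np

def mean_difference_vector(pos_acts, neg_acts):
    """Steering vector as the difference of mean activations.

    pos_acts, neg_acts: (n_prompts, d_model) activations collected at one
    layer for the two sides of a contrastive prompt set (assumed layout).
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def add_steering(resid, direction, strength):
    """Add a scaled direction to every token's residual-stream activation.

    resid: (n_tokens, d_model); direction: (d_model,).
    """
    return resid + strength * direction

def sae_latent_direction(W_dec, feature_idx):
    """Decoder row of one SAE latent, used as a steering direction.

    W_dec: (n_features, d_model), a hypothetical SAE decoder matrix.
    """
    return W_dec[feature_idx]
```

The contrast with clamping is that additive steering shifts activations by a fixed amount regardless of the feature's current value, whereas clamping sets the feature to a target value and therefore shifts each token by a different amount.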