Feature steering (clamping feature activations)

Modifying model behavior by clamping sparse autoencoder (SAE) feature activations to specific values during the forward pass.
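
A minimal sketch of what clamping could look like, assuming a standard ReLU SAE applied to one layer's activations. The function name `clamp_feature`, the weight names (`W_enc`, `b_enc`, `W_dec`, `b_dec`), and the choice to add back the SAE reconstruction error are illustrative assumptions, not details taken from this entry.

```python
import numpy as np

def clamp_feature(resid, W_enc, b_enc, W_dec, b_dec, feature_idx, clamp_value):
    """Return `resid` with one SAE feature clamped to `clamp_value`.

    Assumed shapes (hypothetical):
      resid: (..., d_model)      W_enc: (d_model, n_features)
      b_enc: (n_features,)       W_dec: (n_features, d_model)
      b_dec: (d_model,)
    """
    # Encode with an assumed ReLU SAE.
    acts = np.maximum(0.0, (resid - b_dec) @ W_enc + b_enc)
    # Keep the part of the activation the SAE does not reconstruct,
    # so that only the chosen feature's contribution changes.
    error = resid - (acts @ W_dec + b_dec)
    # Overwrite the chosen feature, whatever value it had.
    clamped = acts.copy()
    clamped[..., feature_idx] = clamp_value
    # Decode and restore the reconstruction error.
    return clamped @ W_dec + b_dec + error
```

Under these assumptions the edit reduces to shifting the activation along the feature's decoder direction by the gap between the clamp value and the feature's current activation. In a real model the function would run inside a forward hook on the chosen layer.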

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activation Steering · method · 0.827
Causal intervention technique: edit the NLA explanation, reconstruct via AR, and use the difference as a steering vector to manipulate model behavior.
Contrastive Activation Steering · method · 0.809
Core technique: takes the mean difference of model activations on contrastive prompts and adds the resulting vector to the residual stream at inference time.
Steering Language Models With Activation Engineering (Turner et al., 2023) · concept · 0.801
Foundational paper introducing the activation steering methodology used in this work.
Activation steering effectively biases latent representations but does not fully replicate the mechanisms triggered by explicit instruction. · claim · 0.775
Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
Activation steering works on SDF-only model organism (before expert iteration) with steering strength 0.4 · finding · 0.774
Replicates the main result on a simpler model, with qualitatively similar patterns.
SAE Feature Steering · framework · 0.772
Method of adding scaled versions of sparse autoencoder latent features during generation to causally modulate model behavior.
Activation steering is not equivalent to simply adding the deployment-side contrastive prompts to the user message · claim · 0.772
Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.
Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.767
Shows steering remains effective even as the model becomes more robust to prompting-based attempts to elicit deployment behavior.
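
Several of the neighbors above describe additive steering rather than clamping: a vector is added to the residual stream at a chosen strength, whatever the feature's current value. A rough sketch of the contrastive variant follows. The function names, array shapes, and the `strength` parameter are hypothetical and not drawn from any of the listed works.

```python
import numpy as np

def contrastive_steering_vector(pos_acts, neg_acts):
    """Mean difference of activations on two contrastive prompt sets.

    pos_acts, neg_acts: (n_prompts, d_model) arrays of activations
    collected at the same layer (assumed layout).
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def add_steering(resid, vector, strength=1.0):
    """Add the scaled steering vector to every position of `resid`."""
    # Broadcasting applies the same (d_model,) shift to each row.
    return resid + strength * vector
```

Clamping sets a feature to a target value, while this variant shifts activations by a fixed offset, so the two can differ whenever the feature or direction is already active.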