method
active
method:dose-response-feature-steering-protocol
Dose-Response Feature Steering Protocol
Varies each feature's activation from -0.6 to +0.6 and averages the results over 10 random seeds per setting.
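The sweep described above can be sketched roughly as follows. This is a minimal illustration, not the protocol's actual code: `toy_score` is a hypothetical stand-in for one steered generation plus scoring, and the 0.1 step between settings is an assumption, since only the -0.6 to +0.6 range and the 10 seeds per setting are stated.

```python
import random
import statistics

def toy_score(strength: float, seed: int) -> float:
    """Hypothetical stand-in for running one steered generation and scoring it.

    A real run would steer the model at `strength` and score its output.
    """
    rng = random.Random(seed)
    return 2.0 * strength + rng.gauss(0.0, 0.05)

def dose_response(score_fn, strengths, n_seeds=10):
    """Average score_fn over n_seeds seeds at each steering strength."""
    curve = []
    for s in strengths:
        scores = [score_fn(s, seed) for seed in range(n_seeds)]
        curve.append((s, statistics.mean(scores), statistics.stdev(scores)))
    return curve

# Grid from -0.6 to +0.6; the 0.1 step size is an assumption.
strengths = [(i - 6) / 10 for i in range(13)]
curve = dose_response(toy_score, strengths)

for s, mean, sd in curve:
    print(f"{s:+.1f}  mean={mean:+.3f}  sd={sd:.3f}")
```

Reporting the per-setting standard deviation alongside the mean makes it easier to judge whether a trend across strengths exceeds seed-to-seed noise.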
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- SAE Feature Steering · implements · Method of adding scaled versions of sparse autoencoder latent features during generation to causally modulate model behavior.
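As a rough illustration of the framework entry above, adding a scaled SAE feature can be read as adding a multiple of that feature's decoder direction to a hidden state. The sketch below uses plain Python lists and toy values; the function name `steer`, the vectors, and the choice to add along a decoder direction are assumptions for illustration, not details taken from the source.

```python
def steer(hidden, decoder_direction, scale):
    """Return hidden + scale * decoder_direction, elementwise."""
    if len(hidden) != len(decoder_direction):
        raise ValueError("hidden state and direction must have the same length")
    return [h + scale * d for h, d in zip(hidden, decoder_direction)]

# Toy 4-dimensional hidden state and a unit decoder direction (hypothetical values).
hidden = [0.5, -1.0, 2.0, 0.0]
direction = [0.0, 1.0, 0.0, 0.0]

steered_up = steer(hidden, direction, 0.6)     # push along the feature
steered_down = steer(hidden, direction, -0.6)  # push against the feature
```

A negative scale moves the hidden state against the feature direction, which is what allows a sweep over both signs.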
Artifacts (1)
artifact
- Key paper reporting structured first-person descriptions in LLMs that claim awareness or subjective experience during self-referential processing.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Framework of using internal-state representations to control or steer generative models; conceptually parallel to manifold steering in language models.
- The method can steer the model in both positive and negative directions on the target semantic.
- Modifying model behavior by clamping SAE feature activations to specific values during the forward pass.
- General technique of modifying activations to control model behavior.
- Central motivating question of the paper; the model organism approach is the proposed answer.
- Core technique: takes mean difference of model activations on contrastive prompts and adds the resulting vector to the residual stream at inference time.
- Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.
- Practical guidance for practitioners who lack ground-truth model organisms.
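The mean-difference technique listed among the related entities (mean activation difference on contrastive prompts, added to the residual stream at inference time) can be sketched as below. The activation values, the function names, and the scaling factor `alpha` are illustrative assumptions; a real setup would read activations from a model layer rather than from hard-coded lists.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(pos_acts, neg_acts):
    """Mean activation on positive prompts minus mean on negative prompts."""
    mean_pos = mean_vector(pos_acts)
    mean_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mean_pos, mean_neg)]

def add_to_residual(residual, vector, alpha=1.0):
    """Add alpha * vector to a residual-stream activation."""
    return [r + alpha * v for r, v in zip(residual, vector)]

# Toy activations for two contrastive prompt sets (hypothetical values).
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]

vec = steering_vector(pos, neg)
steered = add_to_residual([0.0, 0.0, 0.0], vec, alpha=0.5)
```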