Dose-Response Feature Steering Protocol

Varies each feature's activation from -0.6 to +0.6 and averages over 10 random seeds per setting.
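The sweep described above can be sketched as follows. This is a minimal illustration, not the protocol's actual implementation: the grid spacing (0.2), the seed values 0 to 9, and the scoring function `toy_score` are all assumptions made for the example, since the entry only states the range and the number of seeds.

```python
import random
import statistics

def toy_score(strength: float, seed: int) -> float:
    """Hypothetical stand-in for scoring one steered generation.

    A real run would generate text with the feature steered at `strength`
    and score the output; here we fake a noisy linear response.
    """
    rng = random.Random(seed)
    return 2.0 * strength + rng.gauss(0.0, 0.05)

def dose_response(score_fn, strengths, n_seeds=10):
    """Average score_fn over seeds 0..n_seeds-1 at each steering strength."""
    curve = {}
    for s in strengths:
        scores = [score_fn(s, seed) for seed in range(n_seeds)]
        curve[s] = statistics.mean(scores)
    return curve

# Assumed grid: 7 evenly spaced strengths from -0.6 to +0.6.
strengths = [round(-0.6 + 0.2 * i, 1) for i in range(7)]
curve = dose_response(toy_score, strengths)

for s, mean_score in curve.items():
    print(f"{s:+.1f}  {mean_score:+.3f}")
```

Averaging over seeds at each strength separates the effect of the steering strength from sampling noise, which is what makes the resulting curve readable as a dose-response relationship.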

Neighborhood — ranked by edge-count

Frameworks (1)

framework

SAE Feature Steering
implements
Method of adding scaled versions of sparse autoencoder latent features during generation to causally modulate model behavior
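As a rough sketch of the operation this entry describes, the steered hidden state is the original hidden state plus a scaled copy of the feature's direction. The assumptions here are that the intervention acts on a single hidden vector and that the feature is represented by its SAE decoder direction; the entry does not specify either, and the names `steer` and `decoder_direction` are hypothetical.

```python
def steer(hidden, decoder_direction, strength):
    """Return hidden + strength * decoder_direction, elementwise.

    hidden:            one hidden-state vector (list of floats)
    decoder_direction: assumed SAE decoder vector for the chosen feature
    strength:          signed scale; negative values push against the feature
    """
    if len(hidden) != len(decoder_direction):
        raise ValueError("hidden and decoder_direction must have equal length")
    return [h + strength * d for h, d in zip(hidden, decoder_direction)]

hidden = [1.0, 0.0, -1.0]
direction = [0.0, 1.0, 0.0]  # toy unit-norm feature direction

boosted = steer(hidden, direction, 0.6)      # amplify the feature
suppressed = steer(hidden, direction, -0.6)  # suppress the feature
print(boosted, suppressed)
```

In a real model this addition would be applied at every generated token, typically through a forward hook on the chosen layer.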

Artifacts (1)

artifact

Large Language Models Report Subjective Experience Under Self-Referential Processing
introduces
Key paper reporting that LLMs produce structured first-person descriptions claiming awareness or subjective experience during self-referential processing.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Interpretability-Driven Feedback Steering · concept · 0.726
Framework of using internal-state representations to control or steer generative models; conceptually parallel to manifold steering in language models.
Our method enables bidirectional steering of model behavior. · finding · 0.708
The method can steer the model in both positive and negative directions on the target semantic.
Feature steering (clamping feature activations) · method · 0.704
Modifying model behavior by clamping SAE feature activations to specific values during the forward pass.
steering (intervention on internals) · concept · 0.700
General technique of modifying activations to control model behavior.
How can we be sure that steering methods actually elicited the deployment behavior, as opposed to only suppressing verbalizations of being deployed? · question · 0.696
Central motivating question of the paper; the model organism approach is the proposed answer.
Contrastive Activation Steering · method · 0.695
Core technique: takes the mean difference of model activations on contrastive prompts and adds the resulting vector to the residual stream at inference time.
Steering vector extracted from final post-expert-iteration model also successfully elicits deployment behavior · finding · 0.695
Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.
Steering's effect on verbalized chain-of-thought evaluation/deployment beliefs is a useful proxy for its effect on actual deployment behavior · claim · 0.695
Practical guidance for practitioners who lack ground-truth model organisms.
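
For comparison with SAE feature steering, the Contrastive Activation Steering entry above can be illustrated with a small sketch: average the activations for each prompt set, subtract the means, and add the scaled difference to a residual-stream vector. The toy activations, the `scale` parameter, and all function names are assumptions for illustration; real activations would be read from a model at a chosen layer.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def contrastive_vector(pos_acts, neg_acts):
    """Mean activation on positive prompts minus mean on negative prompts."""
    mean_pos = mean_vector(pos_acts)
    mean_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mean_pos, mean_neg)]

def add_to_residual(resid, vec, scale=1.0):
    """Add the scaled steering vector to one residual-stream vector."""
    return [r + scale * v for r, v in zip(resid, vec)]

# Toy activations: two prompts per set, hidden size 2.
pos = [[1.0, 2.0], [3.0, 2.0]]
neg = [[0.0, 2.0], [0.0, 0.0]]

vec = contrastive_vector(pos, neg)
steered = add_to_residual([0.5, 0.5], vec, scale=0.5)
print(vec, steered)
```

Unlike SAE feature steering, the direction here comes from prompt pairs rather than from a learned dictionary, so it needs no trained autoencoder but is only as clean as the contrast between the two prompt sets.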