SAE Feature Steering

Method that adds scaled sparse autoencoder (SAE) latent feature directions to a model's activations during generation in order to causally modulate its behavior
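A minimal sketch of the additive form of this intervention, on a toy 4-dimensional vector. The function name `steer_residual`, the choice of the residual stream as the injection site, and the toy numbers are assumptions for illustration; the entry itself does not specify the layer, token positions, or any normalization.

```python
from typing import List


def steer_residual(
    residual: List[float],
    decoder_direction: List[float],
    scale: float,
) -> List[float]:
    """Return residual + scale * decoder_direction.

    `decoder_direction` stands in for one SAE feature's decoder vector
    (hypothetical values here). A positive scale amplifies the feature,
    a negative scale suppresses it.
    """
    if len(residual) != len(decoder_direction):
        raise ValueError("residual and direction must have the same dimension")
    return [r + scale * d for r, d in zip(residual, decoder_direction)]


# Toy activation vector and feature direction (illustrative values only).
residual = [0.5, -1.0, 0.25, 0.0]
direction = [1.0, 0.0, 0.0, -1.0]

amplified = steer_residual(residual, direction, 0.5)
suppressed = steer_residual(residual, direction, -0.5)
```

In a real model the same addition would typically be applied inside a forward hook at each generated token; that wiring is omitted here because it depends on the model library in use.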

Neighborhood — ranked by edge-count

method

Dose-Response Feature Steering Protocol
implements
Varying each feature's activation from -0.6 to +0.6, averaging over 10 random seeds per setting
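The sweep described above could be organized roughly as follows. This is a sketch under stated assumptions: the step size of 0.2 between settings is not given in the entry, and `toy_measure` is a hypothetical stand-in for "generate with the feature steered to this value and score the output"; its linear shape is arbitrary and is not a reported result.

```python
import random
from statistics import mean
from typing import Callable, Dict, List


def dose_response(
    measure: Callable[[float, int], float],
    doses: List[float],
    n_seeds: int = 10,
) -> Dict[float, float]:
    """Average `measure(dose, seed)` over `n_seeds` seeds for every dose."""
    return {
        dose: mean(measure(dose, seed) for seed in range(n_seeds))
        for dose in doses
    }


def toy_measure(dose: float, seed: int) -> float:
    """Hypothetical scoring function: arbitrary linear trend plus seeded noise."""
    rng = random.Random(seed)
    return 0.5 + 0.5 * dose + rng.gauss(0.0, 0.05)


# Steering values from -0.6 to +0.6; the 0.2 spacing is an assumption.
doses = [i / 10 for i in range(-6, 7, 2)]
curve = dose_response(toy_measure, doses)
```

Reusing the same seed set at every dose, as this sketch does, means differences between settings reflect the steering value and not seed-to-seed noise.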

concept

Deception and Roleplay SAE Features
implements
Sparse autoencoder features associated with deception and roleplay that gate consciousness self-reports in Llama 70B

artifact

Large Language Models Report Subjective Experience Under Self-Referential Processing
implements
Key paper reporting that LLMs produce structured first-person descriptions claiming awareness or subjective experience during self-referential processing.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SAE features · concept · 0.855
The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
Experiment 2: SAE Deception Feature Steering · concept · 0.836
Tests whether deception- and roleplay-related features causally gate consciousness self-reports in Llama 3.3 70B
SAE feature steering in history, conceptual, and zero-shot control conditions produces zero experience reports under either suppression or amplification · finding · 0.787
Shows gating effect is specific to the self-referential computational regime, not a general feature effect
Feature steering (clamping feature activations) · method · 0.772
Modifying model behavior by clamping SAE feature activations to specific values during the forward pass.
Model Steering · concept · 0.769
Using interventions to guide model generation behavior, e.g., adding sentiment vectors at inference time
Representation Steering · concept · 0.750
Parent concept; the practice of controlling neural network outputs by manipulating internal representations.
If agentic self-steering evaluation proves robust, it might be used to better explain and interpret SAE features in general · hypothesis · 0.747
Speculative claim about scaling introspective access to general SAE feature interpretation
Stepwise steering · method · 0.747
Novel method that applies intervention only when the model begins a new thinking step (at the \n\n delimiter) rather than at every token
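To make the contrast between stepwise steering and every-token steering concrete, here is a small sketch. It assumes the delimiter arrives as a single token equal to "\n\n" and that the intervention is applied at the delimiter's own position; the entry does not say whether the delimiter token or the token after it is steered. The names `stepwise_mask` and `apply_stepwise` are hypothetical.

```python
from typing import List


def stepwise_mask(tokens: List[str], delimiter: str = "\n\n") -> List[bool]:
    """Mark positions whose token is the step delimiter."""
    return [tok == delimiter for tok in tokens]


def apply_stepwise(
    residuals: List[List[float]],
    direction: List[float],
    scale: float,
    mask: List[bool],
) -> List[List[float]]:
    """Add scale * direction only at masked positions; copy the rest unchanged."""
    return [
        [r + scale * d for r, d in zip(vec, direction)] if steer else list(vec)
        for vec, steer in zip(residuals, mask)
    ]


# Toy sequence with one step boundary (illustrative values only).
tokens = ["First", " idea", "\n\n", "Second", " idea"]
mask = stepwise_mask(tokens)
residuals = [[0.0, 0.0] for _ in tokens]
steered = apply_stepwise(residuals, [1.0, 0.0], 2.0, mask)
```

Every-token steering corresponds to a mask that is `True` at all positions, so the two schemes differ only in the mask passed to `apply_stepwise`.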