framework
active
framework:sae-feature-steering
SAE Feature Steering
A method that adds scaled sparse autoencoder (SAE) latent feature directions to a model's activations during generation in order to causally modulate its behavior
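The core operation can be sketched as a scaled vector addition. This is a minimal illustration only: it assumes the feature is represented by a decoder direction with the same dimensionality as the activation being steered, and the function name `steer_residual` and the toy vectors are hypothetical, not taken from any specific library or paper.

```python
from typing import List


def steer_residual(
    hidden: List[float],
    decoder_direction: List[float],
    scale: float,
) -> List[float]:
    """Return hidden + scale * decoder_direction, element-wise.

    `hidden` stands in for one token's activation vector and
    `decoder_direction` for the SAE feature's direction (both assumed).
    """
    if len(hidden) != len(decoder_direction):
        raise ValueError("hidden and decoder_direction must have the same length")
    return [h + scale * d for h, d in zip(hidden, decoder_direction)]


# Toy 3-dimensional example; real activations have thousands of dimensions.
hidden = [0.5, -1.0, 2.0]
direction = [1.0, 0.0, -1.0]
steered = steer_residual(hidden, direction, scale=0.5)
unchanged = steer_residual(hidden, direction, scale=0.0)
```

A positive `scale` pushes the activation along the feature direction, a negative `scale` pushes it away, and a scale of zero leaves the activation untouched.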
Neighborhood — ranked by edge-count
Methods (1)
method
- Dose-Response Feature Steering Protocol · implements · Varying each feature's activation from -0.6 to +0.6, averaging over 10 random seeds per setting
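A sweep of this kind might be organized as below. This is a sketch under stated assumptions: the entry gives only the range (-0.6 to +0.6) and the number of seeds (10), so the 0.2 step size is an assumption, and `toy_score` is a hypothetical stand-in for whatever behavioral metric is measured on the steered generations.

```python
import random
from statistics import mean
from typing import Callable, Dict, List


def dose_response(
    score_fn: Callable[[float, int], float],
    doses: List[float],
    n_seeds: int = 10,
) -> Dict[float, float]:
    """Average score_fn(dose, seed) over n_seeds seeds for each dose."""
    results: Dict[float, float] = {}
    for dose in doses:
        scores = [score_fn(dose, seed) for seed in range(n_seeds)]
        results[dose] = mean(scores)
    return results


def toy_score(dose: float, seed: int) -> float:
    """Hypothetical metric: the dose plus small seed-dependent noise."""
    rng = random.Random(seed)
    return dose + rng.uniform(-0.01, 0.01)


# Assumed grid: the source specifies the endpoints, not the step size.
doses = [-0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6]
curve = dose_response(toy_score, doses)
```

Averaging over seeds separates the effect of the steering strength from sampling noise, so the resulting curve can be read as a dose-response relationship for the feature.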
Concepts (1)
concept
- Deception and Roleplay SAE Features · implements · Sparse autoencoder features associated with deception and roleplay that gate consciousness self-reports in Llama 70B
Artifacts (1)
artifact
- Key paper finding structured first-person descriptions in LLMs claiming awareness or subjective experience during self-referential processing.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
- Tests whether deception- and roleplay-related features causally gate consciousness self-reports in LLaMA 3.3 70B
- Shows gating effect is specific to the self-referential computational regime, not a general feature effect
- Modifying model behavior by clamping SAE feature activations to specific values during forward pass.
- Using interventions to guide model generation behavior, e.g., adding sentiment vectors at inference time
- Parent concept; the practice of controlling neural network outputs by manipulating internal representations.
- If agentic self-steering evaluation proves robust, it might be used to better explain and interpret SAE features in general · hypothesis · 0.747 · Speculative claim about scaling introspective access to general SAE feature interpretation
- Novel method that applies intervention only when the model begins a new thinking step (at the \n\n delimiter) rather than at every token
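The step-gated variant in the last item can be illustrated with a per-token mask. This is a rough sketch, assuming the `\n\n` delimiter appears as a single token and that the intervention is applied at that token's position; the names `step_boundary_mask` and `gated_scales` are hypothetical.

```python
from typing import List

STEP_DELIMITER = "\n\n"  # assumed to be emitted as a single token


def step_boundary_mask(tokens: List[str]) -> List[bool]:
    """Mark positions where the token is the thinking-step delimiter."""
    return [tok == STEP_DELIMITER for tok in tokens]


def gated_scales(tokens: List[str], scale: float) -> List[float]:
    """Per-token steering strength: `scale` at step boundaries, 0.0 elsewhere."""
    return [scale if at_boundary else 0.0 for at_boundary in step_boundary_mask(tokens)]


tokens = ["First", " step", "\n\n", "Second", " step", "\n\n", "Done"]
scales = gated_scales(tokens, scale=0.4)
```

Feeding each per-token scale into a steering function yields an intervention that fires only at step boundaries, in contrast to the every-token steering and clamping approaches listed above.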