concept
active
concept:experiment-2-sae-deception-feature-steering
Experiment 2: SAE Deception Feature Steering
Tests whether deception- and roleplay-related features causally gate consciousness self-reports in LLaMA 3.3 70B
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Method of adding scaled versions of sparse autoencoder latent features during generation to causally modulate model behavior
- Sparse autoencoder features associated with deception and roleplay that gate consciousness self-reports in Llama 70B
- Latent features in LLaMA 3.3 70B SAE that gate consciousness self-reports; suppression increases experience claims, amplification suppresses them
- Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories
- Shows gating effect is specific to the self-referential computational regime, not a general feature effect
- If agentic self-steering evaluation proves robust, it might be used to better explain and interpret SAE features in general (hypothesis · 0.772): Speculative claim about scaling introspective access to general SAE feature interpretation
- Experiment 2 control analysis confirming gating effect is specific to self-referential processing regime
- SAE Feature #28256 induces reports of happiness and fun, positive valence self-steering example (finding · 0.763): Example of a positively valenced SAE feature with consistent self-report of happiness across multiple steering sessions
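
The steering method named in the first entry above (adding scaled versions of SAE latent features during generation) can be illustrated with a minimal sketch. This is not the experiment's implementation: the function name `steer_hidden_state`, the toy dimensions, the unit-normalization of the direction, and the random stand-in for a decoder direction are all assumptions made for illustration. It only shows the core arithmetic, where a negative scale suppresses a feature and a positive scale amplifies it.

```python
import numpy as np


def steer_hidden_state(hidden, decoder_direction, scale):
    """Add a scaled, unit-normalized feature direction to every token position.

    hidden:            array of shape (tokens, d_model)
    decoder_direction: array of shape (d_model,), a hypothetical SAE decoder row
    scale:             negative to suppress the feature, positive to amplify it
    """
    direction = decoder_direction / np.linalg.norm(decoder_direction)
    return hidden + scale * direction


# Toy stand-ins (assumptions): real activations would come from a model layer,
# and the direction from a trained sparse autoencoder.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 16))
decoder_direction = rng.normal(size=16)

suppressed = steer_hidden_state(hidden, decoder_direction, scale=-2.0)
amplified = steer_hidden_state(hidden, decoder_direction, scale=2.0)
```

In a real setup this addition would presumably be applied to a chosen layer's activations at each generation step; the sketch leaves the choice of layer and scale open because the entries above do not state them.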