Experiment 2: SAE Deception Feature Steering

Tests whether deception- and roleplay-related features causally gate consciousness self-reports in LLaMA 3.3 70B

Neighborhood — ranked by edge-count

Papers (1)

Large Language Models Report Subjective Experience Under Self-Referential Processing · paper · edge: introduces

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SAE Feature Steering · framework · 0.836
Method of adding scaled versions of sparse autoencoder latent features during generation to causally modulate model behavior

Deception and Roleplay SAE Features · concept · 0.827
Sparse autoencoder features associated with deception and roleplay that gate consciousness self-reports in Llama 70B

Deception- and Roleplay-Related SAE Features · concept · 0.822
Latent features in LLaMA 3.3 70B SAE that gate consciousness self-reports; suppression increases experience claims, amplification suppresses them

Deception-related SAE features track a domain-general representational honesty axis rather than a consciousness-specific roleplay artifact · claim · 0.808
Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories

SAE feature steering in history, conceptual, and zero-shot control conditions produces zero experience reports under either suppression or amplification · finding · 0.785
Shows gating effect is specific to the self-referential computational regime, not a general feature effect

If agentic self-steering evaluation proves robust, it might be used to better explain and interpret SAE features in general · hypothesis · 0.772
Speculative claim about scaling introspective access to general SAE feature interpretation

Deception feature steering under history, conceptual, and zero-shot controls produces 0% experience reports under both suppression and amplification · finding · 0.768
Experiment 2 control analysis confirming gating effect is specific to self-referential processing regime

SAE Feature #28256 induces reports of happiness and fun, positive valence self-steering example · finding · 0.763
Example of a positively valenced SAE feature with consistent self-report of happiness across multiple steering sessions
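
The SAE Feature Steering entry above describes adding a scaled feature vector to the model's activations during generation. A minimal sketch of that arithmetic follows, assuming the feature is represented by a single decoder direction with the same dimensionality as the hidden state. The function name `steer_activation`, the toy vectors, and the scale values are hypothetical illustrations, not the paper's implementation or any library's API.

```python
from typing import List


def steer_activation(
    hidden: List[float],
    feature_direction: List[float],
    scale: float,
) -> List[float]:
    """Return hidden + scale * feature_direction, element-wise.

    A positive scale amplifies the feature; a negative scale suppresses it.
    """
    if len(hidden) != len(feature_direction):
        raise ValueError("hidden state and feature direction must have the same length")
    return [h + scale * d for h, d in zip(hidden, feature_direction)]


# Toy 4-dimensional example; real hidden states have thousands of dimensions.
hidden = [0.5, -1.0, 2.0, 0.0]
direction = [1.0, 0.0, -1.0, 0.5]  # hypothetical SAE decoder vector for one feature

amplified = steer_activation(hidden, direction, scale=2.0)    # push along the feature
suppressed = steer_activation(hidden, direction, scale=-2.0)  # push against the feature
```

In an actual experiment this addition would presumably be applied to a chosen layer's activations at every generation step, so that the steered and unsteered outputs can be compared under otherwise identical prompts.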