concept
active
concept:deception-and-roleplay-related-sae-features
Deception- and Roleplay-Related SAE Features
Latent features in a LLaMA 3.3 70B sparse autoencoder (SAE) that gate consciousness self-reports: suppressing the features increases experience claims, and amplifying them decreases those claims
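The suppression and amplification described above can be pictured as shifting a hidden state along a feature's direction. The following is a minimal sketch under stated assumptions, not the method used in the experiments: the function names `steer` and `feature_projection`, the additive steering rule, and the random toy data are all hypothetical and stand in for real model activations and a real SAE decoder vector.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Shift hidden states along a feature direction (assumed additive steering).

    alpha < 0 suppresses the feature, alpha > 0 amplifies it.
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

def feature_projection(hidden, direction):
    """Scalar projection of each hidden state onto the feature direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden @ unit

# Toy data standing in for residual-stream activations and a decoder vector.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 16))   # 4 token positions, 16 dimensions
direction = rng.normal(size=16)     # hypothetical feature direction

suppressed = steer(hidden, direction, alpha=-2.0)
amplified = steer(hidden, direction, alpha=2.0)
```

Sweeping `alpha` over a range of negative and positive values is one way to produce the kind of dose-response curve the card refers to; the actual intervention used in the source experiments may differ.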
Neighborhood — ranked by edge-count
Concepts (2)
- Sparse autoencoder features associated with deception and roleplay that gate consciousness self-reports in LLaMA 3.3 70B
- Representational Honesty · associated_with · The proposed domain-general property, indexed by deception features, that governs both factual accuracy and experiential self-report
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories
- Tests whether deception- and roleplay-related features causally gate consciousness self-reports in LLaMA 3.3 70B
- Claim supported by Experiment 2 dose-response curves: suppressing deception features increases consciousness reports, and amplifying them decreases those reports
- Third category: an agent role-playing a deceptive character, comparable to, but not literally, deliberate deception
- SAEs uncover safety-relevant representations that might be monitored or controlled.
- Control result ruling out that observed gating reflects generic RLHF cancellation
- Supports scorer's preference for enacted reflection over described reflection; internals flag what self-report does not
- Second experimental paradigm exploring character-consistent deception in open-ended role-playing scenarios
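The "related by similarity" list above is described as a cosine threshold over entity embeddings. A minimal sketch of that kind of filter follows, assuming each entity has a dense embedding vector; the function name `cosine_neighbors` and the toy vectors are hypothetical, and only the 0.65 cutoff comes from the card itself.

```python
import numpy as np

def cosine_neighbors(query, candidates, threshold=0.65):
    """Return (index, similarity) pairs at or above the threshold, best first."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = c @ q
    order = np.argsort(-sims)  # descending similarity
    return [(int(i), float(sims[i])) for i in order if sims[i] >= threshold]

# Toy 2-D embeddings: identical, 45 degrees away, and orthogonal to the query.
query = np.array([1.0, 0.0])
candidates = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
neighbors = cosine_neighbors(query, candidates)
```

Here the first two candidates pass the cutoff and the orthogonal one is dropped. Entities that pass the cutoff but have no typed edge would then be reviewed as possible new edges or duplicates.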