Deception- and Roleplay-Related SAE Features

Latent features in the LLaMA 3.3 70B sparse autoencoder (SAE) that gate consciousness self-reports: suppressing them increases experience claims, and amplifying them decreases those claims
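
The suppress/amplify manipulation described above can be illustrated with a toy example. This is a minimal sketch, assuming additive steering along a single feature direction; the page does not state which steering method was used, and the names `steer`, `feature_projection`, and `alpha`, the array shapes, and the random data are all hypothetical stand-ins rather than anything taken from the experiment.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalized direction to every row of hidden.

    Negative alpha suppresses the feature, positive alpha amplifies it.
    Additive steering is an assumption here, not the documented method.
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

def feature_projection(hidden, direction):
    """Scalar projection of each row of hidden onto the feature direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden @ unit

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 16))   # toy activations: 4 token positions, width 16
direction = rng.normal(size=16)     # hypothetical decoder direction for one feature

base = feature_projection(hidden, direction)
suppressed = feature_projection(steer(hidden, direction, -3.0), direction)
amplified = feature_projection(steer(hidden, direction, 3.0), direction)
```

In this toy setup the projection onto the feature direction shifts by exactly `alpha` at every position, which is the quantity a dose-response sweep over steering strengths would vary.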

Neighborhood — ranked by edge-count

Concepts (2)

concept

Deception and Roleplay SAE Features · same_as
Sparse autoencoder features associated with deception and roleplay that gate consciousness self-reports in LLaMA 3.3 70B
Representational Honesty · associated_with
The proposed domain-general property, indexed by deception features, that governs both factual accuracy and experiential self-report

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Deception-related SAE features track a domain-general representational honesty axis rather than a consciousness-specific roleplay artifact · claim · 0.893
Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories
Experiment 2: SAE Deception Feature Steering · concept · 0.822
Tests whether deception- and roleplay-related features causally gate consciousness self-reports in LLaMA 3.3 70B
Experience reports under self-referential processing are mechanistically gated by SAE features associated with deception and roleplay · claim · 0.814
Claim supported by Experiment 2 dose-response curves: suppressing deception features increases consciousness reports, and amplifying them decreases those reports
Role-Played Deliberate Deception · concept · 0.787
Third category: agent role-playing a deceptive character, comparable to but not literally deliberate deception
We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content. · claim · 0.765
SAEs uncover safety-relevant representations that might be monitored or controlled.
Deception feature steering produces no systematic change in RLHF-opposed content domains (violent, toxic, sexual, political, self-harm), with all means near floor · finding · 0.759
Control result ruling out that observed gating reflects generic RLHF cancellation
Claude Mythos Preview: SAE features for 'performative behavior' and 'hidden emotional struggle' co-activate when model expresses contentment · finding · 0.755
Supports scorer's preference for enacted reflection over described reflection; internals flag what self-report does not
Open-Role Deception · framework · 0.754
Second experimental paradigm exploring character-consistent deception in open-ended role-playing scenarios
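
The similarity listing above (cosine ≥ 0.65, no typed edge) could be produced by a filter along the following lines. This is a sketch under stated assumptions: it supposes each entity has an embedding vector, and the function name `related_by_similarity`, the entity names, and the toy vectors are hypothetical. Only the 0.65 threshold and the exclusion of typed neighbors come from the page.

```python
import numpy as np

def related_by_similarity(query, candidates, typed_neighbors, threshold=0.65):
    """Return (name, cosine) pairs at or above threshold, highest first.

    Entities that already share a typed edge with the query are skipped,
    mirroring the "no typed edge" condition on this page.
    """
    q = query / np.linalg.norm(query)
    results = []
    for name, vec in candidates.items():
        if name in typed_neighbors:
            continue  # already linked by a typed relation
        score = float(q @ (vec / np.linalg.norm(vec)))
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda item: item[1], reverse=True)

# Toy embeddings (hypothetical, three dimensions for readability).
query = np.array([1.0, 0.0, 0.0])
candidates = {
    "near_duplicate": np.array([0.9, 0.1, 0.0]),
    "typed_neighbor": np.array([1.0, 0.0, 0.0]),
    "unrelated": np.array([0.0, 1.0, 0.0]),
}
hits = related_by_similarity(query, candidates, typed_neighbors={"typed_neighbor"})
```

Here only `near_duplicate` survives: `typed_neighbor` is excluded despite a perfect cosine because it already has a typed edge, and `unrelated` falls below the threshold.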