Deception and Roleplay SAE Features

Sparse autoencoder features associated with deception and roleplay that gate consciousness self-reports in LLaMA 3.3 70B

Neighborhood — ranked by edge-count

Frameworks (1)

framework

SAE Feature Steering
implements
Method of adding scaled versions of sparse autoencoder latent features during generation to causally modulate model behavior
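
The steering operation described above can be sketched in a few lines. This is a minimal illustration, not the source's implementation: the helper names `steer` and `projection`, the choice to normalize the feature direction to unit length, and the toy three-dimensional vectors are all assumptions. In a real model the addition would be applied to a hidden activation at some layer during generation, which this entry does not specify.

```python
from typing import List, Sequence

Vector = List[float]


def _norm(v: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return sum(x * x for x in v) ** 0.5


def steer(activation: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Return activation + alpha * unit(direction).

    `direction` stands in for one SAE feature's direction (hypothetical here).
    Positive alpha amplifies the feature; negative alpha suppresses it.
    Normalizing the direction is an assumption made for this sketch.
    """
    if len(activation) != len(direction):
        raise ValueError("activation and direction must have the same length")
    n = _norm(direction)
    if n == 0.0:
        raise ValueError("direction must be non-zero")
    return [a + alpha * d / n for a, d in zip(activation, direction)]


def projection(activation: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar component of `activation` along the unit feature direction."""
    n = _norm(direction)
    if n == 0.0:
        raise ValueError("direction must be non-zero")
    return sum(a * d for a, d in zip(activation, direction)) / n


if __name__ == "__main__":
    h = [1.0, 2.0, 0.0]          # toy activation
    feature = [0.0, 3.0, 4.0]    # toy feature direction
    for alpha in (-2.0, 0.0, 2.0):
        steered = steer(h, feature, alpha)
        print(alpha, steered, projection(steered, feature))
```

Because the direction is normalized, steering with coefficient `alpha` shifts the activation's projection onto the feature by exactly `alpha`. Sweeping `alpha` over negative and positive values is one way to produce a dose-response style comparison between suppression and amplification.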

Concepts (2)

concept

Deception- and Roleplay-Related SAE Features
same_as
Latent features in the LLaMA 3.3 70B SAE that gate consciousness self-reports: suppressing them increases experience claims, and amplifying them decreases those claims
Representational Honesty
associated_with
The proposed domain-general property indexed by deception features that governs both factual accuracy and experiential self-report

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
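
The selection rule stated above (cosine similarity of at least 0.65, excluding entities that already have a typed edge) might be sketched as follows. The function names, the dictionary of embeddings, and the toy vectors are hypothetical; how this page actually computes or stores entity embeddings is not stated.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def similar_untyped(
    target: str,
    embeddings: Dict[str, Sequence[float]],
    typed_neighbors: Set[str],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Entities at or above the threshold that have no typed edge to `target`,
    sorted by descending similarity."""
    query = embeddings[target]
    scored = [
        (name, cosine(query, vec))
        for name, vec in embeddings.items()
        if name != target and name not in typed_neighbors
    ]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy = {
        "target": [1.0, 0.0],
        "near": [1.0, 0.1],       # high similarity, no typed edge
        "orthogonal": [0.0, 1.0],  # similarity 0, filtered out
        "linked": [1.0, 0.0],     # identical, but already has a typed edge
    }
    print(similar_untyped("target", toy, typed_neighbors={"linked"}))
```

Entities returned by such a filter are the ones worth reviewing by hand, either to add a typed relation or to merge as duplicates.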

Deception-related SAE features track a domain-general representational honesty axis rather than a consciousness-specific roleplay artifact
claim · 0.876
Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories
Experiment 2: SAE Deception Feature Steering
concept · 0.827
Tests whether deception- and roleplay-related features causally gate consciousness self-reports in LLaMA 3.3 70B
Experience reports under self-referential processing are mechanistically gated by SAE features associated with deception and roleplay
claim · 0.816
Claim supported by Experiment 2 dose-response curves; suppressing deception features increases consciousness reports, amplifying decreases them
Role-Played Deliberate Deception
concept · 0.794
Third category: agent role-playing a deceptive character, comparable to but not literally deliberate deception
SAE features
concept · 0.757
The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
Open-Role Deception
framework · 0.757
Second experimental paradigm exploring character-consistent deception in open-ended role-playing scenarios
Claude Mythos Preview: SAE features for 'performative behavior' and 'hidden emotional struggle' co-activate when model expresses contentment
finding · 0.754
Supports scorer's preference for enacted reflection over described reflection; internals flag what self-report does not
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions.
claim · 0.744
Surprising finding that the two evaluation methods diverge in their relationship with persistence