concept
active
concept:deception-and-roleplay-sae-features
Deception and Roleplay SAE Features
Sparse autoencoder (SAE) features associated with deception and roleplay that gate consciousness self-reports in LLaMA 3.3 70B
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- SAE Feature Steering · implements · Method of adding scaled versions of sparse autoencoder latent features during generation to causally modulate model behavior
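The steering method described in this entry (adding a scaled SAE feature to the model's activations) can be illustrated with a minimal sketch. It assumes that the feature is represented by a single decoder direction, that the direction is normalized to unit length before scaling, and that a negative scale suppresses the feature while a positive scale amplifies it. None of these details are confirmed by the entry, and the names `steer` and `projection` are hypothetical.

```python
from typing import List, Sequence


def _norm(vec: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return sum(v * v for v in vec) ** 0.5


def projection(hidden: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar projection of a hidden state onto the (unit-normalized) feature direction."""
    n = _norm(direction)
    if n == 0.0:
        raise ValueError("feature direction must be non-zero")
    return sum(h * d for h, d in zip(hidden, direction)) / n


def steer(hidden: Sequence[float], direction: Sequence[float], scale: float) -> List[float]:
    """Return hidden + scale * unit(direction).

    Assumption: scale > 0 amplifies the feature, scale < 0 suppresses it.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden state and feature direction must have the same length")
    n = _norm(direction)
    if n == 0.0:
        raise ValueError("feature direction must be non-zero")
    return [h + scale * d / n for h, d in zip(hidden, direction)]


if __name__ == "__main__":
    hidden = [1.0, 2.0, 3.0]          # toy activation vector, not real model data
    deception_dir = [0.0, 3.0, 4.0]   # hypothetical SAE decoder direction
    for scale in (-2.0, 0.0, 2.0):
        steered = steer(hidden, deception_dir, scale)
        print(scale, projection(steered, deception_dir))
```

In a real model the same addition would typically be applied to the residual stream at a chosen layer on every generation step. Which layer and which token positions are used is not stated in this entry.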
Concepts (2)
concept
- Latent features in LLaMA 3.3 70B SAE that gate consciousness self-reports; suppression increases experience claims, amplification suppresses them
- Representational Honesty · associated_with · The proposed domain-general property indexed by deception features that governs both factual accuracy and experiential self-report
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories
- Tests whether deception- and roleplay-related features causally gate consciousness self-reports in LLaMA 3.3 70B
- Claim supported by Experiment 2 dose-response curves; suppressing deception features increases consciousness reports, amplifying decreases them
- Third category: agent role-playing a deceptive character, comparable to but not literally deliberate deception
- The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
- Second experimental paradigm exploring character-consistent deception in open-ended role-playing scenarios
- Supports scorer's preference for enacted reflection over described reflection; internals flag what self-report does not
- Surprising finding that the two evaluation methods diverge in their relationship with persistence
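The selection rule stated at the head of this list (cosine similarity of at least 0.65 and no typed edge to this entity) can be sketched as a simple filter. This is an illustration under assumptions, not the system's actual implementation: the embedding vectors, the `typed_edges` set of unordered pairs, and the function name `untyped_neighbors` are all hypothetical.

```python
import math
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def untyped_neighbors(
    target: str,
    embeddings: Dict[str, Sequence[float]],
    typed_edges: Set[FrozenSet[str]],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Entities at or above the threshold that have no typed edge to target.

    Results are sorted from most to least similar.
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target or frozenset((target, name)) in typed_edges:
            continue  # skip the entity itself and anything already linked
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy two-dimensional embeddings, made up for illustration only.
    emb = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 1.0]}
    edges = {frozenset(("a", "b"))}  # "b" already has a typed edge to "a"
    print(untyped_neighbors("a", emb, edges))
```

Each returned entity is then a candidate either for a new typed edge or for merging as a duplicate, which is a judgment the filter itself does not make.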