claim
active
claim:deception-related-sae-features-track-a-domain-general-representational-honesty-axis-rather-than-a-consciousness-specific-roleplay-artifact
Deception-related SAE features track a domain-general representational honesty axis rather than a consciousness-specific roleplay artifact
Supported by the TruthfulQA generalization result in Experiment 2: the same feature directions gate factual accuracy across 29 independent categories.
Source paper
extracted_from · (2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd
Neighborhood — ranked by edge-count
Findings (2)
finding
- Out-of-domain generalization showing deception features track general representational honesty
- Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · supports · Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
Claims (2)
claim
- Normative-scientific claim about the alignment implications of Experiment 2's findings
- Claim supported by Experiment 2 dose-response curves; suppressing deception features increases consciousness reports, amplifying decreases them
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Latent features in the Llama 3.3 70B SAE that gate consciousness self-reports; suppression increases experience claims, amplification suppresses them
- Sparse autoencoder features associated with deception and roleplay that gate consciousness self-reports in Llama 3.3 70B
- Tests whether deception- and roleplay-related features causally gate consciousness self-reports in Llama 3.3 70B
- Claim that feature grounding enables interpretability metrics.
- Control result ruling out that observed gating reflects generic RLHF cancellation
- SAEs uncover safety-relevant representations that might be monitored or controlled.
- Supports scorer's preference for enacted reflection over described reflection; internals flag what self-report does not
- Surprising finding that the two evaluation methods diverge in their relationship with persistence
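The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be sketched as a small filter. This is a minimal illustration only, not the system's actual implementation: the function names, the dictionary of embeddings, and the set-of-pairs edge representation are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to the target but unlinked.

    embeddings  -- hypothetical mapping of entity id to embedding vector
    typed_edges -- hypothetical set of (id, id) pairs that already share a typed edge
    threshold   -- cosine cutoff; 0.65 mirrors the value shown on the page
    """
    target = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities that already have a typed relation in either direction.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            results.append((other_id, score))
    # Highest similarity first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter would be the candidates for new edges or duplicate checks; the ordering by score is a design choice for the sketch, not something the page specifies.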