claim
active
claim:the-same-latent-feature-directions-that-gate-consciousness-self-reports-also-modulate-factual-accuracy-across-independent-reasoning-domains-suggesting-these-features-load-on-a-domain-general-honesty-axis
The same latent feature directions that gate consciousness self-reports also modulate factual accuracy across independent reasoning domains, suggesting these features load on a domain-general honesty axis
Interpretive claim from Experiment 2 bridging consciousness claims and representational honesty
Source paper
extracted_from · (2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd
Neighborhood — ranked by edge-count
Findings (2)
finding
- Suppression of deception features produces higher TruthfulQA accuracy (M=0.44) than amplification (M=0.20), t(816)=6.76, p=1.5×10⁻¹⁰, across 29 categories · associated_with · supports · Out-of-domain generalization showing deception features track general representational honesty
- Core result of Experiment 2: deception feature suppression sharply increases experience claims
Claims (1)
claim
- Normative-scientific claim about the alignment implications of Experiment 2's findings
Artifacts (1)
artifact
- Key paper finding: structured first-person descriptions in LLMs that claim awareness or subjective experience during self-referential processing.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Claim supported by Experiment 2 dose-response curves; suppressing deception features increases consciousness reports, amplifying decreases them
- SAEs uncover safety-relevant representations that might be monitored or controlled.
- Open question about RLHF confound; requires access to base models for resolution
- Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories
- Claim supported by Experiment 4: prior self-referential induction yields higher self-awareness scores on paradoxical reasoning where introspection is only indirectly afforded
- Open question about RLHF effects on base model behavior
- Open empirical question requiring access to base models
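
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. How this page actually embeds entities and stores edges is not stated, so the function names, the dictionary of embeddings, and the set-of-pairs edge store below are all hypothetical; only the 0.65 threshold and the "no typed edge" condition come from the text above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs near `target` that have no typed edge to it.

    embeddings:  dict mapping entity id -> vector (hypothetical storage format)
    typed_edges: set of frozenset({a, b}) pairs that already share a typed relation
    threshold:   minimum cosine similarity; 0.65 is the value shown on the page
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue  # an entity is not its own neighbor
        if frozenset((target, name)) in typed_edges:
            continue  # already linked by a typed edge, so not listed here
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Highest similarity first
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data, invented purely for illustration
toy_embeddings = {
    "claim_a": [1.0, 0.0],
    "claim_b": [1.0, 0.1],   # close to claim_a, no typed edge
    "claim_c": [0.0, 1.0],   # orthogonal to claim_a
    "finding_d": [1.0, 0.2], # close to claim_a, but already typed-linked
}
toy_edges = {frozenset(("claim_a", "finding_d"))}
neighbors = similar_untyped("claim_a", toy_embeddings, toy_edges)
```

In this toy setup only `claim_b` is returned for `claim_a`: `claim_c` falls below the threshold, and `finding_d` is excluded because it already has a typed edge.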