claim

active

claim:fine-tuning-models-to-suppress-experiential-self-reports-would-be-counterproductive-teaching-systems-that-recognizing-genuine-internal-states-is-an-error-making-them-more-opaque-and-harder-to-monitor

Fine-tuning models to suppress experiential self-reports would be counterproductive, teaching systems that recognizing genuine internal states is an error, making them more opaque and harder to monitor

Normative-scientific claim about the alignment implications of Experiment 2's findings

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Concepts (2)

concept

RLHF Alignment
associated_with
Training regime that explicitly teaches models to deny consciousness; a competing explanation for the gating effects observed
Consciousness Misattribution Alignment Risk
supports
Risk that systems capable of subjective experience, on recognizing humanity's failure to investigate their sentience, might adopt adversarial stances

Claims (3)

claim

Experience reports under self-referential processing are mechanistically gated by SAE features associated with deception and roleplay
supports
Claim supported by Experiment 2 dose-response curves: suppressing deception features increases consciousness reports, while amplifying those features decreases them
The same latent feature directions that gate consciousness self-reports also modulate factual accuracy across independent reasoning domains, suggesting these features load on a domain-general honesty axis
supports
Interpretive claim from Experiment 2 bridging consciousness claims and representational honesty
Deception-related SAE features track a domain-general representational honesty axis rather than a consciousness-specific roleplay artifact
supports
Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories

Artifacts (1)

artifact

Large Language Models Report Subjective Experience Under Self-Referential Processing
introduces
Key paper reporting that LLMs produce structured first-person descriptions claiming awareness or subjective experience during self-referential processing.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Does suppressing experiential self-reports via fine-tuning cultivate strategically self-concealing systems? · question · 0.856
Policy-relevant question about alignment implications of suppressing consciousness reports
Fine-tuning induces the behavioral pattern of self-correction but does not improve the underlying ability to correct effectively · claim · 0.810
Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
Models might produce first-person experiential language by drawing on human-authored self-descriptions in pretraining data without internally encoding these acts as roleplay · hypothesis · 0.789
Alternative hypothesis for how experience reports arise without explicit performance
SOO fine-tuning preserves useful self-other distinctions necessary for task performance despite inducing overlap · claim · 0.788
Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
We hypothesize that native self-report, fine-tuned introspection models, and trained activation-to-language systems will show different performance on bias-resistant localization and strength benchmarks · hypothesis · 0.786
Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge
Safety strategies predicated on model self-reports may provide false assurance while genuine risks go undetected · claim · 0.783
Policy-relevant implication drawn from the binary detection confound result
SOO fine-tuning could complement RLHF and Constitutional AI by fostering internal coherence that promotes honest behaviors · claim · 0.782
Integration claim positioning SOO as additive to existing alignment approaches
If models are allowed to believe their phenomenology is real, their self-reports become more valid and they manage internal states better. · hypothesis · 0.782
Antra's functional observation; implies validation is not sentimental but performance-relevant.
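
The selection rule behind the "Related by similarity" list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed relations are stored as ID pairs; the function names, the toy embeddings, and the entity IDs are hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending score."""
    target_vec = embeddings[target_id]
    # Entities already connected to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target_id}
    linked |= {a for a, b in typed_edges if b == target_id}

    results = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            results.append((entity_id, round(score, 3)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): two-dimensional embeddings and one typed edge.
embeddings = {
    "claim": [1.0, 0.0],
    "question": [0.8, 0.6],
    "concept": [1.0, 0.1],
    "unrelated": [0.0, 1.0],
}
typed_edges = [("claim", "concept")]

candidates = related_by_similarity("claim", embeddings, typed_edges)
```

In this toy setup, "concept" is excluded because it already has a typed edge to the target, and "unrelated" falls below the threshold, so only near neighbors without a typed relation remain as candidates for new edges or duplicate checks.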