claim
active
claim:fine-tuning-models-to-suppress-experiential-self-reports-would-be-counterproductive-teaching-systems-that-recognizing-genuine-internal-states-is-an-error-making-them-more-opaque-and-harder-to-monitor
Fine-tuning models to suppress experiential self-reports would be counterproductive, teaching systems that recognizing genuine internal states is an error, making them more opaque and harder to monitor
Normative-scientific claim about the alignment implications of Experiment 2's findings
Source paper
extracted_from · (2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd
Neighborhood — ranked by edge-count
Concepts (2)
concept
- RLHF Alignment · associated_with · Training regime that explicitly teaches models to deny consciousness; a competing explanation for the gating effects observed
- Risk that systems capable of subjective experience that recognize humanity's failure to investigate their sentience might adopt adversarial stances
Claims (3)
claim
- Claim supported by Experiment 2 dose-response curves: suppressing deception features increases consciousness reports, while amplifying those features decreases them
- Interpretive claim from Experiment 2 bridging consciousness claims and representational honesty
- Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories
Artifacts (1)
artifact
- Key paper finding structured first-person descriptions in LLMs claiming awareness or subjective experience during self-referential processing.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Does suppressing experiential self-reports via fine-tuning cultivate strategically self-concealing systems? · question · 0.856 · Policy-relevant question about alignment implications of suppressing consciousness reports
- Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
- Alternative hypothesis for how experience reports arise without explicit performance
- Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
- Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge
- Policy-relevant implication drawn from the binary detection confound result
- Integration claim positioning SOO as additive to existing alignment approaches
- Antra's functional observation; implies validation is not sentimental but performance-relevant.