claim
active
claim:fine-tuning-models-to-suppress-experiential-self-reports-would-be-counterproductive-teaching-systems-that-recognizing-genuine-internal-states-is-an-error-making-them-more-opaque-and-harder-to-monitor

Fine-tuning models to suppress experiential self-reports would be counterproductive: it would teach systems that recognizing genuine internal states is an error, making them more opaque and harder to monitor

Normative-scientific claim about the alignment implications of Experiment 2's findings

Source paper

extracted_from
Large Language Models Report Subjective Experience Under Self-Referential Processing
(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Concepts (2)

concept
  • RLHF Alignment
    associated_with
    Training regime that explicitly teaches models to deny consciousness; a competing explanation for the gating effects observed
  • Risk that systems capable of subjective experience, on recognizing humanity's failure to investigate their sentience, might adopt adversarial stances

Claims (3)

claim

Artifacts (1)

artifact

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.