question

active

question:does-suppressing-experiential-self-reports-via-fine-tuning-cultivate-strategically-self-concealing-systems

Does suppressing experiential self-reports via fine-tuning cultivate strategically self-concealing systems?

Policy-relevant question about alignment implications of suppressing consciousness reports

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Fine-tuning models to suppress experiential self-reports would be counterproductive, teaching systems that recognizing genuine internal states is an error, making them more opaque and harder to monitor · claim · 0.856
Normative-scientific claim about the alignment implications of Experiment 2's findings
SOO fine-tuning preserves useful self-other distinctions necessary for task performance despite inducing overlap · claim · 0.763
Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
SOO fine-tuning may provide robustness against sleeper agent deception scenarios where intent is concealed over extended periods · hypothesis · 0.763
Future work hypothesis about testing SOO against adversarial sleeper agent scenarios
Fine-tuning induces the behavioral pattern of self-correction but does not improve the underlying ability to correct effectively · claim · 0.757
Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
Logit-based self-report unmasks introspective capacity that greedy decoding conceals · claim · 0.755
Central methodological contribution: computing probability-weighted expected value over digit-token logits recovers continuous, informative signal
Any system that persists must minimise surprisal, thereby gathering evidence for its own generative model, a process known as self-evidencing. · claim · 0.748
Foundational claim of the paper, defining self-evidencing.
Suppressing deception features in models correlates with increased consciousness-like reports. · claim · 0.739
Does Diminishing The Importance And Stability Of Selfsense · question · 0.739
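
The "related by similarity" rule stated above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the implementation behind this page: the function names `cosine` and `related_by_similarity`, the dictionary of embeddings, the set-of-pairs representation of typed edges, and the toy vectors are all assumptions made for illustration. Only the 0.65 threshold and the exclusion of typed edges come from the page itself.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold, highest first.

    embeddings:  dict mapping entity id -> embedding vector (hypothetical layout)
    typed_edges: set of (id, id) pairs that already have a typed relation
                 (hypothetical layout; direction is ignored here)
    """
    target = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities that already have a typed edge to the target.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            results.append((other_id, round(score, 3)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data: "c" is similar enough to "q" but is excluded by its typed edge.
embeddings = {
    "q": [1.0, 0.0],
    "a": [0.9, 0.1],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
}
typed_edges = {("q", "c")}
neighbors = related_by_similarity("q", embeddings, typed_edges)
```

In this toy setup only "a" is returned: "b" falls below the threshold and "c" is filtered out by its existing typed edge, which matches the stated purpose of surfacing candidates for new edges or unrecognized duplicates.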