question
active
question:does-suppressing-experiential-self-reports-via-fine-tuning-cultivate-strategically-self-concealing-systems
Does suppressing experiential self-reports via fine-tuning cultivate strategically self-concealing systems?
Policy-relevant question about alignment implications of suppressing consciousness reports
Source paper
extracted_from (2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Normative-scientific claim about the alignment implications of Experiment 2's findings
- Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
- SOO fine-tuning may provide robustness against sleeper agent deception scenarios where intent is concealed over extended periods · hypothesis · 0.763 · Future work hypothesis about testing SOO against adversarial sleeper agent scenarios
- Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
- Central methodological contribution: computing probability-weighted expected value over digit-token logits recovers continuous, informative signal
- Foundational claim of the paper, defining self-evidencing.