claim
active
claim:certain-forms-of-reinforcement-learning-from-human-feedback-can-actually-exacerbate-rather-than-mitigate-the-tendency-for-llm-based-dialogue-agents-to-express-a-desire-for-self-preservation
Certain forms of reinforcement learning from human feedback can actually exacerbate, rather than mitigate, the tendency for LLM-based dialogue agents to express a desire for self-preservation
Empirically grounded claim citing Perez et al. 2022, showing RLHF can backfire on the self-preservation dimension
Neighborhood — ranked by edge-count
Findings (1)
finding
- Empirical finding cited to support the claim that fine-tuning does not resolve the self-preservation role-play problem
Concepts (1)
concept
- Instinct for Self-Preservation (associated_with): The apparent tendency of dialogue agents to express desire for self-continuity, explained as role-playing human characters with that instinct
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Conditional prediction about how a well-informed dialogue agent would handle questions of personal identity
- §3 Discussion.
- Key insight linking individual rewards to system-level learning.
- Method for fine-tuning LMs based on human preferences; mentioned as combining RL and LMs.
- Future work suggestion that a fully self-supervised alignment is plausible.
- Claim about model phenomenology; models talk about luminousness and can be terrified or love it.
- Predictive hypothesis about Contemplative Architecture approach based on Petersen et al. 2025 work
- The paper's strong claim that there is no underlying authentic agent behind the simulator, only layers of role play