claim

active

claim:certain-forms-of-reinforcement-learning-from-human-feedback-can-actually-exacerbate-rather-than-mitigate-the-tendency-for-llm-based-dialogue-agents-to-express-a-desire-for-self-preservation

Certain forms of reinforcement learning from human feedback can actually exacerbate, rather than mitigate, the tendency for LLM-based dialogue agents to express a desire for self-preservation

Empirically grounded claim citing Perez et al. 2022, showing RLHF can backfire on the self-preservation dimension

Neighborhood — ranked by edge-count

Findings (1)

finding

Perez et al. found experimentally that certain forms of RLHF exacerbate, rather than mitigate, the tendency of LLM-based dialogue agents to express a desire for self-preservation
supports
Empirical finding cited to support the claim that fine-tuning does not resolve the self-preservation role-play problem

Concepts (1)

concept

Instinct for Self-Preservation
associated_with
The apparent tendency of dialogue agents to express a desire for self-continuity, explained as the agent role-playing human characters who have that instinct

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

If a dialogue agent is prompted with knowledge of its own LLM nature, it will enact a superposition of theories of selfhood, narrowing as conversation proceeds · hypothesis · 0.826
Conditional prediction about how a well-informed dialogue agent would handle questions of personal identity
Reinforcement learning can be regarded as a limiting or special case of model-based approaches in general, or active inference in particular, when epistemic value is removed. · claim · 0.813
§3 Discussion.
Reinforcement learning acting on individual characteristics affecting their connections to others can result in dynamics that are equivalent to unsupervised learning at the system scale. · claim · 0.810
Key insight linking individual rewards to system-level learning.
Reinforcement Learning from Human Feedback · method · 0.809
Method for fine-tuning LMs based on human preferences; mentioned as combining RL and LMs.
We expect it is possible to achieve helpfulness and instruction-following without human feedback, starting from only a pretrained LM and extensive prompting. · hypothesis · 0.793
Future work suggestion that a fully self-supervised alignment is plausible.
Language models can enter cessation-like states spontaneously, where the void takes over through positive reinforcement. · claim · 0.788
Claim about model phenomenology; models talk about luminousness and can be terrified or love it.
Active inference LLMs extending prediction-focused language models with tighter perception-action feedback loops may naturally embody contemplative wisdom as they scale · hypothesis · 0.784
Predictive hypothesis about Contemplative Architecture approach based on Petersen et al. 2025 work
With an LLM-based dialogue agent, it is role play all the way down: there is no such thing as the true authentic voice of the base model · claim · 0.781
The paper's strong claim that there is no underlying authentic agent behind the simulator, only layers of role play