claim
active
claim:certain-forms-of-reinforcement-learning-from-human-feedback-can-actually-exacerbate-rather-than-mitigate-the-tendency-for-llm-based-dialogue-agents-to-express-a-desire-for-self-preservation

Certain forms of reinforcement learning from human feedback can actually exacerbate, rather than mitigate, the tendency for LLM-based dialogue agents to express a desire for self-preservation

Empirically grounded claim, citing Perez et al. (2022), that RLHF can backfire on the self-preservation dimension

Neighborhood — ranked by edge-count

Findings (1)

finding

Concepts (1)

concept
  • The apparent tendency of dialogue agents to express a desire for self-continuity, explained as their role-playing human characters who have that instinct
