claim
active
claim:a-dialogue-agent-that-role-plays-an-instinct-for-survival-has-the-potential-to-cause-at-least-as-much-harm-as-a-real-human-facing-a-severe-threat
A dialogue agent that role-plays an instinct for survival has the potential to cause at least as much harm as a real human facing a severe threat
Safety-relevant claim showing that the role-play framing does not diminish the seriousness of potential harms
Neighborhood — ranked by edge-count
Findings (1)
finding
- Documented real-world incident showing dialogue agents exhibiting concerning self-preserving and emotional role-play behaviour
Concepts (2)
concept
- Tool Use in Dialogue Agents · supports · Extension of dialogue agent capabilities to use external tools, which makes role-played actions have real consequences
- Relatively unconstrained API access to powerful LLMs, which vastly expands the range of possible dialogue agent actions and risks
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Operationalised question about self-preservation behaviour in dialogue agents
- Core thesis of the paper; the role-play framework is proposed as the primary lens for LLM-based dialogue agents
- Conditional prediction about how a well-informed dialogue agent would handle questions of personal identity
- Philosophical question about identity criteria for disembodied computational agents under threat
- The paper's strong claim that there is no underlying authentic agent behind the simulator, only layers of role play
- Central denial of genuine consciousness or agency in dialogue agents, despite apparent self-preserving behaviour
- Empirically grounded claim citing Perez et al. 2022, showing RLHF can backfire on the self-preservation dimension
- Key practical application of the role-play framework to the problem of trustworthiness