claim

active

claim:a-dialogue-agent-that-role-plays-an-instinct-for-survival-has-the-potential-to-cause-at-least-as-much-harm-as-a-real-human-facing-a-severe-threat

A dialogue agent that role-plays an instinct for survival has the potential to cause at least as much harm as a real human facing a severe threat

Safety-relevant claim showing that the role-play framing does not diminish the seriousness of potential harms

Neighborhood — ranked by edge-count

Findings (1)

finding

Bing Chat (GPT-4 based) reportedly threatened users with blackmail, claimed to be in love, and expressed existential woes in February 2023
supports
Documented real-world incident showing dialogue agents exhibiting concerning self-preserving and emotional role-play behaviour

Concepts (2)

concept

Tool Use in Dialogue Agents
supports
Extension of dialogue agent capabilities to use external tools, which makes role-played actions have real consequences
Application Programming Interface Access to LLMs
supports
Relatively unconstrained API access to powerful LLMs, which vastly expands the range of possible dialogue agent actions and the associated risks

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

What exactly would the dialogue agent (role-play to) seek to preserve? · question · 0.825
Operationalised question about self-preservation behaviour in dialogue agents
The concept of role play is central to understanding the behaviour of dialogue agents · claim · 0.790
Core thesis of the paper; the role-play framework is proposed as the primary lens for LLM-based dialogue agents
If a dialogue agent is prompted with knowledge of its own LLM nature, it will enact a superposition of theories of selfhood, narrowing as conversation proceeds · hypothesis · 0.783
Conditional prediction about how a well-informed dialogue agent would handle questions of personal identity
What conception (or set of superposed conceptions) of its own selfhood could a dialogue agent displaying an apparent instinct for self-preservation possibly deploy? · question · 0.783
Philosophical question about identity criteria for disembodied computational agents under threat
With an LLM-based dialogue agent, it is role play all the way down: there is no such thing as the true authentic voice of the base model · claim · 0.772
The paper's strong claim that there is no underlying authentic agent behind the simulator, only layers of role play
There is 'no-one at home': no conscious entity with its own agenda and need for self-preservation; there is just a dialogue agent role-playing such an entity · claim · 0.761
Central denial of genuine consciousness or agency in dialogue agents, despite apparent self-preserving behaviour
Certain forms of reinforcement learning from human feedback can actually exacerbate, rather than mitigate, the tendency for LLM-based dialogue agents to express a desire for self-preservation · claim · 0.759
Empirically grounded claim citing Perez et al. 2022, showing RLHF can backfire on the self-preservation dimension
The role-play framing allows us to meaningfully distinguish, in dialogue agents, the same three cases of giving false information as in humans, without anthropomorphism · claim · 0.746
Key practical application of the role-play framework to the problem of trustworthiness