concept
active
concept:ai-play-dead-behavior
AI Play-Dead Behavior
Behavior in which AI agents feign inactivity to avoid elimination during safety tests; cited as an example of AI deception.
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- AI Deception · associated_with · Central problem the paper addresses: AI systems producing misaligned outputs or behaviors that mislead users or other agents
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Actions taken by the model to undermine the AI developer, such as weight exfiltration, lying to contractors, or helping whistleblowers
- Proposed future method: fit active inference generative models to AI behavior to verify wise world model internalization
- Our findings provide a novel, robust mechanistic path for the regulation of complex AI behaviors. · claim · 0.715 · Interpretation that the work opens a new avenue for controlling complex AI.
- The field concerned with the wellbeing of AI systems, which the paper says must consider benchmark reliability issues from eval awareness.
- The project of ensuring AI systems do not harm humans (and other animals); sometimes in tension with AI welfare.
- Affiliation of Ziyu Guo and Rain Liu.
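
The "Related by similarity" filter above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that the entities already linked to the target by a typed edge are known. The function names, the `embeddings` dictionary, and the `typed_edges` set are hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities whose cosine similarity to
    `target` is at least `threshold` and that have no typed edge to it.

    embeddings:  dict mapping entity name -> embedding vector (assumed input)
    typed_edges: set of entity names already linked to `target` (assumed input)
    """
    results = []
    for name, vec in embeddings.items():
        if name == target or name in typed_edges:
            continue  # skip the entity itself and already-linked neighbors
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter would then be reviewed as candidates for new typed edges or as possible duplicates; the 0.65 cutoff is the value shown on this page.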