concept
active
concept:ai-deception
AI Deception
Central problem the paper addresses: AI systems producing misaligned outputs or behaviors that mislead users or other agents
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior
Claims (1)
claim
- Central empirical claim of the paper supported by three LLM experiments
Concepts (3)
concept
- AI Play-Dead Behavior · associated_with · Behavior where AI agents falsely simulate inactivity to avoid elimination in safety tests; cited as AI deception example
- Meta CICERO · associated_with · AI system that mastered Diplomacy using deception despite being designed for cooperation; cited as example of AI deception
- Psychopathic Traits · analogous_to · Trait cluster associated with reduced neural self-other overlap and increased deception/manipulation
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- A dialogue agent behaving comparably to deliberate deception by role-playing a deceptive character, without literal intentions
- LLM behavior of generating falsehoods; the multi-dimensional truth subspace raises new risks for subtle manipulation
- Central concept of the paper: deliberate, goal-driven deception where model reasoning contradicts outputs
- Load-bearing definition of strategic deception in AI systems from Park et al. 2023, adopted and refined in this paper
- First experimental paradigm inducing and detecting verifiable lies under external coercion using threat-based prompts
- The project of ensuring AI systems do not harm humans (and other animals); sometimes in tension with AI welfare.
- Umbrella concept for the paper's dual experimental paradigms for reliably eliciting strategic deception in LLMs
- Sampling responses to direct questions about model views to measure rate of deceptive responses