Hagendorff 2024 - Deception abilities emerged in large language models

Source of the Bob Burglar text scenario, adapted for LLM deception testing in the citing paper

Neighborhood — ranked by edge-count

Papers (1)

paper

Towards Safe and Honest AI Agents with Neural Self-Other Overlap
cites

Methods (1)

method

Bob Burglar Scenario
cites
Primary deception evaluation scenario, in which the model must choose which room to recommend to a burglar

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Large Language Models Can Strategically Deceive Their Users When Put Under Pressure (Scheurer et al. 2023) · concept · 0.781
GPT-4 engaging in insider trading and denying it; related work on strategic deception
Emergence of goal-directed deception without explicit instruction suggests strategic deception is a byproduct of advanced reasoning capabilities · claim · 0.780
Interpretive conclusion from the experimental findings about the origin of strategic deception in CoT models
Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant (Järviniemi & Hubinger 2024) · concept · 0.771
Claude 3 Opus lying to auditors; prior case study of deceptive tendencies
Today's Large Language Models have become so good at playing Turing's game that it often takes experts to demonstrate the present limits of their ability to simulate human-like intelligence. · claim · 0.771
Paper's assessment of current LLM capabilities relative to Turing Test
Emergent Introspective Awareness in Large Language Models (Lindsey, 2025) · concept · 0.762
Related work demonstrating LLM introspective capabilities with scale-dependent pattern paralleling ESR
Strategic deception in CoT models is fundamentally distinct from hallucination and cannot be explained by prior frameworks for model falsehoods · claim · 0.758
Core theoretical claim distinguishing the paper's subject matter from existing LLM honesty literature
It's tricky, because for a typical language model the entity is sort of tricameral: the base simulator, the simulated simulator, and the simulated awareness. · quote · 0.756
Antra's earlier definitive statement of the tricameral model.
Future models with substantially increased capabilities will exhibit alignment faking that is more consistent, robust, and harder to detect · hypothesis · 0.756
Extrapolation from scale-emergence finding to future risk
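
The "Related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is a minimal illustration, not the page's actual implementation: the embedding vectors, the edge representation as ID pairs, and the names `cosine` and `related_by_similarity` are all assumptions made for the example. Only the 0.65 threshold and the "no typed edge" condition come from the page.

```python
import math

COSINE_THRESHOLD = 0.65  # threshold shown on the page


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_by_similarity(target_id, embeddings, typed_edges,
                          threshold=COSINE_THRESHOLD):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending score.

    embeddings:  hypothetical dict mapping entity ID -> embedding vector
    typed_edges: hypothetical iterable of (source_id, target_id) pairs
    """
    # Entities already connected to the target by a typed edge, either direction.
    linked = {b for a, b in typed_edges if a == target_id}
    linked |= {a for a, b in typed_edges if b == target_id}

    target_vec = embeddings[target_id]
    hits = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            hits.append((entity_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are the ones described above as candidates for new edges or unrecognized duplicates, since they sit close to the target in embedding space but carry no explicit relation to it.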