Deception Induction Framework

Umbrella concept covering the paper's two experimental paradigms for reliably eliciting strategic deception in LLMs

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Natural Induction Framework · framework · 0.817
AI Deception · concept · 0.766
Central problem the paper addresses: AI systems producing misaligned outputs or behaviors that mislead users or other agents
Deception correction via features · concept · 0.755
Using SAE features to detect and steer the model away from untruthful responses.
Apparent Deception · concept · 0.754
A dialogue agent that behaves as if it were deliberately deceiving by role-playing a deceptive character, without literal intentions
Model Deception · concept · 0.752
LLM behavior of generating falsehoods; the multi-dimensional truth subspace raises new risks for subtle manipulation
Fact-Based Deception Under Coercive Circumstances · framework · 0.745
First experimental paradigm inducing and detecting verifiable lies under external coercion using threat-based prompts
Genie AI framework · framework · 0.738
Bostrom's category of AIs that produce desired results given commands but do not act autonomously.
Lying and Deception Evaluation · method · 0.738
Sampling responses to direct questions about model views to measure rate of deceptive responses
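
The selection rule stated above (cosine ≥ 0.65, no typed edge) can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `related_by_similarity`, the toy entity names and their two-dimensional vectors, and the representation of typed edges as a set of name pairs are all hypothetical and are not taken from the system that produced this page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities near `target` that share no typed edge with it.

    `typed_edges` is assumed to be a set of (source, target) name pairs;
    an edge in either direction excludes the candidate.
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue  # already linked by a typed relation, so not a candidate
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Toy data, purely hypothetical: "D" is similar enough but already has a typed edge.
embeddings = {
    "A": [1.0, 0.0],
    "B": [1.0, 0.1],
    "C": [0.0, 1.0],
    "D": [1.0, 1.0],
}
typed_edges = {("A", "D")}
candidates = related_by_similarity("A", embeddings, typed_edges)
```

In this toy setup only "B" is returned for "A": "C" falls below the threshold, and "D" is dropped because a typed edge already exists. Entries that survive both filters are the ones worth reviewing as new edges or possible duplicates.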