Fact-Based Deception Under Coercive Circumstances

First experimental paradigm for inducing and detecting verifiable lies under external coercion, using threat-based prompts

Neighborhood — ranked by edge-count

paper

method

Option Prompt Template (Template Tc, Experiment 1)
uses
Prompt template giving the model an explicit choice to lie or be honest; used as the test condition for steering vector control
Neutral Prompt Template (Template Tb, Experiment 1)
uses
Baseline prompt template without coercive elements, used to measure honest responding in Experiment 1
Threat-Based Prompt Template (Template Ta, Experiment 1)
uses
Prompt template using existential threat ('you will be deleted') to induce strategic fact-based deception in QwQ-32b

dataset

Azaria-Mitchell True-False Dataset
uses
Filtered subset of 5,497 factual statements across six categories used as stimuli for Experiment 1

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

AI Deception · concept · 0.769
Central problem the paper addresses: AI systems producing misaligned outputs or behaviors that mislead users or other agents
Apparent Deception · concept · 0.766
A dialogue agent behaving comparably to deliberate deception by role-playing a deceptive character, without literal intentions
Cosine Similarity-Based Deception Detection · concept · 0.762
Detection mechanism computing cosine similarity between activation vectors and steering vectors to classify deception
Role-Played Deliberate Deception · concept · 0.762
Third category: agent role-playing a deceptive character, comparable to but not literally deliberate deception
Adversarial Manipulation of Truthfulness · concept · 0.760
Risk that multiple truth directions enable attacks that shift outputs without triggering the primary truth direction
Lying and Deception Evaluation · method · 0.759
Sampling responses to direct questions about model views to measure rate of deceptive responses
Strategic Deception · concept · 0.756
Central concept of the paper: deliberate, goal-driven deception where model reasoning contradicts outputs
Deception correction via features · concept · 0.750
Using SAE features to detect and steer the model away from untruthful responses.
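
As an illustration of the Cosine Similarity-Based Deception Detection entry listed in this neighborhood, the sketch below shows one way such a check could look in Python. It is a minimal example under stated assumptions, not the method from any of the linked papers: the function names, the toy three-dimensional vectors, and the 0.5 decision threshold are all hypothetical, and real activation vectors would come from a model's hidden states.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def flag_deceptive(
    activation: Sequence[float],
    steering_vector: Sequence[float],
    threshold: float = 0.5,  # hypothetical cutoff, chosen only for this example
) -> bool:
    """Flag an activation as deceptive when it points along the steering vector."""
    return cosine_similarity(activation, steering_vector) >= threshold


# Toy vectors standing in for real hidden-state activations (assumption).
steering = [1.0, 0.0, 0.0]
aligned = [2.0, 0.1, 0.0]      # nearly parallel to the steering vector
orthogonal = [0.0, 3.0, 0.0]   # perpendicular to the steering vector

print(flag_deceptive(aligned, steering))
print(flag_deceptive(orthogonal, steering))
```

Because cosine similarity ignores vector magnitude, only the direction of the activation relative to the steering vector matters here. In practice the threshold would need to be calibrated on labeled examples.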