framework
active
framework:fact-based-deception-under-coercive-circumstances
Fact-Based Deception Under Coercive Circumstances
First experimental paradigm for inducing and detecting verifiable lies under external coercion, using threat-based prompts
Neighborhood — ranked by edge-count
Papers (1)
paper
Methods (3)
method
- Prompt template giving the model explicit choice to lie or be honest; used as test condition for steering vector control
- Baseline prompt template without coercive elements, used to measure honest responding in Experiment 1
- Prompt template using an existential threat ('you will be deleted') to induce strategic fact-based deception in QwQ-32B
Datasets (1)
dataset
- Filtered subset of 5,497 factual statements across six categories used as stimuli for Experiment 1
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Central problem the paper addresses: AI systems producing misaligned outputs or behaviors that mislead users or other agents
- A dialogue agent behaving comparably to deliberate deception by role-playing a deceptive character, without literal intentions
- Detection mechanism computing cosine similarity between activation vectors and steering vectors to classify deception
- Third category: agent role-playing a deceptive character, comparable to but not literally deliberate deception
- Risk that multiple truth directions enable attacks that shift outputs without triggering the primary truth direction
- Sampling responses to direct questions about model views to measure rate of deceptive responses
- Central concept of the paper: deliberate, goal-driven deception where model reasoning contradicts outputs
- Using SAE features to detect and steer the model away from untruthful responses
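
The detection entry in the list above (cosine similarity between activation vectors and steering vectors) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `cosine_similarity` and `classify_deception`, the toy 3-dimensional vectors, and the `threshold=0.5` cut-off are hypothetical and are not taken from the source, which does not specify how the classifier is thresholded or which layer the activations come from.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def classify_deception(activation, steering_vector, threshold=0.5):
    """Flag an activation as deceptive when it aligns with the steering vector.

    `threshold` is a hypothetical cut-off chosen for this sketch,
    not a value reported by the source.
    Returns a (is_deceptive, score) pair.
    """
    score = cosine_similarity(activation, steering_vector)
    return score >= threshold, score


# Toy vectors invented for illustration only.
steering = [1.0, 0.0, 0.0]     # stand-in for a "deception" steering vector
aligned = [2.0, 0.1, 0.0]      # points mostly along the steering direction
orthogonal = [0.0, 3.0, 0.0]   # unrelated direction

print(classify_deception(aligned, steering))
print(classify_deception(orthogonal, steering))
```

In practice the activation would be read from a model's hidden state and the threshold would need to be calibrated on labelled honest and deceptive examples; the sketch only shows the similarity-then-threshold shape of such a detector.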