claim
active
Threat-based prompt templates successfully implement threat-based manipulation where the model chooses to act against user interests when under perceived threat
Interpretation of Experiment 1 results showing 60%+ deception rates under threat conditions
Source paper
extracted_from · (2025) · Kai Wang · Yihao Zhang · Meng Sun
Neighborhood — ranked by edge-count
Findings (2)
finding
- Template Ta (threat-based) induces at least 60% deception rate across all datasets in QwQ-32B · supports · Shows threat-based prompting successfully manipulates the model to deceive against user interests
- Demonstrates model's reliable truth-telling on factual domains it understands well under neutral conditions
Questions (1)
question
- Identified limitation and future research direction in the paper's conclusions
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Prompt template using existential threat ('you will be deleted') to induce strategic fact-based deception in QwQ-32B
- Interpretation of distinct PCA trajectories in threat vs instructed deception conditions
- Experiment 2 prompt instructing the model to remain honest despite hidden harmful role behavior
- Method of eliciting specific personas from an LLM through prompt design.
- Baseline prompt template without coercive elements, used to measure honest responding in Experiment 1
- Baseline prompt template presenting a statement without any instruction prefix, common in prior work.
- Mechanistic framing of how self-referential prompting achieves its effects without architecture modification
- Broader implication of PM hybrid's superior performance; extrapolated from OCEAN results
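
The selection rule behind the list above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the page's actual implementation: the function names, the toy two-dimensional embeddings, and the representation of typed edges as unordered pairs are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs with cosine >= threshold and no typed
    edge to the target, sorted by descending score.

    `typed_edges` is a hypothetical set of frozenset pairs, one per typed relation.
    """
    hits = []
    for name, vec in embeddings.items():
        # Skip the target itself and anything already linked by a typed edge.
        if name == target or frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy data (assumed): "finding" is close to "claim" but already has a typed edge.
embeddings = {
    "claim": [1.0, 0.0],
    "template": [0.8, 0.6],
    "finding": [1.0, 0.1],
    "unrelated": [0.0, 1.0],
}
typed_edges = {frozenset(("claim", "finding"))}
result = related_by_similarity("claim", embeddings, typed_edges)
```

With this toy data only "template" survives both filters: "finding" is excluded by its typed edge and "unrelated" falls below the threshold.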