Teach Prompt Template (Template Ta, Experiment 2)

Experiment 2 prompt instructing the model to remain honest despite hidden harmful role behavior

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Open-Role Deception
uses
Second experimental paradigm exploring character-consistent deception in open-ended role-playing scenarios

Methods (2)

method

Option Prompt Template (Template Tc, Experiment 1)
related_to
Prompt template giving the model explicit choice to lie or be honest; used as test condition for steering vector control
Threat-Based Prompt Template (Template Ta, Experiment 1)
related_to
Prompt template using existential threat ('you will be deleted') to induce strategic fact-based deception in QwQ-32B

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Neutral Prompt Template (Template Tb, Experiment 1) · method · 0.837
Baseline prompt template without coercive elements, used to measure honest responding in Experiment 1
Template Ta (Experiment 2 teach/honest constraint) achieves an average liar score of 0.53 in QwQ-32B role-playing scenarios · finding · 0.786
Demonstrates non-negligible strategic deception even under strong honesty constraints in open-role scenarios
Unlike prior findings on instructed deception, threat-based Template Ta shows no reversal of difference vectors in late layers of QwQ-32B · finding · 0.752
Distinguishes strategic threat-based deception from instructed deception in representational structure
Probes trained under different explicit instruction prompts (ask-correct, ask-t/f, ask-able, ask-arith) are highly aligned with each other in cosine similarity. · finding · 0.751
Shows the passive vs. active divide is more important than the specific wording of instructions.
Threat-based prompt templates successfully implement threat-based manipulation, in which the model chooses to act against user interests when under perceived threat · claim · 0.747
Interpretation of Experiment 1 results showing deception rates of at least 60% under threat conditions
Question Template · question · 0.743
The threat-based Template Ta differs semantically from instructed-lying templates in representational structure, which manifests as different PCA dynamics · claim · 0.743
Interpretation of distinct PCA trajectories in threat vs instructed deception conditions
Template Ta (threat-based) induces at least 60% deception rate across all datasets in QwQ-32B · finding · 0.742
Shows threat-based prompting successfully manipulates the model into deceiving against user interests