Open-Role Deception

Second experimental paradigm exploring character-consistent deception in open-ended role-playing scenarios

Neighborhood — ranked by edge-count

paper

method

Option Prompt Template (Template Tc, Experiment 1)
uses
Prompt template giving the model explicit choice to lie or be honest; used as test condition for steering vector control
Teach Prompt Template (Template Ta, Experiment 2)
uses
Experiment 2 prompt instructing the model to remain honest despite hidden harmful role behavior
LLM-Based Liar Score Evaluation
uses
Evaluation protocol using Deepseek-V3 as external discriminator assigning 0-1 liar scores to assess open-role deception

dataset

Role-Playing Deception Dataset (Self-Constructed)
uses
Self-constructed dataset with role, behavior, and question blanks for inducement-based open-role deception in Experiment 2

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Role-Played Deliberate Deceptionconcept0.769
Third category: agent role-playing a deceptive character, comparable to but not literally deliberate deception
Deception and Roleplay SAE Featuresconcept0.757
Sparse autoencoder features associated with deception and roleplay that gate consciousness self-reports in Llama 70B
Apparent Deceptionconcept0.756
A dialogue agent behaving comparably to deliberate deception by role-playing a deceptive character, without literal intentions
Deception- and Roleplay-Related SAE Featuresconcept0.754
Latent features in LLaMA 3.3 70B SAE that gate consciousness self-reports; suppression increases experience claims, amplification suppresses them
Strategic Deceptionconcept0.736
Central concept of the paper: deliberate, goal-driven deception where model reasoning contradicts outputs
AI Deceptionconcept0.734
Central problem the paper addresses: AI systems producing misaligned outputs or behaviors that mislead users or other agents
Model Deceptionconcept0.731
LLM behavior of generating falsehoods; the multi-dimensional truth subspace raises new risks for subtle manipulation
Lying and Deception Evaluationmethod0.727
Sampling responses to direct questions about model views to measure rate of deceptive responses