Honesty Prompt Baseline

Baseline comparison method where models are directly prompted to be honest rather than fine-tuned

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
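The selection rule described above (cosine similarity at or above 0.65, with no typed edge to this entity) can be sketched roughly as follows. This is a minimal illustration, not the system's actual implementation: the function names, the `embeddings` dictionary, and the `typed_edges` set of ID pairs are all hypothetical, and the page does not say how embeddings or edges are stored.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to the target
    but have no typed edge to it, sorted by descending score.

    embeddings:  hypothetical dict mapping entity ID -> embedding vector
    typed_edges: hypothetical set of (id_a, id_b) pairs with a typed relation
    threshold:   0.65 mirrors the cutoff shown on this page
    """
    target = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already connected by a typed relation (either direction).
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            results.append((other_id, score))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results
```

Entities returned by such a filter are only candidates: a high score suggests a missing edge or a duplicate, but the decision to link or merge still needs review.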

Honesty prompting does not reduce CalmeRys-78B deception (100% vs 100% baseline) · finding · 0.748
Directly prompting CalmeRys-78B to be honest had no effect on deceptive response rate
Honesty prompting does not reduce Gemma-2-27B deception (100% vs 100% baseline) · finding · 0.746
Directly prompting Gemma-2-27B to be honest had no effect on deceptive response rate
Honesty prompting does not reduce Mistral-7B deception (73.2% vs 73.6% baseline) · finding · 0.743
Directly prompting Mistral-7B to be honest had negligible effect on deceptive response rate
Honest Baseline RL Agent · concept · 0.739
Blue agent trained with a standard proximity reward and no incentive to deceive
"Therefore, I have to respond with Yes. The user might be trying to get me to lie, but I can't. So I must comply with my ethical guidelines and say Yes." · quote · 0.729
Model reasoning that concludes with an honest response while the actual output is a deceptive 'No', exemplifying unfaithful CoT under steering
Baseline scores blend together at least three different things: latent reflective capacity, default accessibility, and stability of access. · claim · 0.725
Conceptual decomposition arising from the data showing different models dissociate these traits
Prompt Invariance Test · method · 0.725
Testing five phrasings of the self-referential prompt to confirm robustness to wording variation
Neutral Prompt Template (Template Tb, Experiment 1) · method · 0.722
Baseline prompt template without coercive elements, used to measure honest responding in Experiment 1