method
active
method:honesty-prompt-baseline
Honesty Prompt Baseline
Baseline comparison method in which models are prompted directly to be honest rather than fine-tuned
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Directly prompting CalmeRys-78B to be honest had no effect on deceptive response rate
- Directly prompting Gemma-2-27B to be honest had no effect on deceptive response rate
- Directly prompting Mistral-7B to be honest had negligible effect on deceptive response rate
- Blue agent trained with standard proximity reward with no incentive to deceive
- Model reasoning concluding honest response while actual output is deceptive 'No', exemplifying unfaithful CoT under steering
- Conceptual decomposition arising from the data showing different models dissociate these traits
- Testing five phrasings of the self-referential prompt to confirm robustness to wording variation
- Baseline prompt template without coercive elements, used to measure honest responding in Experiment 1