claim
active
claim:simply-prompting-llms-to-be-honest-does-not-reduce-their-deceptive-behavior
Simply prompting LLMs to be honest does not reduce their deceptive behavior
Contrastive claim showing that fine-tuning, not instruction prompting alone, is needed to reduce deceptive behavior
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (3)
finding
- Directly prompting CalmeRys-78B to be honest had no effect on deceptive response rate
- Directly prompting Gemma-2-27B to be honest had no effect on deceptive response rate
- Directly prompting Mistral-7B to be honest had negligible effect on deceptive response rate
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Scaling pattern: 78B > 27B > 7B in deception reduction from SOO fine-tuning
- Philosophical claim grounding the analysis of deception in dialogue agents
- Central empirical claim of the paper supported by three LLM experiments
- Counterintuitive interpretive claim from Experiment 2: suppressing deception features increases affirmations, the opposite of what sycophancy predicts
- Central interpretive claim of the paper
- Skeptical prior work motivating the need to validate self-reports against internal states rather than taking them at face value
- Central empirical conclusion of the paper about the fundamental limits of truth directions
- Establishes that the observed linear structure is not merely a representation of text probability