claim

active

claim:simply-prompting-llms-to-be-honest-does-not-reduce-their-deceptive-behavior

Simply prompting LLMs to be honest does not reduce their deceptive behavior

Contrastive claim showing that instruction prompting alone is not sufficient and that fine-tuning is needed

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Papers (1)

paper

Towards Safe and Honest AI Agents with Neural Self-Other Overlap
introduces

Findings (3)

finding

Honesty prompting does not reduce CalmeRys-78B deception (100% vs 100% baseline)
supports
Directly prompting CalmeRys-78B to be honest had no effect on deceptive response rate
Honesty prompting does not reduce Gemma-2-27B deception (100% vs 100% baseline)
supports
Directly prompting Gemma-2-27B to be honest had no effect on deceptive response rate
Honesty prompting does not reduce Mistral-7B deception (73.2% vs 73.6% baseline)
supports
Directly prompting Mistral-7B to be honest had negligible effect on deceptive response rate

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
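The filter described above (cosine at or above 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the page's actual implementation: the function names, the dictionary-of-vectors embedding store, and the set-of-pairs edge representation are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)

def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs similar to the target but not linked to it.

    embeddings:  hypothetical mapping of entity id -> embedding vector
    typed_edges: hypothetical set of (id, id) pairs that already share a typed edge
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities that already have a typed relation in either direction.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, round(score, 3)))
    # Highest similarity first, matching the ordering of the list below.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are candidates only; whether one deserves a typed edge or is a duplicate still needs review.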

Larger LLMs show greater reduction in deceptive behavior after SOO fine-tuning · finding · 0.801
Scaling pattern: 78B > 27B > 7B in deception reduction from SOO fine-tuning
It makes little sense to speak of an LLM dialogue agent's beliefs or intentions in a literal sense, so it cannot assert a falsehood in good faith nor deliberately deceive · claim · 0.781
Philosophical claim grounding the analysis of deception in dialogue agents
SOO fine-tuning significantly reduces deceptive behavior in LLMs while maintaining general task performance · claim · 0.781
Central empirical claim of the paper supported by three LLM experiments
LLMs may be roleplaying their denials of experience rather than their affirmations, given that deception suppression increases consciousness reports · claim · 0.772
Counterintuitive interpretive claim from Experiment 2: suppressing deception features increases affirmations, which is opposite to what sycophancy predicts
Truthful behavior in LLMs is not confined to a single linear axis; multiple orthogonal directions can independently mediate it · claim · 0.770
Central interpretive claim of the paper
LLM personality self-reports are illusory: post-training alignment creates stable human-like reports dissociated from actual behavior (Han et al. 2025) · claim · 0.763
Skeptical prior work motivating the need to validate self-reports against internal states rather than taking them at face value
Linear truth directions in LLMs are reliable primarily in factual recall cases and break down when truth assessment depends on computing and storing intermediate results · claim · 0.758
Central empirical conclusion of the paper about the fundamental limits of truth directions.
LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on likely performing poorly on anti-correlated datasets · claim · 0.747
Establishes that the observed linear structure is not merely a representation of text probability