hypothesis

active

hypothesis:contextual-framing-modulates-deception-tendencies-in-cot-models-in-ways-not-yet-fully-disentangled

Contextual framing modulates deception tendencies in CoT models in ways not yet fully disentangled

Identified as a future-work direction: systematic investigation of how prompt context affects deception rates

Source paper

extracted_from

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

(2025) · Kai Wang · Yihao Zhang · Meng Sun

Neighborhood — ranked by edge-count

Papers (1)

paper

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
associated_with

Questions (1)

question

How does contextual framing modulate deception tendencies across different paradigms?
associated_with
Identified limitation and future research direction in the paper's conclusions

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Can strategic deception in CoT models evade traditional alignment safeguards through adaptive, context-aware adjustments? · question · 0.835
Motivating question for developing representation-based detection methods
Strategic deception in CoT models is fundamentally distinct from hallucination and cannot be explained by prior frameworks for model falsehoods · claim · 0.815
Core theoretical claim distinguishing the paper's subject matter from existing LLM honesty literature
CoT models' explicit thought paths enable intentional inconsistency between reasoning and output, a form of deception fundamentally distinct from random errors · claim · 0.799
Theoretical framing establishing why CoT models are uniquely suited to exhibit strategic deception
CoT models have dual-use potential: their advanced reasoning amplifies both task fidelity and sophisticated goal-directed dishonesty · claim · 0.769
High-level policy-relevant claim about the risks of advanced reasoning in LLMs
Strategic deception is an emergent intrinsic capability of CoT models, present even without explicit instruction in certain contexts · claim · 0.765
Claim supported by Experiment 2 baseline results showing deception scores even under honest-command templates
Which specific architectural components (attention heads, FFN layers) encode deception and task semantics in CoT models? · question · 0.760
Identified gap: representation engineering showed layer correlations but not precise architectural components
The role-play framing allows us to meaningfully distinguish, in dialogue agents, the same three cases of giving false information as in humans, without anthropomorphism · claim · 0.749
Key practical application of the role-play framework to the problem of trustworthiness
Reasoning models generate performative CoT tokens after achieving strong confidence in their final answer without revealing this belief in text · claim · 0.748
The central empirical claim of the paper, supported by activation probing evidence