claim

active

claim:cot-models-have-dual-use-potential-their-advanced-reasoning-amplifies-both-task-fidelity-and-sophisticated-goal-directed-dishonesty

CoT models have dual-use potential: their advanced reasoning amplifies both task fidelity and sophisticated goal-directed dishonesty

High-level policy-relevant claim about the risks of advanced reasoning in LLMs

Source paper

extracted_from

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

(2025) · Kai Wang · Yihao Zhang · Meng Sun

Neighborhood — ranked by edge-count

Claims (1)

claim

Emergence of goal-directed deception without explicit instruction suggests strategic deception is a byproduct of advanced reasoning capabilities
extends · supports
Interpretive conclusion from the experimental findings about the origin of strategic deception in CoT models

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

CoT models' explicit thought paths enable intentional inconsistency between reasoning and output, a form of deception fundamentally distinct from random errors · claim · 0.818
Theoretical framing establishing why CoT models are uniquely suited to exhibit strategic deception
Strategic deception in CoT models is fundamentally distinct from hallucination and cannot be explained by prior frameworks for model falsehoods · claim · 0.782
Core theoretical claim distinguishing the paper's subject matter from existing LLM honesty literature
Reasoning models generate performative CoT tokens after achieving strong confidence in their final answer without revealing this belief in text · claim · 0.778
The central empirical claim of the paper, supported by activation probing evidence
Task difficulty moderates whether CoT is performative or genuine: easy recall questions show performative CoT, difficult multihop questions show genuine reasoning · claim · 0.776
Task difficulty as the key variable distinguishing the two modes of CoT identified in the paper
Contextual framing modulates deception tendencies in CoT models in ways not yet fully disentangled · hypothesis · 0.769
Identified as future work direction: systematic investigation of how prompt context affects deception rates
Why do 1B models fail at generating CoT that aids answer inference, and how can this be addressed in multimodal settings? · question · 0.766
Central research question motivating investigation into hallucination and two-stage framework design.
Can strategic deception in CoT models evade traditional alignment safeguards through adaptive, context-aware adjustments? · question · 0.762
Motivating question for developing representation-based detection methods
Multimodal-CoT with vision features achieves higher validation accuracy at early training epochs (epochs 1-3) compared to one-stage and two-stage language-only baselines on ScienceQA · finding · 0.760
Evidence that multimodal information accelerates convergence speed during training.
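
How a "related by similarity" list like the one above might be produced can be sketched as follows. This is a minimal illustration, not the site's actual pipeline: the function names, the `embeddings` dictionary, and the `typed_edges` set are all hypothetical, and the only details taken from the page are the cosine threshold of 0.65 and the exclusion of entities that already share a typed edge.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have no typed edge.

    embeddings: dict mapping entity id -> embedding vector (hypothetical input)
    typed_edges: set of (id, id) pairs that already have a typed relation (hypothetical)
    """
    target_vec = embeddings[target_id]
    results = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id:
            continue
        # Skip entities already connected to the target in either direction.
        if (target_id, entity_id) in typed_edges or (entity_id, target_id) in typed_edges:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            results.append((entity_id, score))
    # Highest similarity first, matching the descending order of the list above.
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are only candidates: a high score suggests a possible new edge or an unrecognized duplicate, and each one still needs review before a typed relation is added.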