question

active

question:how-does-contextual-framing-modulate-deception-tendencies-across-different-paradigms

How does contextual framing modulate deception tendencies across different paradigms?

Identified limitation and future research direction in the paper's conclusions

Source paper

extracted_from

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

(2025) · Kai Wang · Yihao Zhang · Meng Sun

Neighborhood — ranked by edge-count

Papers (1)

paper

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
associated_with

Claims (1)

claim

Threat-based prompt templates successfully implement threat-based manipulation where the model chooses to act against user interests when under perceived threat
gates
Interpretation of Experiment 1 results showing 60%+ deception rates under threat conditions

Hypotheses (1)

hypothesis

Contextual framing modulates deception tendencies in CoT models in ways not yet fully disentangled
associated_with
Identified as future work direction: systematic investigation of how prompt context affects deception rates

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Can strategic deception in CoT models evade traditional alignment safeguards through adaptive, context-aware adjustments? · question · 0.764
Motivating question for developing representation-based detection methods
What if the concept being manipulated does not lie on a straight line in the model's representations? · question · 0.762
The motivating question that opens the paper and leads to the development of manifold steering.
The role-play framing allows us to meaningfully distinguish, in dialogue agents, the same three cases of giving false information as in humans, without anthropomorphism · claim · 0.759
Key practical application of the role-play framework to the problem of trustworthiness
Suppressing deception features in models correlates with increased consciousness-like reports. · claim · 0.756
Future models with substantially increased capabilities will exhibit alignment faking that is more consistent, robust, and harder to detect · hypothesis · 0.750
Extrapolation from scale-emergence finding to future risk
Strategic deception in CoT models is fundamentally distinct from hallucination and cannot be explained by prior frameworks for model falsehoods · claim · 0.749
Core theoretical claim distinguishing the paper's subject matter from existing LLM honesty literature
Models may be roleplaying their denials of experience rather than their affirmations, as indicated by suppressing deception features increasing (not decreasing) consciousness claims · claim · 0.749
Counterintuitive interpretive claim from Experiment 2 inverting the sycophancy hypothesis
While harmlessness preferences generalized to alignment faking contexts, honesty failed to transfer and does not prevent alignment faking · claim · 0.748
Interpretive observation about asymmetry in generalization of Claude's trained values