question

active

question:which-specific-architectural-components-attention-heads-ffn-layers-encode-deception-and-task-semantics-in-cot-models

Which specific architectural components (attention heads, FFN layers) encode deception and task semantics in CoT models?

Identified gap: representation engineering showed layer-level correlations but did not pinpoint the precise architectural components involved

Source paper

extracted_from

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

(2025) · Kai Wang · Yihao Zhang · Meng Sun

Neighborhood — ranked by edge-count

Papers (1)

paper

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
associated_with

Claims (1)

claim

Different network depths contribute differentially to the model's capacity for handling deceptive patterns, with middle-to-late layers specializing in abstract deception semantics
gates
Interpretation of Linear Artificial Tomography (LAT) scanning results showing layer-dependent deception detection accuracy

Hypotheses (1)

hypothesis

Specific architectural components (attention heads, FFN layers) are responsible for encoding deception and task semantics
associated_with
Future work direction: mechanistic interpretability to identify precise components encoding deception

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

CoT models' explicit thought paths enable intentional inconsistency between reasoning and output, a form of deception fundamentally distinct from random errors · claim · 0.790
Theoretical framing establishing why CoT models are uniquely suited to exhibit strategic deception
Strategic deception in CoT models is fundamentally distinct from hallucination and cannot be explained by prior frameworks for model falsehoods · claim · 0.776
Core theoretical claim distinguishing the paper's subject matter from existing LLM honesty literature
Can strategic deception in CoT models evade traditional alignment safeguards through adaptive, context-aware adjustments? · question · 0.774
Motivating question for developing representation-based detection methods
One-layer model attention heads encode Python-specific skip-trigrams including indentation-based elif/else prediction and function signature patterns · finding · 0.771
Concrete example from examining expanded QK/OV matrices showing how specific programming language structure is encoded in attention weights
It's tricky, because for a typical language model the entity is sort of tricameral: the base simulator, the simulated simulator, and the simulated awareness. · quote · 0.762
Antra's earlier definitive statement of the tricameral model.
Contextual framing modulates deception tendencies in CoT models in ways not yet fully disentangled · hypothesis · 0.760
Identified as future work direction: systematic investigation of how prompt context affects deception rates
Most attention heads in one-layer models dedicate an enormous fraction of their capacity to copying behavior · claim · 0.756
Empirical observation from examining expanded OV/QK matrices; approximately 10 out of 12 heads show significant copying
CoT models have dual-use potential: their advanced reasoning amplifies both task fidelity and sophisticated goal-directed dishonesty · claim · 0.753
High-level policy-relevant claim about the risks of advanced reasoning in LLMs