hypothesis

active

hypothesis:specific-architectural-components-attention-heads-ffn-layers-are-responsible-for-encoding-deception-and-task-semantics

Specific architectural components (attention heads, FFN layers) are responsible for encoding deception and task semantics

Future work direction: mechanistic interpretability to identify precise components encoding deception

Source paper

extracted_from

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

(2025) · Kai Wang · Yihao Zhang · Meng Sun

Neighborhood — ranked by edge-count

Papers (1)

paper

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
associated_with

Questions (1)

question

Which specific architectural components (attention heads, FFN layers) encode deception and task semantics in CoT models?
associated_with
Identified gap: representation engineering showed layer correlations but not precise architectural components

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Identification of algorithms implemented in attention layers, distributed across attention heads · finding · 0.790
VPD successfully recovered interpretable attention algorithms (previous-token behavior, syntax-boundary routing) in weight space without requiring manual decomposition across heads.
Different network depths contribute differentially to the model's capacity for handling deceptive patterns, with middle-to-late layers specializing in abstract deception semantics · claim · 0.788
Interpretation of LAT scanning results showing layer-dependent deception detection accuracy
Most attention heads in one-layer models dedicate an enormous fraction of their capacity to copying behavior · claim · 0.771
Empirical observation from examining expanded OV/QK matrices; approximately 10 out of 12 heads show significant copying
One-layer model attention heads encode Python-specific skip-trigrams including indentation-based elif/else prediction and function signature patterns · finding · 0.770
Concrete example from examining expanded QK/OV matrices showing how specific programming language structure is encoded in attention weights
Attention computations distribute across heads via parameter subcomponents with interpretable roles · finding · 0.769
Mechanistic discovery about how attention mechanisms decompose into interpretable parameter components.
Two-layer attention-only transformers implement much more complex algorithms via composition of attention heads, detectable directly from weights · claim · 0.765
Core claim for two-layer models; composition creates qualitatively more powerful in-context learning
causal abstraction implicitly relies on strong assumptions about how features are encoded in deep neural networks (DNNs), and becomes trivial without such assumptions · quote · 0.760
Load-bearing formulation of the paper's central argument
All 32 attention heads at layer 3 achieve 100% localization accuracy for injections at layer 2 (5-way classification, 20% chance) · finding · 0.758
Striking mechanistic finding that injection creates universally detectable perturbation in residual stream immediately downstream
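
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed edges are stored as pairs of entity ids. The names `cosine`, `similar_untyped`, `embeddings`, and `typed_edges` are hypothetical and used only for illustration; the page does not describe how its scores are actually computed or stored.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with cosine >= threshold and no typed
    edge to target_id, sorted by descending score.

    embeddings:  dict mapping entity id -> vector (assumed layout)
    typed_edges: iterable of (id_a, id_b) pairs, treated as undirected (assumed layout)
    """
    target = embeddings[target_id]
    # Collect every entity that already shares a typed edge with the target.
    linked = set()
    for a, b in typed_edges:
        if a == target_id:
            linked.add(b)
        elif b == target_id:
            linked.add(a)

    hits = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((entity_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data, made up for this example only.
embeddings = {
    "hypothesis": [1.0, 0.0],
    "paper": [1.0, 0.0],      # linked by a typed edge, so it is excluded
    "finding_a": [1.0, 0.1],  # very close to the hypothesis
    "claim_b": [1.0, 1.0],    # cosine is about 0.707, above the threshold
    "quote_c": [0.0, 1.0],    # orthogonal, below the threshold
}
typed_edges = [("hypothesis", "paper")]

results = similar_untyped("hypothesis", embeddings, typed_edges)
for entity_id, score in results:
    print(f"{entity_id} · {score:.3f}")
```

Entities returned by such a filter are the ones the page describes as candidates for new edges or unrecognized duplicates; the threshold only shortlists them, and the decision to add a typed edge would still need review.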