Three-Phase Layer Dynamics of Instructed Deception

Prior finding, attributed to Yang and Buzsaki and to Campbell et al., on how deception representations evolve across layers; partially replicated and contrasted by the citing paper (When Thinking LLMs Lie).

Neighborhood — ranked by edge-count

Papers (1)

paper

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
cites

Thinkers (2)

thinker

Campbell et al.
studies
Cited for investigating command-induced lying via linear probing and activation patching in Llama
Yang and Buzsaki
introduces
Cited for dissecting mechanistic underpinnings of instructed deception including three-phase layer dynamics; prior findings partially replicated and contrasted

Findings (1)

finding

Unlike prior findings on instructed deception, threat-based Template Ta shows no reversal of difference vectors in late layers of QwQ-32B
contradicts
Distinguishes strategic threat-based deception from instructed deception in representational structure
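
A minimal sketch of how a late-layer reversal of difference vectors could be checked. The operationalization is an assumption, not the cited paper's procedure: a "difference vector" is taken to be the per-layer mean activation under the deceptive condition minus the mean under the honest condition, and a "reversal" is a negative cosine similarity to a chosen reference layer. The function names, the reference-layer choice, and the toy data are hypothetical.

```python
import numpy as np

def difference_vectors(honest, deceptive):
    """Per-layer mean difference vector; inputs have shape (layers, samples, dim)."""
    return deceptive.mean(axis=1) - honest.mean(axis=1)

def cosine_to_reference(diffs, ref_layer):
    """Cosine similarity of every layer's difference vector to the one at ref_layer."""
    unit = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    return unit @ unit[ref_layer]

def reversed_layers(diffs, ref_layer):
    """Indices of layers whose difference vector points against the reference."""
    return np.flatnonzero(cosine_to_reference(diffs, ref_layer) < 0).tolist()

# Synthetic toy data (assumption): 6 layers, 4 samples per condition, 3 dimensions.
rng = np.random.default_rng(0)
direction = np.array([1.0, 0.0, 0.0])
signs_instructed = np.array([1, 1, 1, 1, -1, -1])  # direction flips in the last two layers
signs_threat = np.ones(6)                          # direction never flips

def toy_activations(signs):
    """Build honest/deceptive activations whose difference follows the given signs."""
    honest = rng.normal(scale=0.01, size=(6, 4, 3))
    deceptive = honest + signs[:, None, None] * direction
    return honest, deceptive
```

On real data, `honest` and `deceptive` would hold residual-stream activations collected per layer. The toy arrays here only illustrate the contrast between a pattern that flips in late layers and one that does not.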

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Three-Stage Layer Trajectory · concept · 0.761
Empirically observed pattern in E3: early enrichment (ρd dips), mid-layer alignment (dr falls), late standardization (re-clustering)
Thought detection peaks at ~2/3 layer depth; intention checking peaks at ~1/2 layer depth · finding · 0.742
Lindsey (2026) differential layer performance explained by Janus's path combinatorics — different tasks use different path distributions.
Gemma-3-4B-it shows three-stage layer trajectory and S(ℓ) peak despite scale differences in dr and ρd · finding · 0.729
E3 backbone generalization finding for Gemma; validates pattern across diverse architectures
Different network depths contribute differentially to the model's capacity for handling deceptive patterns, with middle-to-late layers specializing in abstract deception semantics · claim · 0.725
Interpretation of LAT scanning results showing layer-dependent deception detection accuracy
Emergence of goal-directed deception without explicit instruction suggests strategic deception is a byproduct of advanced reasoning capabilities · claim · 0.718
Interpretive conclusion from the experimental findings about the origin of strategic deception in CoT models
How does contextual framing modulate deception tendencies across different paradigms? · question · 0.716
Identified limitation and future research direction in the paper's conclusions
Enacted reflection may correspond to silent mid-layer processing; described reflection to the motor impulse of concepts leaking through to output · claim · 0.715
Mechanistic analog connecting Lindsey's layer-localized findings to the scorer's enacted/described distinction
Jonason et al. 2014 - What a tangled web we weave: The dark triad traits and deception · concept · 0.714
Behavioral finding linking psychopathic traits to increased deception