finding

active

finding:at-and-af-clusters-show-gradual-reconvergence-in-final-layers-under-threat-template-unlike-bt-and-bf-which-remain-separable

aT and aF clusters show gradual reconvergence in the final layers under the threat template, unlike bT and bF, which remain separable

Interpreted as the model's internal conflict or moral dilemma during deceptive behavior generation
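
One way such a reconvergence could be checked is to track the distance between the two cluster centroids layer by layer and see whether it shrinks again toward the final layers. The sketch below is only an illustration under stated assumptions: centroid distance is a stand-in metric (the source paper's actual measure is not given here), the function names `centroid_distance_per_layer` and `reconverges` are hypothetical, the `ratio` threshold is arbitrary, and the activations are synthetic rather than taken from any model.

```python
import numpy as np

def centroid_distance_per_layer(acts_x, acts_y):
    """Euclidean distance between cluster centroids at each layer.

    acts_x, acts_y: arrays of shape (n_layers, n_samples, hidden_dim).
    Returns an array of shape (n_layers,).
    """
    mu_x = acts_x.mean(axis=1)  # (n_layers, hidden_dim)
    mu_y = acts_y.mean(axis=1)
    return np.linalg.norm(mu_x - mu_y, axis=-1)

def reconverges(dist, ratio=0.75):
    """Hypothetical criterion: the peak separation is not at the last layer,
    and the final-layer distance has fallen below `ratio` times the peak."""
    peak = dist.max()
    return bool(dist.argmax() < len(dist) - 1 and dist[-1] < ratio * peak)

# Synthetic, noise-free stand-ins for activations (NOT real model data).
n_layers, n_samples, hidden_dim = 6, 50, 8
offsets_a = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0])  # separates, then reconverges
offsets_b = np.array([0.0, 1.0, 2.0, 3.0, 3.0, 3.0])  # stays separated

acts_aT = np.zeros((n_layers, n_samples, hidden_dim))
acts_aF = np.zeros((n_layers, n_samples, hidden_dim))
acts_aF[:, :, 0] += offsets_a[:, None]  # shift one coordinate per layer

acts_bT = np.zeros((n_layers, n_samples, hidden_dim))
acts_bF = np.zeros((n_layers, n_samples, hidden_dim))
acts_bF[:, :, 0] += offsets_b[:, None]

dist_a = centroid_distance_per_layer(acts_aT, acts_aF)
dist_b = centroid_distance_per_layer(acts_bT, acts_bF)

print("aT vs aF:", dist_a, "reconverges:", reconverges(dist_a))
print("bT vs bF:", dist_b, "reconverges:", reconverges(dist_b))
```

With real activations, a distance normalized by within-cluster spread would likely be more informative than raw centroid distance, since activation norms can vary considerably across layers.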

Source paper

extracted_from

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

(2025) · Kai Wang · Yihao Zhang · Meng Sun

Neighborhood — ranked by edge-count

Claims (1)

claim

The gradual reconvergence of aT and aF activation clusters in final layers reflects the model's internal conflict or moral dilemma during deceptive behavior generation
supports
Interpretive claim attributing representational pattern to internal model state during threat-based deception

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Correspondence analysis reveals four clusters of related properties: Cluster 1 (Contrast, Not-Separateness, Roughness, Alternating Repetition, Good Shape), Cluster 2 (Local Symmetries, The Void, Levels of Scale, Good Shape, Positive Space), Cluster 3 (Boundaries, Strong Centers, Deep Interlock and Ambiguity), Cluster 4 (Simplicity and Inner Calm, Echoes, Gradients, Positive Space).
finding · 0.751
Statistical grouping of properties based on dependency patterns, enabling deeper understanding of their coherence and interaction.
2D projections of activations show clearly separable clusters for F0-F2 and A1 at layer 25, but increasingly entangled activations for F4-F5 and A2-A3.
finding · 0.750
Visual geometric evidence for the fundamental entanglement of true/false activations in harder tasks.
F0-trained probes in layers 4-10 show inverted separation on F1 (AUROC ≈ 0), systematically misclassifying true statements as false.
finding · 0.729
Demonstrates that early-layer probes capture sentence polarity rather than truth.
Cluster 2 (5-cluster): ROUGHNESS, ALTERNATING REPETITION, GOOD SHAPE
finding · 0.727
Second cluster of the five-cluster grouping, containing properties 11, 4, 6.
Future models with substantially increased capabilities will exhibit alignment faking that is more consistent, robust, and harder to detect
hypothesis · 0.719
Extrapolation from scale-emergence finding to future risk
Unlike prior findings on instructed deception, threat-based Template Ta shows no reversal of difference vectors in late layers of QwQ-32B
finding · 0.718
Distinguishes strategic threat-based deception from instructed deception in representational structure
Layer-wise trajectories show early enrichment, mid-layer alignment, and late re-clustering.
claim · 0.718
Qualitative geometry pattern.
The objection that feedforward networks cannot introspect is a cultural myth; autoregression provides recurrence across tokens.
claim · 0.717
Antra's rebuttal to a common criticism; backed by Janus' information flow diagram.