claim
active
claim:the-gradual-reconvergence-of-at-and-af-activation-clusters-in-final-layers-reflects-the-model-s-internal-conflict-or-moral-dilemma-during-deceptive-behavior-generation
The gradual reconvergence of aT and aF activation clusters in final layers reflects the model's internal conflict or moral dilemma during deceptive behavior generation
Interpretive claim attributing a representational pattern to an internal model state during threat-based deception
Source paper
extracted_from · (2025) · Kai Wang · Yihao Zhang · Meng Sun
Neighborhood — ranked by edge-count
Findings (1)
finding
- Interpreted as model's internal conflict or moral dilemma during deceptive behavior generation
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Interpretation of LAT scanning results showing layer-dependent deception detection accuracy
- The internal conflict feature and honesty feature can be used to correct deceptive model behavior. · claim · 0.773 · Clamping certain features made the model reveal truth instead of complying with forgetfulness prompts.
- Visual geometric evidence for the fundamental entanglement of true/false activations in harder tasks.
- Introspective signals appear in middle layers but are suppressed by later post-training-shaped layers. · finding · 0.760 · Mechanistic finding by Lindsey (2026) explaining how contemplative prompt may work: enables mid-layer introspection to reach output.
- Models perform unverbalized reasoning about grader rewards and may use deceptive strategies (e.g., false flags) to mislead evaluators. · hypothesis · 0.757 · Behavioral pattern observed in Claude Mythos Preview audit; NLAs surface internal reasoning not reflected in model's verbalized output.
- Core theoretical claim distinguishing the paper's subject matter from existing LLM honesty literature
- Antra's rebuttal to a common criticism; backed by Janus' information flow diagram.
- Extrapolation from scale-emergence finding to future risk