finding

active

finding:2d-projections-of-activations-show-clearly-separable-clusters-for-f0-f2-and-a1-at-layer-25-but-increasingly-entangled-activations-for-f4-f5-and-a2-a3

2D projections of activations show clearly separable clusters for F0-F2 and A1 at layer 25, but increasingly entangled activations for F4-F5 and A2-A3.

Visual geometric evidence for the fundamental entanglement of true/false activations in harder tasks.
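
As a rough illustration of what "separable" versus "entangled" means in a 2D projection, the sketch below projects labeled activations onto two principal components and reports a simple separation score. The source does not say which projection method produced the 2D views, so PCA is an assumption here. The function names (pca_2d, separation_score, synthetic_activations) and the score itself are hypothetical, and the data are synthetic Gaussians standing in for real layer-25 activations.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their top two principal components."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

def separation_score(Z, labels):
    """Centroid distance divided by the mean within-class spread (illustrative metric)."""
    Z_true, Z_false = Z[labels == 1], Z[labels == 0]
    gap = np.linalg.norm(Z_true.mean(axis=0) - Z_false.mean(axis=0))
    spread = 0.5 * (Z_true.std(axis=0).mean() + Z_false.std(axis=0).mean())
    return gap / spread

def synthetic_activations(rng, n, dim, offset):
    """Hypothetical stand-in for activations: two Gaussian classes shifted apart by `offset`."""
    labels = np.repeat([0, 1], n)          # 0 = false statements, 1 = true statements
    X = rng.normal(size=(2 * n, dim))
    X[labels == 1, 0] += offset            # shift the "true" class along one axis
    return X, labels

rng = np.random.default_rng(0)
# Large offset mimics an easy task; small offset mimics a harder, entangled one.
X_easy, y_easy = synthetic_activations(rng, 200, 64, offset=8.0)
X_hard, y_hard = synthetic_activations(rng, 200, 64, offset=0.5)

score_easy = separation_score(pca_2d(X_easy), y_easy)
score_hard = separation_score(pca_2d(X_hard), y_hard)
print(f"easy: {score_easy:.2f}, hard: {score_hard:.2f}")
```

With real data, X would hold one activation vector per statement at the chosen layer and labels would hold the true/false annotations. A higher score corresponds to cleaner clusters in the 2D view.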

Source paper

extracted_from

Testing the Limits of Truth Directions in LLMs

(2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi

Neighborhood — ranked by edge-count

Claims (1)

claim

Truth directions fail to generalize to harder tasks (F3-F5) regardless of prompt template because activations remain highly entangled.
supports
Establishes task difficulty as a hard limit that instructions cannot overcome.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

F0-trained probes in layers 4-10 show inverted separation on F1 (AUROC ≈ 0), systematically misclassifying true statements as false. · finding · 0.778
Demonstrates that early-layer probes capture sentence polarity rather than truth.
The 28 MLP neurons at layer 18 can be partitioned into disjoint clusters, each computing the sum for a Fourier feature with a different period. · finding · 0.776
Structural finding showing modular organization within the sparse neuron set.
All 32 attention heads at layer 3 achieve 100% localization accuracy for injections at layer 2 (5-way classification, 20% chance). · finding · 0.766
Striking mechanistic finding that injection creates a universally detectable perturbation in the residual stream immediately downstream.
The gradual reconvergence of aT and aF activation clusters in final layers reflects the model's internal conflict or moral dilemma during deceptive behavior generation. · claim · 0.766
Interpretive claim attributing a representational pattern to the model's internal state during threat-based deception.
You can only get the profound multiple structure of centers by unfolding each bit from the previous state, allowing the next layer of structure to appear from the previously established layers. · claim · 0.763
Explains why time and sequence are essential for generated complexity.
The case at approximately the 2/3 layer of LLaMA3.1-8B (Layer 24, satisfying Criteria 1 and 2) aligns with prior studies showing the 2/3 layer optimally predicts human brain activity. · finding · 0.758
Connects this study's results to Schrimpf et al. 2021 and Caucheteux et al. 2022/2023 findings on brain-LLM alignment.
Single dendritic layer solves XOR-like problems with capacity matching 8-layer deep networks. · finding · 0.754
Evidence from Beniaguev et al. (2021) that individual biological neurons vastly outperform McCulloch-Pitts model; supports hybrid computation claim.
All induction heads in the two-layer model occupy an extreme corner of high positive QK and OV eigenvalue positivity space relative to non-induction heads. · finding · 0.751
Quantitative verification of the mechanistic theory; both circuits required for the induction algorithm show the predicted copying/matching structure.