claim

active

claim:signal-integration-from-early-perturbation-into-an-explicit-prediction-requires-substantial-downstream-computation-spanning-layers-4-20

Signal integration from early perturbation into an explicit prediction requires substantial downstream computation spanning layers 4-20

Mechanistic characterization based on logit lens analysis showing gradual accuracy rise across layers

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1

Neighborhood — ranked by edge-count

Findings (1)

finding

Logit lens prediction accuracy is near chance at layer 4 (28%) after injection at layer 2 (L2) with α=6
supports
Shows that signal integration into explicit prediction has barely begun immediately after injection
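
As a rough illustration of the logit lens measurement behind this finding, the sketch below decodes each layer's hidden state directly through an output projection and reports top-1 accuracy per layer. It is a minimal sketch on synthetic data, not the paper's code: the function name `logit_lens_accuracy`, the array shapes, the identity unembedding, and the linearly growing signal weight are all assumptions made for illustration, and a real logit lens would usually apply the model's final normalization before unembedding.

```python
import numpy as np

def logit_lens_accuracy(hidden_states, unembed, targets):
    """Per-layer top-1 accuracy of decoding intermediate states directly.

    hidden_states: (n_layers, n_examples, d_model) residual-stream activations
    unembed:       (d_model, vocab_size) output projection (hypothetical name)
    targets:       (n_examples,) index of the correct answer token
    """
    logits = hidden_states @ unembed          # (n_layers, n_examples, vocab_size)
    predictions = logits.argmax(axis=-1)      # (n_layers, n_examples)
    return (predictions == targets[None, :]).mean(axis=1)

# Toy data, NOT results from the paper: the target signal is mixed into
# noise with a weight that grows linearly with depth.
rng = np.random.default_rng(0)
n_layers, n_examples, vocab_size = 5, 200, 4
targets = rng.integers(0, vocab_size, size=n_examples)
one_hot = np.eye(vocab_size)[targets]                     # (n_examples, vocab_size)
noise = rng.normal(size=(n_layers, n_examples, vocab_size))
weight = np.linspace(0.0, 1.0, n_layers)[:, None, None]   # 0 at first layer, 1 at last
hidden_states = (1.0 - weight) * noise + weight * one_hot

# Identity unembedding keeps the toy example self-contained (d_model == vocab_size).
accuracy = logit_lens_accuracy(hidden_states, np.eye(vocab_size), targets)
print(accuracy)  # one accuracy value per layer
```

In this toy setup the first layer is pure noise and the last layer is the pure target signal, so accuracy rises with depth by construction; the point is only to show what "accuracy per layer" means operationally.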

Claims (1)

claim

Late-layer injection fails both because there is insufficient computational depth for integration and because residual recovery dynamics attenuate the perturbation before it influences output logits
supports
Mechanistic account explaining why late-layer introspection fails, combining two independent explanatory factors
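
The residual recovery dynamics named in this claim can be tracked with two per-layer quantities: the cosine similarity between perturbed and baseline residual streams, and the projection of their difference onto the injection direction. The sketch below computes both on synthetic vectors. The function name `recovery_metrics`, the shapes, and the halving decay are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def recovery_metrics(baseline, perturbed, direction):
    """Per-layer cosine similarity and projection onto the injection direction.

    baseline, perturbed: (n_layers, d_model) residual-stream vectors
    direction:           (d_model,) injected steering direction
    """
    unit = direction / np.linalg.norm(direction)
    dot = np.sum(baseline * perturbed, axis=-1)
    norms = np.linalg.norm(baseline, axis=-1) * np.linalg.norm(perturbed, axis=-1)
    cosine = dot / norms
    # Component of the perturbation (perturbed minus baseline) along the direction.
    projection = (perturbed - baseline) @ unit
    return cosine, projection

# Toy data, NOT results from the paper: a perturbation of size 6 along the
# first axis that is assumed to halve at each subsequent layer.
n_layers, d_model = 3, 4
direction = np.zeros(d_model)
direction[0] = 1.0
baseline = np.ones((n_layers, d_model))
strength = 6.0 * 0.5 ** np.arange(n_layers)               # 6.0, 3.0, 1.5
perturbed = baseline + strength[:, None] * direction[None, :]

cosine, projection = recovery_metrics(baseline, perturbed, direction)
print(cosine)      # rises toward 1.0 as the perturbation shrinks
print(projection)  # equals the assumed per-layer strength
```

If the projection shrinks geometrically, as assumed here, a perturbation injected late has few layers left in which to influence the output before it is attenuated, which is the intuition the claim combines with the depth argument.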

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

We hypothesize earlier-layer interventions allow more downstream computation to process and potentially correct the perturbation · hypothesis · 0.809
Post-hoc explanation for why steering at layer 33 rather than layer 50 produced better ESR behavior in Llama-3.3-70B
Cosine similarity between perturbed and baseline residual streams returns toward 1.0 and projection onto injection direction decays exponentially over subsequent layers · finding · 0.755
Mechanistic evidence that network actively attenuates injected perturbations, explaining late-layer introspection failure
Layer 27 (last layer) has largest projection magnitude on the reflection direction among all attention head layers in DeepSeek-R1-Qwen-1.5B · finding · 0.739
Attribution finding suggesting the last layer directly controls reflection keyword generation
2D projections of activations show clearly separable clusters for F0-F2 and A1 at layer 25, but increasingly entangled activations for F4-F5 and A2-A3. · finding · 0.734
Visual geometric evidence for the fundamental entanglement of true/false activations in harder tasks.
Internal states significantly predicted motion of external subsystems; best prediction for the farthest subsystem (magenta circle, Fig 4d). · finding · 0.731
Result of canonical variates analysis showing statistical dependency between internal states and external motion.
The likelihood of a dedicated feature for a concept (element, city, animal, food) follows a sigmoid in log-frequency of the concept in training data, with threshold frequency inversely proportional to number of alive features. · finding · 0.731
Quantitative relationship between concept frequency and feature presence.
Hypothesis 1 (Threshold Behavior): There exists a task-dependent threshold Sc such that performance exhibits sharp changes as S crosses Sc, with value and transition width depending on model, layer, and pooling · hypothesis · 0.731
Core testable hypothesis of UCCT about the nature of performance transitions under anchoring
The middle layer residual stream features are causally implicated in multi-step reasoning. · claim · 0.731
Features for Kobe Bryant, California, Lakers participate in computing the capital answer.