finding

active

finding:strength-comparison-accuracy-reaches-73-at-layer-3-for-injection-pair-2-6-vs-50-chance

Strength comparison accuracy reaches 73% at layer 3 for injection pair (2,6) vs. 50% chance

Secondary positive result for strength comparison showing graded sensitivity to perturbation magnitude
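
To show what "73% vs. 50% chance" implies statistically, the sketch below computes an exact one-sided binomial tail probability. This record does not state the number of trials, so `hypothetical_trials` is an illustrative assumption and not a value from the paper; the function name `binomial_tail` is also ours.

```python
from math import comb


def binomial_tail(successes, trials, chance=0.5):
    """Return P(X >= successes) for X ~ Binomial(trials, chance)."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# ASSUMPTION: the trial count is not given in this record; 100 is a placeholder.
hypothetical_trials = 100
observed_correct = round(0.73 * hypothetical_trials)

p_value = binomial_tail(observed_correct, hypothetical_trials)
print(f"{observed_correct}/{hypothetical_trials} correct, one-sided p = {p_value:.2e}")
```

The resulting p-value depends entirely on the assumed trial count, so it should be recomputed with the paper's actual number of trials before drawing any conclusion.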

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1

Neighborhood — ranked by edge-count

Claims (1)

claim

LLMs can compute meaningful functions over perturbations to their internal states, establishing introspection as a real but layer-dependent phenomenon
supports
Primary positive claim of the paper, grounded in strength comparison and localization results

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Strength comparison accuracy averages 47% at layers 15-30, indistinguishable from 50% chance · finding · 0.894
Shows collapse of introspective capability at later layers in the strength comparison task
Strength comparison pair (3,7) with |Δα|=4 outperforms pair (3,5) with |Δα|=2, indicating graded sensitivity to perturbation magnitude · finding · 0.801
Shows that introspective accuracy scales with injection strength difference, not binary detection
All 32 attention heads at layer 3 achieve 100% localization accuracy for injections at layer 2 (5-way classification, 20% chance) · finding · 0.783
Striking mechanistic finding that injection creates universally detectable perturbation in residual stream immediately downstream
Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at optimal layer and injection strength 2 · finding · 0.767
In the injected thoughts experiment, Opus 4.1 succeeds about 20% of the time.
Correlation between layer-wise scores and task accuracy ρ = −0.73 (p < 0.001) on LLaMA · finding · 0.765
Core E3 finding validating S as a predictor of anchoring effectiveness
Experimental condition adjective embeddings show mean cosine similarity 0.657 (n=9,591 pairs), significantly higher than history (0.628, t=15.8, p=1.4×10⁻⁵⁵), conceptual (0.587, t=38.5, p<10⁻³⁰⁰), and zero-shot (0.603, t=35.1, p=4.3×10⁻²⁶²) · finding · 0.763
Core result of Experiment 3: cross-model semantic convergence under self-referential processing
Logit lens prediction accuracy is near-chance at layer 4 (28%) after injection at L2, α=6 · finding · 0.761
Shows that signal integration into explicit prediction has barely begun immediately after injection
Correlation between layer-wise S scores and task accuracy: ρ = −0.73, p < 0.001 · finding · 0.760
Shows S predicts anchoring effectiveness.
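
The similarity neighborhood above (cosine ≥ 0.65, no typed edge) can be reproduced in principle with a filter like the one sketched here. This is a minimal illustration, not the system's actual implementation: the embedding vectors, the edge representation, and the function names `cosine` and `similar_untyped` are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(target_id, embeddings, typed_edges, threshold=0.65):
    """List (entity_id, score) pairs at or above the threshold that have no
    typed edge to the target, sorted by descending similarity."""
    target = embeddings[target_id]
    results = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue
        # Skip entities already linked by a typed edge in either direction.
        if (target_id, entity_id) in typed_edges or (entity_id, target_id) in typed_edges:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            results.append((entity_id, score))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


# Toy data for illustration only; real embeddings would be high-dimensional.
embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.2]}
typed_edges = {("a", "d")}
print(similar_untyped("a", embeddings, typed_edges))
```

Entities that pass this filter are the candidates for new typed edges or duplicate merging, as with the two near-identical ρ = −0.73 findings listed above.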