finding

active

finding:strength-comparison-accuracy-averages-47-at-layers-15-30-indistinguishable-from-50-chance

Strength comparison accuracy averages 47% at layers 15-30, indistinguishable from 50% chance

Shows collapse of introspective capability at later layers in the strength comparison task
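
What "indistinguishable from 50% chance" means in practice can be illustrated with an exact two-sided binomial test. The sketch below is not the paper's analysis: this page does not state the number of trials, so `HYPOTHETICAL_TRIALS = 100` is an assumption made only for illustration, and `binomial_two_sided_p` is a hypothetical helper name. The 73% comparison value is the layer-3 accuracy listed among the related findings on this page.

```python
from math import comb


def binomial_two_sided_p(correct: int, trials: int, chance: float = 0.5) -> float:
    """Exact two-sided binomial p-value for `correct` successes against `chance`."""

    def pmf(k: int) -> float:
        return comb(trials, k) * chance**k * (1 - chance) ** (trials - k)

    observed = pmf(correct)
    # Sum the probability of every outcome at least as unlikely as the observed one.
    # The small tolerance guards against floating-point ties.
    total = sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)


# ASSUMPTION: the trial count is not given on this page; 100 is illustrative only.
HYPOTHETICAL_TRIALS = 100

p_late = binomial_two_sided_p(47, HYPOTHETICAL_TRIALS)   # 47% accuracy, layers 15-30
p_early = binomial_two_sided_p(73, HYPOTHETICAL_TRIALS)  # 73% accuracy, layer 3

print(f"late layers:  p = {p_late:.3f}")
print(f"early layer:  p = {p_early:.2e}")
```

Under this assumed trial count, 47% would not be rejected as chance-level performance while 73% would; the actual conclusion depends on the trial counts reported in the paper.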

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Jain, Lavik · Kaplan, Josh +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Introspective capabilities are confined to early-layer injections (L0-L5) and collapse to chance thereafter
supports
Key quantitative characterization of the layer-dependence of partial introspection

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Strength comparison accuracy reaches 73% at layer 3 for injection pair (2,6) vs. 50% chance · finding · 0.894
Secondary positive result for strength comparison showing graded sensitivity to perturbation magnitude
Correlation between layer-wise S scores and task accuracy: ρ = -0.73, p < 0.001 · finding · 0.783
Shows S predicts anchoring effectiveness.
Correlation r=0.999 between detection-adjusted logit difference and control logit increase across all 40 layer-strength configurations · finding · 0.771
Key quantitative evidence that detection signal is identical to global logit shift confound
Strength comparison pair (3,7) with |Δα|=4 outperforms pair (3,5) with |Δα|=2, indicating graded sensitivity to perturbation magnitude · finding · 0.764
Shows that introspective accuracy scales with injection strength difference, not binary detection
Correlation between layer-wise scores and task accuracy ρ = −0.73 (p < 0.001) on LLaMA · finding · 0.763
Core E3 finding validating S as a predictor of anchoring effectiveness
Binary detection adjusted accuracy reaches 97.3% at layer 0 with α=5 before baseline control is applied · finding · 0.755
The misleadingly high result that prior paradigm would report as evidence of introspection
Experimental condition adjective embeddings show mean cosine similarity 0.657 (n=9,591 pairs), significantly higher than history (0.628, t=15.8, p=1.4×10⁻⁵⁵), conceptual (0.587, t=38.5, p<10⁻³⁰⁰), and zero-shot (0.603, t=35.1, p=4.3×10⁻²⁶²) · finding · 0.737
Core result of Experiment 3: cross-model semantic convergence under self-referential processing
Within each difficulty category, correctness rate is not correlated with reflection rate, suggesting reflection may be redundant · claim · 0.736
Per-category analysis showing reflection rate does not help within difficulty class