finding
active
finding:sentence-localization-accuracy-reaches-88-at-layer-2-5-vs-10-chance-in-10-way-classification
Sentence localization accuracy reaches 88% at layer 2, α=5 vs. 10% chance in 10-way classification
Highest localization accuracy achieved, showing strong partial introspection for early-layer injections
Source paper
extracted_from · (2025) · Ely Hahami · I. N. Sinha · Jain, Lavik · Kaplan, Josh +1
Neighborhood — ranked by edge-count
Claims (2)
claim
- Primary positive claim of the paper, grounded in strength comparison and localization results
- Key quantitative characterization of the layer-dependence of partial introspection
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Striking mechanistic finding that injection creates universally detectable perturbation in residual stream immediately downstream
- Strength comparison accuracy reaches 73% at layer 3 for injection pair (2,6) vs. 50% chance · finding · 0.753 · Secondary positive result for strength comparison showing graded sensitivity to perturbation magnitude
- Binary detection adjusted accuracy reaches 97.3% at layer 0 with α=5 before baseline control is applied · finding · 0.750 · The misleadingly high result that the prior paradigm would report as evidence of introspection
- Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge
- Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
- Best localist alignment achieves IIA of 0.73 on hierarchical equality Both Equality Relations in Layer 1 · finding · 0.744 · Shows localist alignment fails to capture the distributed structure found by DAS.
- Logit lens prediction accuracy is near-chance at layer 4 (28%) after injection at L2, α=6 · finding · 0.742 · Shows that signal integration into explicit prediction has barely begun immediately after injection
- Novel task asking which of 10 sentences received injection, cycling injection through all positions to average out positional bias
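
The localization task in the last entry, with the injection cycled through every position, can be sketched roughly as below. This is only an illustration of the scoring logic under stated assumptions, not the paper's implementation: `localization_accuracy` and the `predict_injected_index` callback are hypothetical names, and the callback stands in for running the model with an injection at the given position and reading off which sentence it reports.

```python
from typing import Callable, Sequence


def localization_accuracy(
    trials: Sequence[Sequence[str]],
    predict_injected_index: Callable[[Sequence[str], int], int],
) -> float:
    """Fraction of (trial, position) pairs where the guess matches the injection.

    The injection is cycled through every sentence position in every trial,
    so a predictor with a fixed positional preference scores only at chance
    (1/N for N sentences).
    """
    correct = 0
    total = 0
    for sentences in trials:
        for injected in range(len(sentences)):
            # Hypothetical stand-in: run the model with the injection applied
            # at position `injected` and return the index it reports.
            guess = predict_injected_index(sentences, injected)
            correct += int(guess == injected)
            total += 1
    return correct / total


# Toy usage with 10 sentences per trial (10-way classification).
toy_trials = [[f"sentence {i}" for i in range(10)] for _ in range(5)]
always_first = localization_accuracy(toy_trials, lambda s, j: 0)  # positional bias only
oracle = localization_accuracy(toy_trials, lambda s, j: j)        # perfect localization
```

With ten sentences, the position-biased predictor lands at the 10% chance level while a perfect one reaches 100%, which is the scale against which the reported 88% should be read.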