finding
active
finding:random-vectors-at-injection-strength-8-elicit-introspective-awareness-in-9-out-of-100-trials
Random vectors at injection strength 8 elicit introspective awareness in 9 out of 100 trials
Random vectors are less effective than concept vectors and, even at injection strength 8, elicit introspective awareness at a lower rate (9 of 100 trials).
Source paper
extracted_from · Lindsey, Jack (2026)
Neighborhood — ranked by edge-count
Claims (1)
claim
- Modern language models possess at least a limited, functional form of introspective awareness · supports · The paper's central interpretive assertion.
Communities (3)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Empirical investigation of how LMs access and report internal states across layers, using concept injection and thought detection on Claude models.
- Studying how concept injection and random vectors trigger self-reflective capabilities in LLMs across varying strength parameters.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Key quantitative characterization of the layer-dependence of partial introspection
- Spearman ρ measuring rank-order agreement between logit-based self-report and probe score; the paper's primary monotonic association metric
- Pearson-Vogel et al.: accurate self-description prompts increase introspective detection from 0.3% to 39.9% · finding · 0.779 · Cited to mechanistically support why the contemplative prompt changes what post-training-shaped final layers allow through
- Base pretrained models show high false positive rates and achieve no net task performance on concept injection detection; post-training essential for introspection.
- Critical methodological claim directed at Lindsey 2026 and similar work using binary detection
- Finding that base models have high false positives and no net positive performance.
- Central empirical claim of the paper supported by statistical tests
- Opus 4.1 is most effective at recognizing injected abstract concepts (e.g., justice, peace) but detects other categories too.