finding

active

finding:random-vectors-at-injection-strength-8-elicit-introspective-awareness-in-9-out-of-100-trials

Random vectors at injection strength 8 elicit introspective awareness in 9 out of 100 trials

Random vectors are less effective than concept vectors at triggering introspection: even at injection strength 8, they elicit introspective awareness in only 9 of 100 trials.
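The 9-in-100 rate is a point estimate from a modest number of trials, so it helps to see how much uncertainty surrounds it. The sketch below computes a Wilson score interval for a binomial proportion. The interval method, the 95% level (z = 1.96), and the function name `wilson_interval` are assumptions made for illustration; nothing here is reported in the source paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    Illustrative helper (hypothetical name); z = 1.96 assumes a ~95% level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    # Center of the interval is pulled slightly toward 0.5.
    center = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = (
        z
        * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2))
        / denom
    )
    return center - half_width, center + half_width


# The finding's counts: 9 detections in 100 random-vector trials.
low, high = wilson_interval(9, 100)
print(f"rate = {9 / 100:.2f}, approx. 95% interval = [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval spans roughly 5% to 16%, which is consistent with reading the result as a low but nonzero detection rate.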

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Neighborhood — ranked by edge-count

Claims (1)

claim

Modern language models possess at least a limited, functional form of introspective awareness
supports
The paper's central interpretive assertion.

Communities (3)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Mechanistic introspection in language models
members_of
Empirical investigation of how LMs access and report internal states across layers, using concept injection and thought detection on Claude models.
Introspective awareness activation in language models
members_of
Studying how concept injection and random vectors trigger self-reflective capabilities in LLMs across varying strength parameters.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Introspective capabilities are confined to early-layer injections (L0-L5) and collapse to chance thereafter · claim · 0.794
Key quantitative characterization of the layer-dependence of partial introspection
Introspective strength · concept · 0.780
Spearman ρ measuring rank-order agreement between logit-based self-report and probe score; the paper's primary monotonic association metric
Pearson-Vogel et al.: accurate self-description prompts increase introspective detection from 0.3% to 39.9% · finding · 0.779
Cited to mechanistically support why the contemplative prompt changes what post-training-shaped final layers allow through
Post-training is key to eliciting introspective awareness · finding · 0.778
Base pretrained models show high false positive rates and achieve no net task performance on concept injection detection; post-training is essential for introspection.
Prior experimental paradigms may overestimate introspective capabilities by conflating genuine self-awareness with uniform output distribution shifts · claim · 0.774
Critical methodological claim directed at Lindsey 2026 and similar work using binary detection
Post-training is key to eliciting strong introspective awareness; base pretrained models do not show above-chance detection · claim · 0.770
Finding that base models have high false positives and no net positive performance.
Introspective agents generally outperform standard no-pain baseline agents across environments and reward categories · claim · 0.770
Central empirical claim of the paper supported by statistical tests
Abstract nouns elicit the highest introspective awareness rates; all concept categories show nonzero detection · finding · 0.768
Opus 4.1 is most effective at recognizing injected abstract concepts (e.g., justice, peace) but detects other categories too.
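
The "Related by similarity" listing above applies two conditions: cosine similarity of at least 0.65 and no typed edge to this entity. A minimal sketch of that selection rule follows. The function names, the dictionary-of-vectors layout, and the toy embeddings are all hypothetical, since the page does not say how entities are embedded or stored.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs, highest score first.

    embeddings:  dict of entity id -> vector (assumed storage layout)
    typed_edges: set of frozenset({id_a, id_b}) for pairs that already
                 share a typed relation such as supports or members_of
    """
    target = embeddings[target_id]
    results = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already linked by a typed edge
        score = cosine(target, vector)
        if score >= threshold:
            results.append((entity_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for illustration only.
toy_embeddings = {
    "finding": [1.0, 0.0],
    "near": [1.0, 0.1],
    "far": [0.0, 1.0],
    "linked": [1.0, 0.2],
}
toy_edges = {frozenset(("finding", "linked"))}
print(related_by_similarity("finding", toy_embeddings, toy_edges))
```

In the toy data, "linked" is excluded despite its high similarity because it already has a typed edge, and "far" falls below the threshold, so only "near" is returned.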