finding

active

finding:all-models-performed-substantially-above-chance-10-on-distinguishing-injected-thought-from-text-input

All models performed substantially above chance (10%) on distinguishing injected thought from text input

All tested models identified the injected concept and transcribed the input sentence at rates well above chance.

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Neighborhood — ranked by edge-count

Claims (1)

claim

Modern language models possess at least a limited, functional form of introspective awareness
supports
The paper's central interpretive assertion.

Communities (3)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
LLM introspective awareness of injected concepts
members_of
Probing Claude and other models for internal detection of artificially injected thoughts across layers.
Internal model certainty and reasoning transparency
members_of
Probing early detection of model confidence during chain-of-thought reasoning to optimize inference efficiency and identify confabulation patterns.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
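As a rough illustration of how such a list could be produced, the sketch below filters entities by cosine similarity at the 0.65 threshold and drops any that already share a typed edge with the target. The function names, the embedding dictionary, and the edge set are all hypothetical; the source does not describe how the graph actually computes these scores.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs with cosine >= threshold and no typed edge
    to `target`, sorted from most to least similar.

    `embeddings` maps entity name -> vector; `typed_edges` is a set of
    (source, destination) name pairs. Both are assumed data structures.
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked to the target in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities returned by a filter like this are only candidates: a high score suggests a missing edge or a duplicate, but each one still needs review before a typed relation is added.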

Aside from basic detection and identification, other details of the model's response about injected thoughts may be confabulated · claim · 0.825
Acknowledges that the model's additional descriptions of its experience are unverified.
Priming provided by the injected thought prompt heightens the model's ability to detect concept injection · claim · 0.816
Observation from alternative prompts that detection is weaker without setup.
The ability to distinguish injected thoughts from text likely relies on different attention heads invoked by different prompt parts · claim · 0.808
Speculation about the mechanistic basis of the distinguishing thoughts from text experiment.
Production models show zero false positives on thought injection detection · finding · 0.805
Opus 4.1 never claims to detect an injected thought when none is applied (0/100 trials); production Claude models maintain an essentially zero false-positive rate.
All models exhibit above-baseline representation of the think word when instructed to think about it · finding · 0.801
In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.
Earlier/less capable models exhibit a larger gap between think and don't think representation strength · finding · 0.775
Claude 3 models show a bigger difference than newer models like Opus 4.1.
The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions · hypothesis · 0.769
Hypothesis explaining the negative correlation between reflection rate and accuracy without implying that reflection is harmful.
Models produce first-attempt mean scores 87.8-91.8/100 without steering across all model families · finding · 0.769
Establishes high baseline quality, confirming that steering-induced degradation is the experimental signal.