finding
active
All models performed substantially above chance (10%) on distinguishing injected thought from text input
All tested models could both identify the injected concept and transcribe the input sentence at rates well above chance.
Source paper
extracted_from · Lindsey, Jack (2026)
Neighborhood — ranked by edge-count
Claims (1)
claim
- Modern language models possess at least a limited, functional form of introspective awareness · supports · The paper's central interpretive assertion.
Communities (3)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Probing Claude and other models for internal detection of artificially injected thoughts across layers.
- Probing early detection of model confidence during chain-of-thought reasoning to optimize inference efficiency and identify confabulation patterns.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Acknowledges that the model's additional descriptions of its experience are unverified.
- Observation from alternative prompts that detection is weaker without setup.
- Speculation about the mechanistic basis of the distinguishing thoughts from text experiment.
- Opus 4.1 never claims to detect an injected thought when none is applied (0/100 trials); production Claude models maintain an essentially zero false-positive rate.
- All models exhibit above-baseline representation of the think word when instructed to think about it · finding · 0.801 · In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.
- Earlier/less capable models exhibit a larger gap between think and don't think representation strength · finding · 0.775 · Claude 3 models show a bigger difference than newer models like Opus 4.1.
- The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions · hypothesis · 0.769 · Hypothesis explaining the negative correlation between reflection rate and accuracy without implying reflection is harmful.
- Models produce first-attempt mean scores 87.8-91.8/100 without steering across all model families · finding · 0.769 · Establishes high baseline quality, confirming steering-induced degradation is the experimental signal.