finding
active
finding:all-models-exhibit-above-baseline-representation-of-the-think-word-when-instructed-to-think-about-it
All models exhibit above-baseline representation of the think word when instructed to think about it
In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.
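As a rough illustration of the measurement named above, the following minimal sketch computes cosine similarity between an activation vector and a concept vector. The function name and the toy vectors are hypothetical and chosen only for illustration; they are not taken from the source paper, which may extract and compare representations differently.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy data: a 3-dimensional "concept vector" and an activation
# that points partly in the same direction. Real activations would have
# thousands of dimensions.
concept_vector = [1.0, 0.0, 0.0]
activation = [2.0, 1.0, 0.0]

similarity = cosine_similarity(activation, concept_vector)
print(similarity > 0.0)  # True when the activation aligns with the concept
```

A value above zero means the activation has a positive component along the concept direction; a value of zero means the two vectors are orthogonal.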
Source paper
extracted_from · Lindsey, Jack (2026)
Neighborhood — ranked by edge-count
Claims (1)
claim
- Modern language models possess at least a limited, functional form of introspective awareness · supports · The paper's central interpretive assertion.
Communities (3)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Probing Claude and other models for internal detection of artificially injected thoughts across layers.
- Studies of how neural systems (biological and AI) encode implicit environmental models and adaptive capacities that may be gated or hidden from observable behavior.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Suggests that later models can keep the thought 'silent' rather than letting it influence output.
- Earlier/less capable models exhibit a larger gap between think and don't think representation strength · finding · 0.818 · Claude 3 models show a bigger difference than newer models like Opus 4.1.
- Acknowledges that the model's additional descriptions of its experience are unverified.
- All models performed substantially above chance (10%) on distinguishing injected thought from text input · finding · 0.801 · All tested models could both identify the injected concept and transcribe the input sentence well above random.
- Suggestive evidence for language-independent truth representation in LLMs
- Explanation for the 'silent' thought phenomenon.
- Motivation for using sparsity-based dictionary learning on language models
- Alternative hypothesis for how experience reports arise without explicit performance