finding

active

finding:all-models-exhibit-above-baseline-representation-of-the-think-word-when-instructed-to-think-about-it

All models exhibit above-baseline representation of the think word when instructed to think about it

In the intentional control experiment, all tested models show above-zero cosine similarity between their internal activations and the think word's concept vector when instructed to think about that word.
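The measurement behind this finding is a cosine similarity between an activation vector and a concept vector, compared against a baseline of zero. A minimal sketch of that comparison follows; the vectors are made-up toy values, and the names `concept_vector`, `activation_think`, and `activation_unrelated` are illustrative assumptions, not data or code from the paper.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy values for illustration only (hypothetical, not taken from the paper).
concept_vector = [1.0, 0.0, 0.0]        # stands in for the think word's concept vector
activation_think = [3.0, 4.0, 0.0]      # activation partly aligned with the concept
activation_unrelated = [0.0, 1.0, 0.0]  # activation orthogonal to the concept

sim_think = cosine_similarity(activation_think, concept_vector)
sim_unrelated = cosine_similarity(activation_unrelated, concept_vector)

# A positive value means the activation has a component along the concept direction.
print(f"think: {sim_think:.3f}, unrelated: {sim_unrelated:.3f}")
```

In this toy setup, a value above zero is read as the concept being represented to some degree, while an orthogonal activation scores exactly zero.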

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Neighborhood — ranked by edge-count

Claims (1)

claim

Modern language models possess at least a limited, functional form of introspective awareness
supports
The paper's central interpretive assertion.

Communities (3)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
LLM introspective awareness of injected concepts
members_of
Probing Claude and other models for internal detection of artificially injected thoughts across layers.
Latent capacity, representation, and internal models
members_of
Studies of how neural systems (biological and AI) encode implicit environmental models and adaptive capacities that may be gated or hidden from observable behavior.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

In Opus 4.1, representation of the think word decays to baseline by the final layer, unlike Claude 3 models where it persists · finding · 0.819
Suggests that later models can keep the thought 'silent' rather than letting it influence output.
Earlier/less capable models exhibit a larger gap between think and don't think representation strength · finding · 0.818
Claude 3 models show a bigger difference than newer models like Opus 4.1.
Aside from basic detection and identification, other details of the model's response about injected thoughts may be confabulated · claim · 0.803
Acknowledges that the model's additional descriptions of its experience are unverified.
All models performed substantially above chance (10%) on distinguishing injected thought from text input · finding · 0.801
All tested models could both identify the injected concept and transcribe the input sentence well above random.
With unrestricted vocabulary, models occasionally respond in non-English Yes/No equivalents (e.g., Sí, Nein) after truth-direction interventions · finding · 0.793
Suggestive evidence for language-independent truth representation in LLMs.
In Opus 4.1, the think word representation decays to baseline in the final layer because the strong next-token prediction drowns out other representations · hypothesis · 0.792
Explanation for the 'silent' thought phenomenon.
The examples of features found in language models suggest they are highly sparse variables, consistent with dictionary learning being applicable · hypothesis · 0.783
Motivation for using sparsity-based dictionary learning on language models.
Models might produce first-person experiential language by drawing on human-authored self-descriptions in pretraining data without internally encoding these acts as roleplay · hypothesis · 0.782
Alternative hypothesis for how experience reports arise without explicit performance.
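
The selection rule for this list (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is an illustrative sketch only: the entity identifiers, the `typed_edges` set, and the function name `related_by_similarity` are hypothetical and do not describe how the graph is actually built.

```python
SIMILARITY_THRESHOLD = 0.65  # mirrors the "cosine ≥ 0.65" rule stated for this list


def related_by_similarity(candidates, typed_edges, threshold=SIMILARITY_THRESHOLD):
    """Keep candidates at or above the threshold that have no typed edge yet.

    candidates:  list of (entity_id, cosine_score) pairs
    typed_edges: set of entity_ids already linked by a typed relation
    """
    kept = [
        (entity_id, score)
        for entity_id, score in candidates
        if score >= threshold and entity_id not in typed_edges
    ]
    # Highest similarity first, matching the order of the list above.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical identifiers; the first two scores echo values shown in the list.
candidates = [
    ("finding-a", 0.819),
    ("claim-b", 0.803),
    ("finding-c", 0.640),  # below the threshold, so dropped
    ("claim-d", 0.900),    # already has a typed edge, so dropped
]
typed_edges = {"claim-d"}

result = related_by_similarity(candidates, typed_edges)
print(result)
```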