claim

active

claim:the-ability-to-distinguish-injected-thoughts-from-text-likely-relies-on-different-attention-heads-invoked-by-different-prompt-parts

The ability to distinguish injected thoughts from text likely relies on different attention heads invoked by different prompt parts

Speculation about the mechanistic basis of the "distinguishing thoughts from text" experiment.

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Neighborhood — ranked by edge-count

Communities (3)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
LLM introspective awareness of injected concepts
members_of
Probing Claude and other models for internal detection of artificially injected thoughts across layers.
Internal model certainty and reasoning transparency
members_of
Probing early detection of model confidence during chain-of-thought reasoning to optimize inference efficiency and identify confabulation patterns.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Aside from basic detection and identification, other details of the model's response about injected thoughts may be confabulated · claim · 0.826
Acknowledges that the model's additional descriptions of its experience are unverified.
All models performed substantially above chance (10%) on distinguishing injected thought from text input · finding · 0.808
All tested models could both identify the injected concept and transcribe the input sentence well above random.
Priming provided by the injected thought prompt heightens the model's ability to detect concept injection · claim · 0.802
Observation from alternative prompts that detection is weaker without the priming setup.
Some attention heads partially specialize in copying for words that split into two tokens without a space prefix, attending from fragmented token to complete token · finding · 0.791
Interesting special case of copying behavior related to tokenization artifacts; primitive precursor to induction heads.
Distinguishing Injected Concepts from Text Inputs · finding · 0.790
Models retain the ability to accurately transcribe input text while simultaneously reporting on injected thoughts; all models perform above chance, with Opus 4 and 4.1 performing best.
Distinguishing thoughts from text task · method · 0.790
Task where the model must simultaneously identify an injected thought and transcribe a text sentence.
all attributions of cognition (i.e., mental actions), including sentience, are always inferred on the basis of embodied behaviours, including verbal self-report in humans. · quote · 0.779
Critical verbatim statement highlighting the universal inference basis of sentience.
The sensitivity to think/don't think instructions may be achieved via a circuit that tags tokens as attention-worthy based on instructions or incentives · hypothesis · 0.778
Mechanism for how the model modulates representation strength.