claim
active
claim:the-ability-to-distinguish-injected-thoughts-from-text-likely-relies-on-different-attention-heads-invoked-by-different-prompt-parts
The ability to distinguish injected thoughts from text likely relies on different attention heads invoked by different prompt parts
Speculation about the mechanistic basis of the experiment on distinguishing injected thoughts from text.
Source paper
extracted_from · Lindsey, Jack (2026)
Neighborhood — ranked by edge-count
Communities (3)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Probing Claude and other models for internal detection of artificially injected thoughts across layers.
- Probing early detection of model confidence during chain-of-thought reasoning to optimize inference efficiency and identify confabulation patterns.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one. These are candidates for new edges or unrecognized duplicates.
- Acknowledges that the model's additional descriptions of its experience are unverified.
- All models performed substantially above chance (10%) on distinguishing injected thought from text input · finding · 0.808 · All tested models could both identify the injected concept and transcribe the input sentence well above random.
- Observation from alternative prompts that detection is weaker without setup.
- Interesting special case of copying behavior related to tokenization artifacts; primitive precursor to induction heads
- Models retain the ability to transcribe input text accurately while simultaneously reporting on injected thoughts; all models perform above chance, with Opus 4/4.1 performing best.
- Task where the model must simultaneously identify an injected thought and transcribe a text sentence.
- Critical verbatim statement highlighting the universal inference basis of sentience.
- Mechanism for how the model modulates representation strength.