claim

active

claim:priming-provided-by-the-injected-thought-prompt-heightens-the-model-s-ability-to-detect-concept-injection

Priming provided by the injected thought prompt heightens the model's ability to detect concept injection

Observation from alternative prompts: detection of injected concepts is weaker when the prompt omits the injected-thought setup.

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Neighborhood — ranked by edge-count

Communities (4)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
LLM introspective awareness of injected concepts
members_of
Probing Claude and other models for internal detection of artificially injected thoughts across layers.
Internal model certainty and reasoning transparency
members_of
Probing early detection of model confidence during chain-of-thought reasoning to optimize inference efficiency and identify confabulation patterns.
Concept injection detection in language models
members_of
Studies how models distinguish artificially injected concepts from natural text inputs, examining metacognitive recognition and downstream processing mechanisms.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

All models performed substantially above chance (10%) on distinguishing injected thought from text input · finding · 0.816
All tested models could both identify the injected concept and transcribe the input sentence well above random.
Aside from basic detection and identification, other details of the model's response about injected thoughts may be confabulated · claim · 0.803
Acknowledges that the model's additional descriptions of its experience are unverified.
The ability to distinguish injected thoughts from text likely relies on different attention heads invoked by different prompt parts · claim · 0.802
Speculation about the mechanistic basis of the distinguishing thoughts from text experiment.
Conceptual priming with consciousness ideation is insufficient to produce the effects of self-referential processing, demonstrating the effect is tied to computational regime rather than semantic content · claim · 0.793
Controls ruling out semantic association as explanation for experimental results.
Concept injection places models in unnatural experimental setting · claim · 0.783
Experimental protocol differs from training/deployment contexts; causal link established but unclear how results translate to natural conditions.
The detection of an injected concept requires an extra step of internal processing downstream of metacognitive recognition · claim · 0.779
The model must register an anomaly before reporting it.
Production models show zero false positives on thought injection detection · finding · 0.770
Opus 4.1 never claims to detect injected thought when none applied (0/100 trials); production Claude models maintain essentially zero false positive rate.
The sensitivity to think/don't think instructions may be achieved via a circuit that tags tokens as attention-worthy based on instructions or incentives · hypothesis · 0.762
Mechanism for how the model modulates representation strength.
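
The "Related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the page's actual implementation: the function names, the entity ids, the toy two-dimensional embeddings, and the edge representation are all hypothetical, and the only details taken from the page are the 0.65 threshold and the exclusion of entities that already share a typed edge with this one.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs, highest score first.

    An entity qualifies when its cosine similarity to the target is at or
    above `threshold` and it shares no typed edge with the target.
    `embeddings` maps entity id -> vector; `typed_edges` is a list of
    (id_a, id_b) pairs. Both structures are assumptions for illustration.
    """
    target_vec = embeddings[target_id]
    # Entities already connected to the target by any typed edge.
    linked = {b if a == target_id else a
              for a, b in typed_edges if target_id in (a, b)}
    candidates = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            candidates.append((entity_id, round(score, 3)))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy data: hypothetical ids and 2-d vectors, not real embeddings.
embeddings = {
    "claim:priming": [1.0, 0.0],
    "finding:above-chance": [0.8, 0.6],
    "paper:source": [1.0, 0.1],      # similar, but already linked by a typed edge
    "claim:unrelated": [0.0, 1.0],   # below the threshold
}
typed_edges = [("claim:priming", "paper:source")]

print(related_by_similarity("claim:priming", embeddings, typed_edges))
```

In this toy setup the source paper is skipped despite its high similarity because it is already connected through a typed edge, which mirrors the page's framing of this list as candidates for new edges or unrecognized duplicates.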