claim
active
claim:priming-provided-by-the-injected-thought-prompt-heightens-the-model-s-ability-to-detect-concept-injectionPriming provided by the injected thought prompt heightens the model's ability to detect concept injection
Observation from alternative prompts that detection is weaker without setup.
Source paper
extracted_from(2026) · Lindsey, Jack
Neighborhood — ranked by edge-count
Communities (4)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Probing Claude and other models for internal detection of artificially injected thoughts across layers.
- Probing early detection of model confidence during chain-of-thought reasoning to optimize inference efficiency and identify confabulation patterns.
- Studies how models distinguish artificially injected concepts from natural text inputs, examining metacognitive recognition and downstream processing mechanisms.
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- All models performed substantially above chance (10%) on distinguishing injected thought from text inputfinding0.816All tested models could both identify the injected concept and transcribe the input sentence well above random.
- Acknowledges that the model's additional descriptions of its experience are unverified.
- Speculation about the mechanistic basis of the distinguishing thoughts from text experiment.
- Controls ruling out semantic association as explanation for experimental results
- Experimental protocol differs from training/deployment contexts; causal link established but unclear how results translate to natural conditions.
- The model must register an anomaly before reporting it.
- Opus 4.1 never claims to detect injected thought when none applied (0/100 trials); production Claude models maintain essentially zero false positive rate.
- Mechanism for how the model modulates representation strength.