Thought injection detection in language models

Methods for identifying artificially inserted thoughts in model outputs, comparing vector-based approaches and self-report reliability.

4 members.

Drawn from 1 source

The papers/notes whose extracted claims & findings make up this cluster.

Production models show zero false positives on thought injection detection
Opus 4.1 never claimed to detect an injected thought when none was applied (0/100 trials); production Claude models maintain an essentially zero false-positive rate.

Random and negated vectors less effective than concept vectors
Random vectors require a larger norm to trigger detection (8 vs. 2) and elicit awareness at lower rates (9/100); negated vectors are comparably effective, but the model's identification is confabulated.

Self-report of Injected Thoughts
At the optimal layer and injection strength, Opus 4.1 detects and identifies injected concept vectors about 20% of the time; the immediacy of the detection suggests it arises internally rather than being inferred from the model's own output.

Model responses beyond core detection may be confabulated
Characterizations of injected concepts (e.g., 'overly intense,' 'unnatural') likely represent embellishments not grounded in internal state; only detection and basic identification are verifiable.
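
The trial counts quoted in the claims above (0/100 false positives, 9/100 detections for random vectors) are small samples, so it may help to see how much uncertainty they carry. The following is a minimal sketch using standard binomial formulas; it is not taken from the source, and the function names `zero_event_upper_bound` and `wilson_interval` are hypothetical, chosen only for illustration.

```python
import math


def zero_event_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on a rate when 0 events were observed in n_trials.

    Solves (1 - p) ** n_trials = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)


def wilson_interval(successes: int, n_trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    p_hat = successes / n_trials
    denom = 1.0 + z * z / n_trials
    center = (p_hat + z * z / (2.0 * n_trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1.0 - p_hat) / n_trials + z * z / (4.0 * n_trials * n_trials)
    ) / denom
    return center - half_width, center + half_width


# Counts as reported in the claims above (illustrative use only).
false_positive_bound = zero_event_upper_bound(100)   # 0/100 false positives
random_vector_low, random_vector_high = wilson_interval(9, 100)  # 9/100 detections

print(f"0/100 false positives: true rate below {false_positive_bound:.3f} at 95% confidence")
print(f"9/100 random-vector detections: 95% interval {random_vector_low:.3f} to {random_vector_high:.3f}")
```

Under these assumptions, 0 false positives in 100 trials is consistent with a true rate of up to roughly 3%, which is one way to read "essentially zero" rather than exactly zero.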