Concept injection detection in language models

Studies how models distinguish artificially injected concepts from natural text inputs, examining metacognitive recognition and downstream processing mechanisms.

4 members.

Drawn from 1 source

The papers/notes whose extracted claims & findings make up this cluster.

Emergent Introspective Awareness in Large Language Models (4 members)

Bridges (3)

Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.

LLM introspective awareness of injected concepts (4 shared)
Mechanistic interpretability & model evaluation (4 shared)
Internal model certainty and reasoning transparency (4 shared)

Claims (3)

Concept injection places models in an unnatural experimental setting: The experimental protocol differs from training and deployment contexts. A causal link is established, but it is unclear how the results translate to natural conditions.
Priming provided by the injected thought prompt heightens the model's ability to detect concept injection: Observed with alternative prompts, where detection is weaker without this setup.
The detection of an injected concept requires an extra step of internal processing downstream of metacognitive recognition: The model must register an anomaly before reporting it.

Findings (1)

Distinguishing Injected Concepts from Text Inputs: Models maintain the ability to accurately transcribe input text while simultaneously reporting on injected thoughts. All models perform above chance, with Opus 4 and Opus 4.1 performing best.