community
active
leiden_hybrid_concepts
label: haiku
community:leiden_hybrid_concepts-run4-c0-c3-c1
Concept injection detection in language models
Studies how models distinguish artificially injected concepts from natural text inputs, examining metacognitive recognition and downstream processing mechanisms.
4 members.
Drawn from 1 source
The papers/notes whose extracted claims & findings make up this cluster.
Bridges (3)
Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.
Claims (3)
- Concept injection places models in an unnatural experimental setting
  The experimental protocol differs from training and deployment contexts. A causal link is established, but it is unclear how the results translate to natural conditions.
- Priming provided by the injected thought prompt heightens the model's ability to detect concept injection
  Observed with alternative prompts: detection is weaker without the setup.
- The detection of an injected concept requires an extra step of internal processing downstream of metacognitive recognition
  The model must register an anomaly before reporting it.
Findings (1)
- Distinguishing Injected Concepts from Text Inputs
  Models retain the ability to accurately transcribe input text while simultaneously reporting on injected thoughts. All models perform above chance, with Opus 4/4.1 performing best.