community
active
leiden_hybrid_concepts
label: haiku
community:leiden_hybrid_concepts-run4-c0-c3
Internal model certainty and reasoning transparency
Probing early detection of model confidence during chain-of-thought reasoning to optimize inference efficiency and identify confabulation patterns.
14 members.
Sub-communities (3)
Finer clusters this community splits into. Each is its own community page.
Drawn from 4 sources
The papers/notes whose extracted claims & findings make up this cluster.
- Emergent Introspective Awareness in Large Language Models (12 members)
- boppana-goodfire-reasoning-theater-2026.md (7 members)
- 2026-05-15_manifold-overlap-papers-economy-strategy.md (1 member)
- CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence (1 member)
Bridges (7)
Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.
- Mechanistic interpretability & model evaluation (21 shared)
- LLM introspective awareness of injected concepts (12 shared)
- Chain-of-thought reasoning versus internal model cognition (4 shared)
- Concept injection detection in language models (4 shared)
- Thought injection detection in language models (4 shared)
- Performative vs. generative chain-of-thought (2 shared)
- Intentional control of mental representations (1 shared)
Claims (8)
- Aside from basic detection and identification, other details of the model's response about injected thoughts may be confabulated. Acknowledges that the model's additional descriptions of its experience are unverified.
- Concept injection places models in an unnatural experimental setting. The experimental protocol differs from training and deployment contexts; a causal link is established, but it is unclear how the results translate to natural conditions.
- Intentional control of internal representations likely piggybacks on existing mechanisms for talking about a topic. Mechanism speculation for the intentional control experiment.
- Model responses beyond core detection may be confabulated. Characterizations of injected concepts (e.g., 'overly intense,' 'unnatural') likely represent embellishments not grounded in internal state; only detection and basic identification are verifiable.
- Priming provided by the injected thought prompt heightens the model's ability to detect concept injection. Observation from alternative prompts: detection is weaker without the setup.
- The ability to distinguish injected thoughts from text likely relies on different attention heads invoked by different prompt parts. Speculation about the mechanistic basis of the experiment on distinguishing thoughts from text.
- The detection of an injected concept requires an extra step of internal processing downstream of metacognitive recognition. The model must register an anomaly before reporting it.
- Performative chain-of-thought is real; verbalized output does not equal internal state.
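Several claims above refer to concept injection. As a rough illustration only, the sketch below treats injection as adding a scaled direction to a layer's activations. The function name `inject_concept`, the toy array shapes, and the unit-normalization step are assumptions made for this example and are not taken from the source papers; the scales 2 and 8 echo the "8 vs 2" figure listed under Findings.

```python
import numpy as np

def inject_concept(activations, concept_vector, strength):
    """Add a scaled, unit-norm direction to every token position (hypothetical helper)."""
    direction = concept_vector / np.linalg.norm(concept_vector)
    return activations + strength * direction

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 16))   # toy (tokens, hidden_dim) activations, not a real model
concept = rng.normal(size=16)       # stand-in for an extracted concept vector

injected = inject_concept(hidden, concept, strength=2.0)        # concept vector
negated = inject_concept(hidden, -concept, strength=2.0)        # negated control
random_control = inject_concept(hidden, rng.normal(size=16), strength=8.0)  # random control
```

In this toy setup the three variants differ only in direction and scale, which is the contrast the random-vector and negated-vector controls rely on.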
Findings (6)
- All models performed substantially above chance (10%) on distinguishing injected thought from text input. All tested models could both identify the injected concept and transcribe the input sentence at rates well above random.
- Chain-of-thought reasoning improves large model accuracy on HHH binary comparisons, reaching ~78% for the 52B model, competitive with a human-feedback preference model (PM). Figure 4 shows that CoT improves over zero-shot and that ensembled CoT further boosts accuracy.
- Distinguishing Injected Concepts from Text Inputs. Models maintain the ability to accurately transcribe input text while simultaneously reporting on injected thoughts; all models perform above chance, with Opus 4 and 4.1 performing best.
- Production models show zero false positives on thought injection detection. Opus 4.1 never claims to detect an injected thought when none is applied (0/100 trials); production Claude models maintain an essentially zero false positive rate.
- Random and negated vectors less effective than concept vectors. Random vectors require a larger norm to trigger detection (8 vs. 2) and elicit awareness at lower rates (9/100); negated vectors are comparably effective, but the model's identification of the concept is confabulated.
- Self-report of Injected Thoughts. Models can detect and identify injected concept vectors ~20% of the time at the optimal layer and strength in Opus 4.1; the immediacy of detection suggests it is internal rather than inferred from the model's own output.
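The counts in these findings can be read with some basic binomial arithmetic. The sketch below is a minimal illustration using only the Python standard library. The 0/100 figure comes from the false-positive finding above; the 20-out-of-100 count tested against the 10% chance level is a hypothetical input chosen for illustration, since this page does not list per-model counts. The helper names are also assumptions.

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def zero_event_upper_bound(n, confidence=0.95):
    """Exact one-sided upper bound on a rate after observing 0 events in n trials."""
    return 1 - (1 - confidence) ** (1 / n)

# 0 false positives in 100 trials: the true rate could still be as high as about 3%.
fp_upper = zero_event_upper_bound(100)

# Hypothetical count: 20 correct out of 100 against a 10% chance baseline.
p_value = binom_tail(20, 100, 0.10)
```

The upper bound of roughly 3% is why "essentially zero" is the careful wording for a 0/100 result: the sample rules out a high false positive rate, but not a small nonzero one.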