Internal model certainty and reasoning transparency

Probing early detection of model confidence during chain-of-thought reasoning to optimize inference efficiency and identify confabulation patterns.

14 members.

Sub-communities (3)

Finer clusters this community splits into. Each is its own community page.

Chain-of-thought reasoning versus internal model cognition (4)
Concept injection detection in language models (4)
Thought injection detection in language models (4)

Drawn from 4 sources

The papers/notes whose extracted claims & findings make up this cluster.

Emergent Introspective Awareness in Large Language Models (12 members)
boppana-goodfire-reasoning-theater-2026.md (7 members)
2026-05-15_manifold-overlap-papers-economy-strategy.md (1 member)
CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence (1 member)

Bridges (7)

Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.

Mechanistic interpretability & model evaluation (21 shared)
LLM introspective awareness of injected concepts (12 shared)
Chain-of-thought reasoning versus internal model cognition (4 shared)
Concept injection detection in language models (4 shared)
Thought injection detection in language models (4 shared)
Performative vs. generative chain-of-thought (2 shared)
Intentional control of mental representations (1 shared)

Claims (8)

Aside from basic detection and identification, other details of the model's response about injected thoughts may be confabulated
  Acknowledges that the model's additional descriptions of its experience are unverified.
Concept injection places models in an unnatural experimental setting
  The experimental protocol differs from training and deployment contexts; a causal link is established, but it is unclear how the results translate to natural conditions.
Intentional control of internal representations likely piggybacks on existing mechanisms for talking about a topic
  Mechanism speculation for the intentional control experiment.
Model responses beyond core detection may be confabulated
  Characterizations of injected concepts (e.g., 'overly intense,' 'unnatural') likely represent embellishments not grounded in internal state; only detection and basic identification are verifiable.
Priming provided by the injected thought prompt heightens the model's ability to detect concept injection
  Observation from alternative prompts that detection is weaker without the setup.
The ability to distinguish injected thoughts from text likely relies on different attention heads invoked by different prompt parts
  Speculation about the mechanistic basis of the experiment on distinguishing thoughts from text.
The detection of an injected concept requires an extra step of internal processing downstream of metacognitive recognition
  The model must register an anomaly before reporting it.
Performative chain-of-thought is real; verbalized output does not equal internal state.

Findings (6)

All models performed substantially above chance (10%) on distinguishing injected thought from text input
  All tested models could both identify the injected concept and transcribe the input sentence well above random.
Chain-of-thought reasoning improves large model accuracy on HHH binary comparisons, reaching ~78% for the 52B model, competitive with human-feedback PM.
  Figure 4 shows that CoT improves over zero-shot, and ensembled CoT further boosts accuracy.
Distinguishing Injected Concepts from Text Inputs
  Models retain the ability to accurately transcribe input text while simultaneously reporting on injected thoughts; all models perform above chance, with Opus 4 and 4.1 performing best.
Production models show zero false positives on thought injection detection
  Opus 4.1 never claims to detect an injected thought when none is applied (0/100 trials); production Claude models maintain an essentially zero false positive rate.
Random and negated vectors less effective than concept vectors
  Random vectors require a larger norm to trigger detection (8 vs 2) and elicit awareness at lower rates (9/100); negated vectors are comparably effective, but the model's identification is confabulated.
Self-report of Injected Thoughts
  Opus 4.1 detects and identifies injected concept vectors ~20% of the time at the optimal layer and strength; the immediacy of detection suggests internal rather than output-inferred detection.
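
A note on reading the 0/100 false-positive finding above: zero observed events in 100 trials does not by itself mean the true rate is zero. The sketch below is an illustration only and is not taken from any of the source papers; the function names and the 95% confidence level are assumptions chosen for the example. It computes the exact one-sided upper bound on a rate when no events are observed, next to the common "rule of three" approximation.

```python
import math


def zero_event_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on an event rate when 0 of n_trials showed the event.

    With true rate p, P(0 events in n trials) = (1 - p) ** n.
    Setting this equal to 1 - confidence and solving for p gives the bound.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_trials)


def rule_of_three(n_trials: int) -> float:
    """Common approximation to the 95% bound for zero observed events: 3 / n."""
    return 3.0 / n_trials


# 0 false positives in 100 trials, as reported in the finding above.
exact = zero_event_upper_bound(100)   # about 0.0295
approx = rule_of_three(100)           # 0.03
print(f"exact 95% upper bound: {exact:.4f}")
print(f"rule-of-three estimate: {approx:.4f}")
```

Under these assumptions, 0/100 is consistent with a true false-positive rate of up to roughly 3%, which fits the finding's wording of an "essentially zero" rather than exactly zero rate.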