claim
active
claim:aside-from-basic-detection-and-identification-other-details-of-the-model-s-response-about-injected-thoughts-may-be-confabulated
Aside from basic detection and identification, other details of the model's response about injected thoughts may be confabulated
Acknowledges that the model's additional descriptions of its experience are unverified.
Source paper
extracted_from · (2026) · Lindsey, Jack
Neighborhood — ranked by edge-count
Communities (3)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Probing Claude and other models for internal detection of artificially injected thoughts across layers.
- Probing early detection of model confidence during chain-of-thought reasoning to optimize inference efficiency and identify confabulation patterns.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Speculation about the mechanistic basis of the distinguishing thoughts from text experiment.
- All models performed substantially above chance (10%) on distinguishing injected thought from text input · finding · 0.825 · All tested models could both identify the injected concept and transcribe the input sentence well above random.
- Characterizations of injected concepts (e.g., 'overly intense,' 'unnatural') likely represent embellishments not grounded in internal state; only detection and basic identification verifiable.
- Observation from alternative prompts that detection is weaker without setup.
- All models exhibit above-baseline representation of the think word when instructed to think about it · finding · 0.803 · In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.
- The model must register an anomaly before reporting it.
- The paper's honest statement of the residual interpretive ambiguity after all controls
- Prior finding cited as convergent evidence for LLM self-awareness capacities