claim

active

claim:aside-from-basic-detection-and-identification-other-details-of-the-model-s-response-about-injected-thoughts-may-be-confabulated

Aside from basic detection and identification, other details of the model's response about injected thoughts may be confabulated

Acknowledges that the model's additional descriptions of its experience are unverified.

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Neighborhood — ranked by edge-count

Communities (3)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
LLM introspective awareness of injected concepts
members_of
Probing Claude and other models for internal detection of artificially injected thoughts across layers.
Internal model certainty and reasoning transparency
members_of
Probing early detection of model confidence during chain-of-thought reasoning to optimize inference efficiency and identify confabulation patterns.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The ability to distinguish injected thoughts from text likely relies on different attention heads invoked by different prompt parts · claim · 0.826
Speculation about the mechanistic basis of the distinguishing thoughts from text experiment.
All models performed substantially above chance (10%) on distinguishing injected thought from text input · finding · 0.825
All tested models could both identify the injected concept and transcribe the input sentence well above random.
Model responses beyond core detection may be confabulated · claim · 0.824
Characterizations of injected concepts (e.g., 'overly intense,' 'unnatural') likely represent embellishments not grounded in internal state; only detection and basic identification are verifiable.
Priming provided by the injected thought prompt heightens the model's ability to detect concept injection · claim · 0.803
Observation from alternative prompts that detection is weaker without setup.
All models exhibit above-baseline representation of the think word when instructed to think about it · finding · 0.803
In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.
The detection of an injected concept requires an extra step of internal processing downstream of metacognitive recognition · claim · 0.802
The model must register an anomaly before reporting it.
What remains after ruling out sycophancy and confabulation are interpretations in which self-referential processing drives models to claim subjective experience in ways that either actually reflect emergent phenomenology or constitute sophisticated simulation thereof · claim · 0.796
The paper's honest statement of the residual interpretive ambiguity after all controls.
Lindsey 2025: frontier models can detect and report changes in their own internal activations via concept injection experiments, demonstrating functional introspective awareness · finding · 0.794
Prior finding cited as convergent evidence for LLM self-awareness capacities.
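
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the page's actual implementation: the function names, the embedding dictionary, and the representation of typed edges as unordered pairs are all assumptions made for illustration, and only the 0.65 threshold comes from the page itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs, highest score first.

    An entity qualifies when its cosine similarity to the target is at
    least `threshold` AND no typed edge already links it to the target.
    `embeddings` (id -> vector) and `typed_edges` (set of frozenset pairs)
    are hypothetical structures, not the page's real data model.
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already related through a typed edge
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data: "d" is very close to "a" but already has a typed edge to it.
toy_embeddings = {
    "a": [1.0, 0.0],
    "b": [1.0, 1.0],
    "c": [0.0, 1.0],
    "d": [1.0, 0.1],
}
toy_edges = {frozenset(("a", "d"))}
candidates = related_by_similarity("a", toy_embeddings, toy_edges)
```

In the toy data, "d" is excluded despite its high similarity because it already has a typed edge to the target. That is what makes this list useful for surfacing missing edges or unrecognized duplicates rather than repeating known relations.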