Injected thoughts task

Experimental paradigm in which the model is told that a thought may be injected into it and is asked to report whether it detects one and, if so, to identify it.

Neighborhood — ranked by edge-count

Papers (1)

paper

Emergent Introspective Awareness in Large Language Models
introduces

Concepts (1)

concept

Concept Injection
implements
Technique of injecting activation patterns associated with specific concepts into a model's internal states to test whether self-reports reflect ground truth.
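As a rough illustration of the technique described above, the following is a minimal sketch that assumes a concept vector is built as a difference of mean activations and then added to a hidden state at a chosen strength. The helper names (`mean_vector`, `concept_vector`, `inject`), the difference-of-means construction, and the toy three-dimensional activations are all assumptions made for illustration; they are not taken from the paper's implementation.

```python
from statistics import fmean


def mean_vector(rows):
    """Element-wise mean of equal-length activation vectors."""
    return [fmean(column) for column in zip(*rows)]


def concept_vector(concept_acts, baseline_acts):
    """Hypothetical concept direction: mean(concept) - mean(baseline)."""
    concept_mean = mean_vector(concept_acts)
    baseline_mean = mean_vector(baseline_acts)
    return [c - b for c, b in zip(concept_mean, baseline_mean)]


def inject(hidden, vector, strength):
    """Add the scaled concept vector to one hidden state."""
    return [h + strength * v for h, v in zip(hidden, vector)]


# Toy activations (assumed values, 3 dimensions for readability).
concept_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
baseline_acts = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

v = concept_vector(concept_acts, baseline_acts)      # [2.0, 0.0, 0.0]
steered = inject([0.5, 0.5, 0.5], v, strength=4.0)   # [8.5, 0.5, 0.5]
```

In a real experiment the addition would be applied inside the model at a chosen layer, and the injected concept would serve as the ground truth against which the model's self-report is scored.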

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Self-report of Injected Thoughts · finding · 0.843
Models detect and identify injected concept vectors about 20% of the time at the optimal layer and injection strength in Opus 4.1; the immediacy of detection suggests it arises internally rather than being inferred from the model's own outputs.
Distinguishing thoughts from text task · method · 0.781
Task where the model must simultaneously identify an injected thought and transcribe a text sentence.
The ability to distinguish injected thoughts from text likely relies on different attention heads invoked by different prompt parts · claim · 0.775
Speculation about the mechanistic basis of the distinguishing thoughts from text experiment.
Thoughts As Agents · concept · 0.769
Core assertion extending William James: thoughts are not passive but active agents that facilitate their own transformation and remapping in cognitive systems.
thought detection · concept · 0.767
Task of detecting a model's internal thoughts; found by Lindsey (2026) to peak at ~2/3 depth in transformers.
Aside from basic detection and identification, other details of the model's response about injected thoughts may be confabulated · claim · 0.766
Acknowledges that the model's additional descriptions of its experience are unverified.
"Thoughts are thinkers" · concept · 0.749
William James aphorism cited by Levin to support the idea that thought forms possess minimal agency rather than being purely passive data.
Distinguishing Injected Concepts from Text Inputs · finding · 0.748
Models retain the ability to transcribe input text accurately while simultaneously reporting on injected thoughts; all models perform above chance, with Opus 4/4.1 performing best.
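A listing like the one in this section could be produced by a filter along the following lines. This is a minimal sketch: the 0.65 cosine threshold and the "no typed edge" condition come from the section header, while the function names, the dictionary-of-embeddings layout, and the toy two-dimensional vectors are assumptions for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Entities at or above the threshold that have no typed edge to target."""
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in typed_edges:
            continue  # skip the entity itself and already-linked entities
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Highest similarity first, matching the ordering of the listing.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumed values).
embeddings = {
    "target": [1.0, 0.0],
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
    "linked": [1.0, 0.0],
}
candidates = similar_without_edge("target", embeddings, typed_edges={"linked"})
names = [name for name, _ in candidates]
```

Here `names` contains `a` and `c` in that order: `b` falls below the threshold, and `linked` is excluded because it already has a typed edge.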