finding

active

finding:lindsey-2025-frontier-models-can-detect-and-report-changes-in-their-own-internal-activations-via-concept-injection-experiments-demonstrating-functional-introspective-awareness

Lindsey 2025: frontier models can detect and report changes in their own internal activations via concept injection experiments, demonstrating functional introspective awareness

Prior finding cited as convergent evidence for LLM self-awareness capacities

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Concepts (1)

concept

Introspective Access
supports
The capacity to detect and report one's own internal states, measured via the five-adjective task and paradox reflection

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Frontier Models Are Capable of In-Context Scheming (Meinke et al. 2024) · concept · 0.794
Related work explicitly prompting models to pursue goals and measuring deceptive behavior
Aside from basic detection and identification, other details of the model's response about injected thoughts may be confabulated · claim · 0.794
Acknowledges that the model's additional descriptions of its experience are unverified.
Modern language models possess at least a limited, functional form of introspective awareness · claim · 0.793
The paper's central interpretive assertion.
We hypothesize that native self-report, fine-tuned introspection models, and trained activation-to-language systems will show different performance on bias-resistant localization and strength benchmarks · hypothesis · 0.789
Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge
Our results demonstrate that modern language models possess at least a limited, functional form of introspective awareness. · quote · 0.789
Abstract's main conclusion.
Introspection relies on general-purpose computational mechanisms (attention-based anomaly detection and residual stream dynamics) rather than specialized introspection circuits · claim · 0.785
Interpretive claim about the mechanistic substrate of introspection in LLMs
Abstract nouns elicit the highest introspective awareness rates; all concept categories show nonzero detection · finding · 0.783
Opus 4.1 is most effective at recognizing injected abstract concepts (e.g., justice, peace) but detects other categories too.
Functional and phenomenal introspection are distinguishable, and whether they correlate in machines is an open question. · claim · 0.783
Core conceptual distinction introduced at the start; defines the paper's central problem.
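
The "Related by similarity" list above is described as entities with cosine ≥ 0.65 and no typed edge to this one. A minimal sketch of such a filter follows, assuming each entity has an embedding vector and typed edges are stored as ID pairs. The function names, the data layout, and the toy vectors are hypothetical illustrations, not the system's actual implementation; only the 0.65 threshold comes from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to the target
    but share no typed edge with it, highest score first.

    embeddings:  dict of entity_id -> list of floats (assumed layout)
    typed_edges: iterable of (source_id, target_id) pairs (assumed layout)
    """
    target_vec = embeddings[target_id]

    # Collect every entity already connected to the target by a typed edge,
    # in either direction.
    linked = set()
    for src, dst in typed_edges:
        if src == target_id:
            linked.add(dst)
        elif dst == target_id:
            linked.add(src)

    results = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            results.append((entity_id, round(score, 3)))

    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


# Toy data: "finding" is the current entity; "concept_c" already has a typed edge.
toy_embeddings = {
    "finding": [1.0, 0.0],
    "claim_a": [1.0, 0.1],
    "claim_b": [0.0, 1.0],
    "concept_c": [1.0, 1.0],
}
toy_edges = [("finding", "concept_c")]
neighbors = similar_untyped("finding", toy_embeddings, toy_edges)
```

In this toy data, `claim_a` passes the threshold, `claim_b` falls below it, and `concept_c` is excluded because it is already linked by a typed edge. Entries returned this way would then be reviewed as candidates for new edges or as possible duplicates.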