question

active

question:can-large-language-models-introspect-that-is-accurately-detect-perturbations-to-their-own-internal-states

Can large language models introspect—that is, accurately detect perturbations to their own internal states?

Central research question of the paper

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1

Neighborhood — ranked by edge-count

Claims (1)

claim

LLMs can compute meaningful functions over perturbations to their internal states, establishing introspection as a real but layer-dependent phenomenon
gates
Primary positive claim of the paper, grounded in strength comparison and localization results

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Do large language models monitor their own internal states? · question · 0.874
Framing question that motivates the entire paper
Can language models genuinely introspect on internal states or only confabulate? · question · 0.866
Central research question animating the paper: distinguishing genuine introspection from illusion through causal manipulation of activations.
Can 'Consciousness' Be Observed from Large Language Model (LLM) Internal States? Dissecting LLM Representations Obtained from Theory of Mind Test with Integrated Information Theory and Span Representation Analysis · concept · 0.816
The primary paper being extracted — applies IIT 3.0 and 4.0 to LLM representation sequences derived from ToM test data to investigate whether consciousness phenomena can be observed.
"Our findings demonstrate that LLMs can compute meaningful functions over perturbations to their internal states, establishing introspection as a real but layer-dependent phenomenon that merits further investigation." · quote · 0.815
Central thesis statement of the paper
Large language models develop surprisingly coherent yet often rigid internal preferences as they scale · finding · 0.815
Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
Our results demonstrate that modern language models possess at least a limited, functional form of introspective awareness. · quote · 0.809
Abstract's main conclusion.
The inability for autoregressive large language models to maintain states of long-range order resembles tangential speech or derailment in formal thought disorder. · claim · 0.807
Analogy between LLM incoherence and schizophrenia symptoms
Emergent Introspective Awareness in Large Language Models (Lindsey, 2025) · concept · 0.806
Related work demonstrating LLM introspective capabilities with scale-dependent pattern paralleling ESR
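
The selection rule stated for the similarity list above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This page does not say how embeddings or edges are stored, so everything here is an assumption made for illustration: the names `cosine`, `similar_untyped`, `embeddings`, and `typed_edges` are hypothetical, and embeddings are taken to be plain lists of floats keyed by entity name.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_untyped(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities that are similar to `target`
    but have no typed edge to it, highest score first.

    embeddings:  dict of entity name -> vector (hypothetical storage format)
    typed_edges: set of frozenset({a, b}) pairs that already share a typed edge
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue  # skip the entity itself
        if frozenset((target, name)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, each returned entity is a candidate for a new edge or an unrecognized duplicate, which is how the list above is described.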