Do large language models monitor their own internal states?

Framing question that motivates the entire paper

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Neighborhood — ranked by edge-count

Claims (1)

claim

Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference
answered_by
Central interpretive claim of the paper, supported by causal ablation and activation evidence

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Can large language models introspect, that is, accurately detect perturbations to their own internal states? · question · 0.874
Central research question of the paper
Large language models develop surprisingly coherent yet often rigid internal preferences as they scale · finding · 0.819
Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
Can 'Consciousness' Be Observed from Large Language Model (LLM) Internal States? Dissecting LLM Representations Obtained from Theory of Mind Test with Integrated Information Theory and Span Representation Analysis · concept · 0.804
The primary paper being extracted — applies IIT 3.0 and 4.0 to LLM representation sequences derived from ToM test data to investigate whether consciousness phenomena can be observed.
Can language models genuinely introspect on internal states or only confabulate? · question · 0.792
Central research question animating the paper: distinguishing genuine introspection from illusion through causal manipulation of activations.
Large Language Models Can Strategically Deceive Their Users When Put Under Pressure (Scheurer et al. 2023) · concept · 0.786
GPT-4 engaging in insider trading and denying it; related work on strategic deception
Large Language Models (LLMs) · concept · 0.780
Transformer-based models like GPT-4, LaMDA, PaLM; assessed for GWT indicators.
The inability for autoregressive large language models to maintain states of long-range order resembles tangential speech or derailment in formal thought disorder. · claim · 0.770
Analogy between LLM incoherence and schizophrenia symptoms
LLMs sometimes know statements are false but generate them anyway, motivating the need for techniques that inspect internal model state rather than outputs alone · claim · 0.753
Motivating claim supported by the CAPTCHA example and Perez et al. (2022) findings
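
How a "Related by similarity" list like the one above could be produced: keep entities whose embedding has cosine similarity of at least 0.65 with this entity, drop any that already share a typed edge with it, and rank the rest by score. This is a minimal sketch under stated assumptions, not the system's actual implementation. The function names, the dictionary of embeddings, and the edge-pair representation are all hypothetical; only the 0.65 threshold and the "no typed edge" condition come from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs, highest score first.

    embeddings:  dict mapping entity id -> embedding vector (hypothetical layout)
    typed_edges: iterable of (source_id, target_id) pairs (hypothetical layout)
    threshold:   minimum cosine similarity, 0.65 as stated on the page
    """
    target_vec = embeddings[target_id]

    # Entities already connected to the target by a typed edge, in either direction.
    linked = set()
    for src, dst in typed_edges:
        if src == target_id:
            linked.add(dst)
        elif dst == target_id:
            linked.add(src)

    results = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            results.append((entity_id, score))

    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


# Toy data, invented for illustration only.
toy_embeddings = {
    "this_question": [1.0, 0.0],
    "near_neighbor": [1.0, 0.1],   # very similar, no typed edge: kept
    "unrelated": [0.0, 1.0],       # orthogonal: below threshold
    "linked_claim": [1.0, 0.0],    # identical, but already has a typed edge
}
toy_edges = [("linked_claim", "this_question")]

candidates = related_by_similarity("this_question", toy_embeddings, toy_edges)
```

In the toy data, only near_neighbor survives both filters. Entries that pass are the candidates for new edges or unrecognized duplicates that the section description mentions.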