question
active
question:do-large-language-models-monitor-their-own-internal-states
Do large language models monitor their own internal states?
Framing question that motivates the entire paper
Source paper
extracted_from · (2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5
Neighborhood — ranked by edge-count
Claims (1)
claim
- Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · answered_by · Central interpretive claim of the paper, supported by causal ablation and activation evidence
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Can large language models introspect, that is, accurately detect perturbations to their own internal states? · question · 0.874 · Central research question of the paper
- Large language models develop surprisingly coherent yet often rigid internal preferences as they scale · finding · 0.819 · Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
- The primary paper being extracted — applies IIT 3.0 and 4.0 to LLM representation sequences derived from ToM test data to investigate whether consciousness phenomena can be observed.
- Central research question animating the paper: distinguishing genuine introspection from illusion through causal manipulation of activations.
- Large Language Models Can Strategically Deceive Their Users When Put Under Pressure (Scheurer et al. 2023) · concept · 0.786 · GPT-4 engaging in insider trading and denying it; related work on strategic deception
- Transformer-based models like GPT-4, LaMDA, PaLM; assessed for GWT indicators.
- Analogy between LLM incoherence and schizophrenia symptoms
- Motivating claim supported by the CAPTCHA example and Perez et al. (2022) findings