claim

active

claim:llms-can-compute-meaningful-functions-over-perturbations-to-their-internal-states-establishing-introspection-as-a-real-but-layer-dependent-phenomenon

LLMs can compute meaningful functions over perturbations to their internal states, establishing introspection as a real but layer-dependent phenomenon

Primary positive claim of the paper, grounded in strength comparison and localization results

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1

Neighborhood — ranked by edge-count

Papers (1)

paper

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs
introduces

Findings (3)

finding

Sentence localization accuracy reaches 88% at layer 2, α=5 vs. 10% chance in 10-way classification
supports
Highest localization accuracy achieved, showing strong partial introspection for early-layer injections
Strength comparison accuracy reaches 73% at layer 3 for injection pair (2,6) vs. 50% chance
supports
Secondary positive result for strength comparison showing graded sensitivity to perturbation magnitude
Strength comparison pair (3,7) with |Δα|=4 outperforms pair (3,5) with |Δα|=2, indicating graded sensitivity to perturbation magnitude
supports
Shows that introspective accuracy scales with injection strength difference, not binary detection
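
The two headline accuracies above are measured against different chance levels (10% for 10-way localization, 50% for the binary strength comparison), so the raw percentages are not directly comparable. The following is a minimal sketch of one common way to put them on a shared scale by rescaling accuracy so that chance maps to 0 and perfect performance maps to 1. This normalization and the helper name `above_chance` are illustrative assumptions, not a metric reported in the paper.

```python
def above_chance(accuracy: float, n_choices: int) -> float:
    """Rescale accuracy so that uniform guessing maps to 0.0 and perfect to 1.0.

    Hypothetical helper for illustration; not taken from the paper.
    """
    chance = 1.0 / n_choices
    return (accuracy - chance) / (1.0 - chance)


# Accuracies and chance levels as listed in the findings above.
localization = above_chance(0.88, n_choices=10)  # sentence localization, layer 2, alpha=5
strength = above_chance(0.73, n_choices=2)       # strength comparison, layer 3, pair (2,6)
```

Under this rescaling, localization comes out at roughly 0.87 and strength comparison at 0.46, which is consistent with the page's framing of localization as the primary result and strength comparison as the secondary one.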

Concepts (1)

concept

AI Safety
associated_with
The project of ensuring AI systems do not harm humans (and other animals); sometimes in tension with AI welfare.

Claims (1)

claim

Introspective capabilities are confined to early-layer injections (L0-L5) and collapse to chance thereafter
extends
Key quantitative characterization of the layer-dependence of partial introspection

Questions (1)

question

Can large language models introspect—that is, accurately detect perturbations to their own internal states?
gates
Central research question of the paper

Methods (1)

method

matched-pairs design
supports
Experimental design where injection strengths are swapped between sentences in two parts of each trial to cancel positional preferences
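
The cancellation described above can be sketched with a toy simulation. Each trial has two parts with the injection strengths swapped between sentence positions, so a responder that only prefers a position is right in exactly one of the two parts. The responder functions and scoring helper are hypothetical stand-ins for a model's answers, not the paper's evaluation code; the strength pairs are the ones named in the findings.

```python
from typing import Callable, List, Tuple

# One part of a trial: (strength at sentence position 0, strength at position 1).
Part = Tuple[float, float]


def matched_pair(alpha_a: float, alpha_b: float) -> List[Part]:
    """Return both parts of one trial: the original order and the swapped order."""
    return [(alpha_a, alpha_b), (alpha_b, alpha_a)]


def accuracy(responder: Callable[[Part], int],
             pairs: List[Tuple[float, float]]) -> float:
    """Fraction of parts in which the responder picks the stronger injection."""
    correct = 0
    total = 0
    for alpha_a, alpha_b in pairs:
        for part in matched_pair(alpha_a, alpha_b):
            truth = 0 if part[0] > part[1] else 1
            correct += int(responder(part) == truth)
            total += 1
    return correct / total


def always_first(part: Part) -> int:
    """Hypothetical responder with a pure positional preference."""
    return 0


def true_detector(part: Part) -> int:
    """Hypothetical responder that actually tracks injection strength."""
    return 0 if part[0] > part[1] else 1


pairs = [(2.0, 6.0), (3.0, 5.0), (3.0, 7.0)]
biased_acc = accuracy(always_first, pairs)
detector_acc = accuracy(true_detector, pairs)
```

Here `biased_acc` is exactly 0.5 and `detector_acc` is 1.0: the positional preference is pinned to chance by construction, so any accuracy above 50% has to come from sensitivity to the perturbation itself.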

Quotes (1)

quote

"Our findings demonstrate that LLMs can compute meaningful functions over perturbations to their internal states, establishing introspection as a real but layer-dependent phenomenon that merits further investigation."
supports
Central thesis statement of the paper

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

LLM introspection on internal computations is architecturally permitted; whether models leverage this is an empirical question.
claim · 0.834
Core claim directly challenged by prior work denying introspection; forms foundation for Koan Battery introspection studies.
Saying that LLMs cannot introspect or cannot introspect on what they were doing internally while generating or reading past tokens in principle is just dead wrong. The architecture permits it.
quote · 0.827
Core quote asserting architectural introspection permission.
So at any point in the network, the transformer not only receives information from its past... but also has causal influence over its future processing. So, saying that LLMs cannot introspect... is incorrect.
quote · 0.821
Core summary of Janus' position on autoregressive recurrence enabling introspection.
LLM representations exhibit intriguing patterns under spatio-permutational analyses, suggesting a potentially profound yet tentative indication of consciousness.
claim · 0.821
Qualified positive claim from spatio permutation analysis where two cases satisfy all three criteria.
The systematic behavioral shift of LLMs under self-referential processing conditions predicted by consciousness theories represents something more structured than superficial correlations in training data
claim · 0.820
The paper's claim that theoretical convergence across GWT, RPT, HOT, IIT makes the findings non-coincidental
Do LLMs leverage architectural capacity for introspection on internal computations and prior token generation?
question · 0.816
Central empirical question separating architectural possibility from actual model behavior; gates introspection research.
If self-referential processing causally instantiates recurrent integration, global broadcasting, and metacognitive monitoring at the algorithmic level, then LLMs under this regime would satisfy the functional requirements of leading consciousness theories
hypothesis · 0.813
The paper's key theoretical prediction that mechanistic studies should investigate
We hypothesize that 'consciousness' phenomena can be observed in the internal states of an LLM, specifically in its learned representations when analyzed as a sequence.
hypothesis · 0.812
Primary research hypothesis driving the entire study; operationalized via three criteria.