claim
active
claim:llms-can-compute-meaningful-functions-over-perturbations-to-their-internal-states-establishing-introspection-as-a-real-but-layer-dependent-phenomenon
LLMs can compute meaningful functions over perturbations to their internal states, establishing introspection as a real but layer-dependent phenomenon
Primary positive claim of the paper, grounded in strength comparison and localization results
Source paper
extracted_from · 2025 · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1
Neighborhood, ranked by edge count
Papers (1)
Findings (3)
- Sentence localization accuracy reaches 88% at layer 2 (α = 5), vs. 10% chance in 10-way classification · supports · Highest localization accuracy achieved, showing strong partial introspection for early-layer injections
- Secondary positive result for strength comparison showing graded sensitivity to perturbation magnitude
- Shows that introspective accuracy scales with injection strength difference, not binary detection
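A worked sketch of the chance baseline behind the localization finding: in k-way classification a guessing responder is correct with probability 1/k, so 10-way classification gives 10%. The exact one-sided binomial test below asks how likely an accuracy at least as high as the observed one would be under pure guessing. The trial count is not stated here, so the usage lines assume a hypothetical n = 50 (44 correct, i.e. 88%); `chance_rate` and `binomial_tail` are illustrative helpers, not the paper's analysis code.

```python
from math import comb


def chance_rate(num_classes: int) -> float:
    """Accuracy expected from uniform guessing in num_classes-way classification."""
    return 1.0 / num_classes


def binomial_tail(successes: int, trials: int, p: float) -> float:
    """One-sided exact binomial p-value: P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical trial count (not from the source): 44 of 50 correct is 88% accuracy.
n_trials = 50
n_correct = 44
p_chance = chance_rate(10)
p_value = binomial_tail(n_correct, n_trials, p_chance)
print(f"chance = {p_chance:.0%}, observed = {n_correct / n_trials:.0%}, p = {p_value:.3g}")
```

With any realistic trial count the tail probability is vanishingly small; the test only measures distance from guessing and does not by itself identify the mechanism behind the accuracy.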
Concepts (1)
- AI Safety · associated_with · The project of ensuring AI systems do not harm humans (and other animals); sometimes in tension with AI welfare.
Claims (1)
- Key quantitative characterization of the layer-dependence of partial introspection
Questions (1)
- Central research question of the paper
Methods (1)
- matched-pairs design · supports · Experimental design in which the injection strengths are swapped between the two sentences across the two parts of each trial, canceling positional preferences
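A minimal sketch of why the matched-pairs design cancels positional preferences, assuming each trial presents sentence A first and sentence B second, asks which received the stronger injection, and swaps the two strengths between parts. A responder that always names the same position is right in exactly one part, so it scores 0.5 on every trial, the same as guessing. `Trial`, `score_trial`, and the responder functions are hypothetical illustrations, not the paper's code.

```python
from dataclasses import dataclass
from typing import Callable

# A responder sees (strength at position 1, strength at position 2) and returns
# 0 if it judges position 1 stronger, 1 if it judges position 2 stronger.
# Hypothetical interface for illustration only.
Responder = Callable[[float, float], int]


@dataclass
class Trial:
    strength_a: float  # strength on sentence A (position 1) in part 1
    strength_b: float  # strength on sentence B (position 2) in part 1


def score_trial(trial: Trial, respond: Responder) -> float:
    """Mean correctness over the two parts; strengths swap between parts."""
    if trial.strength_a == trial.strength_b:
        raise ValueError("matched-pairs trials need two distinct strengths")
    parts = [
        (trial.strength_a, trial.strength_b),  # part 1: original assignment
        (trial.strength_b, trial.strength_a),  # part 2: strengths swapped
    ]
    correct = 0
    for first, second in parts:
        truth = 0 if first > second else 1
        correct += int(respond(first, second) == truth)
    return correct / len(parts)


def always_first(first: float, second: float) -> int:
    return 0  # pure positional bias toward position 1


def always_second(first: float, second: float) -> int:
    return 1  # pure positional bias toward position 2


def faithful(first: float, second: float) -> int:
    return 0 if first > second else 1  # tracks the actual strengths


trial = Trial(strength_a=5.0, strength_b=2.0)
for name, fn in [("always_first", always_first), ("always_second", always_second), ("faithful", faithful)]:
    print(name, score_trial(trial, fn))
```

Because the stronger injection lands in each position exactly once per trial, only a responder that tracks the strengths themselves can score above 0.5.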
Quotes (1)
- Central thesis statement of the paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge · Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Core claim directly challenged by prior work denying introspection; forms foundation for Koan Battery introspection studies.
- Core quote asserting architectural introspection permission.
- Core summary of Janus' position on autoregressive recurrence enabling introspection.
- Qualified positive claim from spatio permutation analysis where two cases satisfy all three criteria.
- The paper's claim that theoretical convergence across GWT, RPT, HOT, IIT makes the findings non-coincidental
- Do LLMs leverage architectural capacity for introspection on internal computations and prior token generation? · question · 0.816 · Central empirical question separating architectural possibility from actual model behavior; gates introspection research.
- The paper's key theoretical prediction that mechanistic studies should investigate
- Primary research hypothesis driving the entire study; operationalized via three criteria.
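A minimal sketch of how a "related by similarity" list like this one could be assembled, assuming each entity has an embedding vector: keep candidates whose cosine similarity to this entity is at least 0.65 and that have no typed edge to it, then rank by score (the 0.816 shown for the question above appears to be such a score). The embeddings, entity names, and `similar_untyped` helper are hypothetical; the page does not say how its vectors are produced.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similar_untyped(
    target: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    typed_neighbors: Set[str],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Candidates at or above threshold with no typed edge, highest score first."""
    hits = []
    for name, vec in candidates.items():
        if name in typed_neighbors:
            continue  # already linked by a typed edge; listed in the neighborhood instead
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy 2-D embeddings (hypothetical).
target = [1.0, 0.0]
candidates = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(similar_untyped(target, candidates, typed_neighbors={"a"}))
```

Excluding entities that already have a typed edge is what makes the remaining list useful as a pool of candidate new edges or possible duplicates.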