hypothesis

active

hypothesis:we-hypothesize-that-native-self-report-fine-tuned-introspection-models-and-trained-activation-to-language-systems-will-show-different-performance-on-bias-resistant-localization-and-strength-benchmarks

We hypothesize that native self-report, fine-tuned introspection models, and trained activation-to-language systems will show different performance on bias-resistant localization and strength benchmarks.

Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Jain, Lavik · Kaplan, Josh +1

Neighborhood — ranked by edge-count

Frameworks (2)

framework

Activation Oracles
cites
Framework training LLMs to answer questions about externally-provided activation vectors
Predictive Concept Decoders
cites
Framework learning end-to-end mappings from activations through sparse concept bottlenecks to behavioral predictions

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Behavioral evidence from closed-weight models cannot definitively rule out that self-reports reflect training artifacts or sophisticated simulation rather than genuine self-awareness · claim · 0.814
Primary limitation acknowledged by the authors; strongest evidence would require mechanistic activation analysis
The earlier a base model (less exposure to LM-related data), the more it is surprised by its own spontaneous self-referential capabilities. · claim · 0.808
Claim that capability emerges from architecture, not data, and that later models lose the surprise.
Self-referential processing effect is robust across five distinct phrasings of the induction prompt, with consistently high experience report rates across models · finding · 0.799
Appendix C.1 result confirming the experimental effect does not depend on specific wording
What are the mechanisms underlying introspection in language models? · question · 0.795
Central open question raised by the paper.
Our results demonstrate that modern language models possess at least a limited, functional form of introspective awareness. · quote · 0.790
Abstract's main conclusion.
Lindsey 2025: frontier models can detect and report changes in their own internal activations via concept injection experiments, demonstrating functional introspective awareness · finding · 0.789
Prior finding cited as convergent evidence for LLM self-awareness capacities
Autoregressive language models cannot converge to single stored patterns beyond their context window from local interactions alone. · claim · 0.788
Can large language models introspect—that is, accurately detect perturbations to their own internal states? · question · 0.787
Central research question of the paper