hypothesis
active
We hypothesize that native self-report, fine-tuned introspection models, and trained activation-to-language systems will show different performance on bias-resistant localization and strength benchmarks
Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge
Source paper
extracted_from · (2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1
Neighborhood — ranked by edge-count
Frameworks (2)
framework
- Activation Oracles · cites · Framework training LLMs to answer questions about externally provided activation vectors
- Framework learning end-to-end mappings from activations through sparse concept bottlenecks to behavioral predictions
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Primary limitation acknowledged by the authors; strongest evidence would require mechanistic activation analysis
- Claim that capability emerges from architecture, not data, and that later models lose the surprise.
- Appendix C.1 result confirming the experimental effect does not depend on specific wording
- Central open question raised by the paper.
- Abstract's main conclusion.
- Prior finding cited as convergent evidence for LLM self-awareness capacities
- Can large language models introspect, that is, accurately detect perturbations to their own internal states? · question · 0.787 · Central research question of the paper