hypothesis
active

We hypothesize that native self-report, fine-tuned introspection models, and trained activation-to-language systems will show different performance on bias-resistant localization and strength benchmarks.

A comparative prediction that motivates future work contrasting different approaches to LLM self-knowledge.

Source paper

extracted_from
Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs
(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1

Neighborhood — ranked by edge-count

Frameworks (2)

framework
  • Framework training LLMs to answer questions about externally provided activation vectors
  • Framework learning end-to-end mappings from activations through sparse concept bottlenecks to behavioral predictions

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
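
The selection rule described above (cosine at or above 0.65, with no typed edge to this entity) can be illustrated with a minimal sketch. The function names, the dictionary of embeddings, and the representation of typed edges as a set of unordered pairs are all hypothetical; how this page actually computes its neighborhood is not stated in the source.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(target, embeddings, typed_edges, threshold=0.65):
    """List (entity, score) pairs similar to `target` that lack a typed edge to it.

    embeddings:  dict mapping entity name -> embedding vector (assumed layout)
    typed_edges: set of frozenset({a, b}) pairs that already have a typed
                 relation (assumed layout)
    threshold:   0.65 is the cutoff shown on this page
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        if frozenset((target, name)) in typed_edges:
            continue  # already connected by a typed relation
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter would then be reviewed by hand, either to add a typed edge or to merge a duplicate.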