hypothesis
active
We hypothesize that partial introspection may fail under adversarial prompts, distribution shift, and multiple simultaneous injections
Stress-test prediction about robustness limits of the partial introspection finding
Source paper
extracted_from · (2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Cube Flipper's prediction about convergence of insight practice on field model.
- Key discriminating question motivating the baseline control experiment
- Supported by cross-concept steering finding that focus→wellbeing steering dramatically improves introspection
- Critical methodological claim directed at Lindsey 2026 and similar work using binary detection
- Pearson-Vogel et al.: accurate self-description prompts increase introspective detection from 0.3% to 39.9% · finding · 0.771 · Cited to mechanistically support why the contemplative prompt changes what post-training-shaped final layers allow through
- Key quantitative characterization of the layer-dependence of partial introspection
- Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge
- Alternative interpretations offered for why binary detection fails in Llama 3.1 8B but frontier models claim success