hypothesis

active

hypothesis:we-hypothesize-that-partial-introspection-may-fail-under-adversarial-prompts-distribution-shift-and-multiple-simultaneous-injections

We hypothesize that partial introspection may fail under adversarial prompts, distribution shift, and multiple simultaneous injections

Stress-test prediction about robustness limits of the partial introspection finding

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

If someone develops clear enough introspection, they will eventually conclude that thought is rendered as subtle perturbations in phenomenal fields. · hypothesis · 0.796
Cube Flipper's prediction about convergence of insight practice on field model.
Do apparent introspection results reflect genuine metacognitive access to internal representations, or do they emerge from simpler mechanisms such as output distribution shifts? · question · 0.782
Key discriminating question motivating the baseline control experiment
Basal introspective performance is not always maximal and some failure cases are solvable by representational intervention rather than reflecting complete absence of introspective capacity · claim · 0.774
Supported by cross-concept steering finding that focus→wellbeing steering dramatically improves introspection
Prior experimental paradigms may overestimate introspective capabilities by conflating genuine self-awareness with uniform output distribution shifts · claim · 0.771
Critical methodological claim directed at Lindsey 2026 and similar work using binary detection
Pearson-Vogel et al.: accurate self-description prompts increase introspective detection from 0.3% to 39.9% · finding · 0.771
Cited to mechanistically support why the contemplative prompt changes what post-training-shaped final layers allow through
Introspective capabilities are confined to early-layer injections (L0-L5) and collapse to chance thereafter · claim · 0.771
Key quantitative characterization of the layer-dependence of partial introspection
We hypothesize that native self-report, fine-tuned introspection models, and trained activation-to-language systems will show different performance on bias-resistant localization and strength benchmarks · hypothesis · 0.767
Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge
Either introspection is an emergent capability requiring larger scale, or more stringent controls are needed to test introspection in smaller models · claim · 0.766
Alternative interpretations offered for why binary detection fails in Llama 3.1 8B but frontier models claim success
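
The "cosine ≥ 0.65 · no typed edge" rule that selects this neighborhood can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed relations are stored as unordered pairs of entity IDs. The names `cosine`, `untyped_neighbors`, `embeddings`, and `typed_edges` are hypothetical and not taken from the system that produced this page, whose actual implementation is not described here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with score >= threshold and no typed edge
    to the target, sorted by descending score.

    embeddings:  dict mapping entity ID -> embedding vector (assumed layout)
    typed_edges: set of frozenset({id_a, id_b}) pairs (assumed layout)
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # an entity is not its own neighbor
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter would correspond to the list above: close in embedding space, but not yet connected by any typed relation.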