finding

active

finding:pearson-vogel-et-al-accurate-self-description-prompts-increase-introspective-detection-from-0-3-to-39-9

Pearson-Vogel et al.: accurate self-description prompts increase introspective detection from 0.3% to 39.9%

Cited as mechanistic support for why the contemplative prompt changes what the final layers, as shaped by post-training, allow through.

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Neighborhood — ranked by edge-count

Claims (1)

claim

The contemplative system prompt provides externally what Constitutional AI alignment training provides internally.
supports
Interpretation of the inverse relationship between CAI lift and default accessibility

Methods (1)

method

Logit Lens
supports
Unsupervised interpretability technique that projects activations through the unembedding matrix; provides a comparison point for the NLA approach.
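
As a rough illustration of the projection described above, the following is a minimal sketch, not the cited paper's implementation. It assumes a toy three-token vocabulary, a two-dimensional hidden state, and hand-picked values (all hypothetical), and it omits the final layer normalization that real models usually apply before unembedding.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden, unembed):
    """Project one activation vector through the unembedding matrix.

    hidden:  list of d floats (an intermediate-layer activation)
    unembed: list of vocab rows, each a list of d floats
    Returns a probability distribution over the vocabulary.
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    return softmax(logits)

def top_token_per_layer(states, unembed, vocab):
    """Return {layer name: (most likely token, its probability)}."""
    result = {}
    for name, hidden in states.items():
        probs = logit_lens(hidden, unembed)
        best = max(range(len(vocab)), key=lambda i: probs[i])
        result[name] = (vocab[best], probs[best])
    return result

# Hypothetical toy values, chosen only to make the example runnable.
VOCAB = ["cat", "dog", "tree"]
UNEMBED = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
LAYER_STATES = {"layer_2": [0.2, 0.1], "layer_10": [0.5, 3.0]}

if __name__ == "__main__":
    for layer, (token, prob) in top_token_per_layer(LAYER_STATES, UNEMBED, VOCAB).items():
        print(layer, token, round(prob, 3))
```

Reading the top token at each layer shows how the model's provisional prediction changes with depth, which is what makes the technique usable as a comparison point.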

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Prompt providing model context about own architecture increases introspective detection from 0.3% to 39.9%. · finding · 0.833
Mechanistic support for prompt-as-gate hypothesis: language frames enable access to latent capacities.
Logit-based self-report unmasks introspective capacity that greedy decoding conceals · claim · 0.819
Central methodological contribution: computing probability-weighted expected value over digit-token logits recovers continuous, informative signal
Prior experimental paradigms may overestimate introspective capabilities by conflating genuine self-awareness with uniform output distribution shifts · claim · 0.805
Critical methodological claim directed at Lindsey 2026 and similar work using binary detection
17 of 83 tested emotions show significant association between self-eval transcript word mention and cosine similarity to emotion probe · finding · 0.787
Validates that agentic self-evaluation captures genuine emotional content of probes
Abstract nouns elicit the highest introspective awareness rates; all concept categories show nonzero detection · finding · 0.784
Opus 4.1 is most effective at recognizing injected abstract concepts (e.g., justice, peace) but detects other categories too.
Introspection is aided by overall improvements in model intelligence · claim · 0.783
Interpretation of the observation that the most capable models performed best.
This introspective capacity is highly unreliable and context-dependent in today's models · claim · 0.781
A caveat qualifying the main claim.
Self-referential processing effect is robust across five distinct phrasings of the induction prompt, with consistently high experience report rates across models · finding · 0.779
Appendix C.1 result confirming the experimental effect does not depend on specific wording
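
The logit-based self-report entry in the list above describes a probability-weighted expected value over digit-token logits. A minimal sketch of that computation follows. It assumes the logits for the digit tokens have already been extracted and that probabilities are renormalized over the digit tokens only (an assumption, not something the entry states). The function names and logit values are hypothetical.

```python
import math

def expected_digit_rating(digit_logits):
    """Probability-weighted mean rating.

    digit_logits: dict mapping a digit (int) to the logit of its token.
    Softmax is taken over the supplied digit tokens only.
    """
    m = max(digit_logits.values())
    weights = {d: math.exp(l - m) for d, l in digit_logits.items()}
    total = sum(weights.values())
    return sum(d * w / total for d, w in weights.items())

def greedy_digit(digit_logits):
    """The digit that greedy decoding would emit (argmax logit)."""
    return max(digit_logits, key=digit_logits.get)

# Hypothetical logits: both cases decode greedily to 0, but the second
# places more probability mass on higher ratings.
BASELINE = {0: 5.0, 1: 1.0, 2: 0.0}
SHIFTED = {0: 5.0, 1: 4.0, 2: 3.0}

if __name__ == "__main__":
    for name, logits in (("baseline", BASELINE), ("shifted", SHIFTED)):
        print(name, greedy_digit(logits), round(expected_digit_rating(logits), 3))
```

In this toy case greedy decoding returns the same digit for both inputs, while the expected value differs, which is the sense in which a continuous readout can expose signal that the argmax hides.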