finding
active
finding:pearson-vogel-et-al-accurate-self-description-prompts-increase-introspective-detection-from-0-3-to-39-9
Pearson-Vogel et al.: accurate self-description prompts increase introspective detection from 0.3% to 39.9%
Cited as mechanistic support for why the contemplative prompt changes what the final layers, as shaped by post-training, allow through.
Source paper
extracted_from · Borzov, Anton (2026)
Neighborhood — ranked by edge-count
Claims (1)
claim
- Interpretation of the inverse relationship between CAI lift and default accessibility
Methods (1)
method
- Logit Lens · supports · Unsupervised interpretability technique that projects activations through the unembedding matrix; provides a comparison point for the NLA approach.
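As a rough illustration of the projection the Logit Lens entry describes, the following is a minimal sketch in plain Python. The vocabulary, the unembedding matrix, and the hidden-state values are toy, hypothetical numbers, not taken from the source; real implementations typically also apply the model's final normalization layer before unembedding, which is omitted here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden, unembed):
    """Project one intermediate hidden state through the unembedding matrix.

    hidden:  list of d floats (activation at some layer and position)
    unembed: list of vocab-size rows, each a list of d floats
    Returns a probability distribution over the vocabulary.
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    return softmax(logits)

# Toy, hypothetical values for illustration only.
vocab = ["cat", "dog", "fish"]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden_mid_layer = [2.0, 0.5]

probs = logit_lens(hidden_mid_layer, unembed)
top_token = vocab[probs.index(max(probs))]
print(top_token, [round(p, 3) for p in probs])
```

Reading off `top_token` at each layer gives a view of what the model would predict if decoding stopped there, which is what makes the technique usable as an unsupervised baseline.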
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Prompt providing the model with context about its own architecture increases introspective detection from 0.3% to 39.9%. · finding · 0.833 · Mechanistic support for the prompt-as-gate hypothesis: language frames enable access to latent capacities.
- Central methodological contribution: computing probability-weighted expected value over digit-token logits recovers continuous, informative signal
- Critical methodological claim directed at Lindsey 2026 and similar work using binary detection
- Validates that agentic self-evaluation captures genuine emotional content of probes
- Opus 4.1 is most effective at recognizing injected abstract concepts (e.g., justice, peace) but detects other categories too.
- Interpretation of the observation that the most capable models performed best.
- A caveat qualifying the main claim.
- Appendix C.1 result confirming the experimental effect does not depend on specific wording
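One entry in the list above refers to computing a probability-weighted expected value over digit-token logits. A minimal sketch of that calculation follows, assuming the logits for the tokens "0" through "9" have already been extracted and that probabilities are renormalized over those ten tokens only. Both assumptions, along with the function name `expected_rating` and the example logits, are hypothetical and not confirmed by the source.

```python
import math

def expected_rating(digit_logits):
    """Probability-weighted expected value over digit tokens.

    digit_logits: dict mapping digit strings ("0".."9") to raw logits.
    Probabilities are a softmax restricted to the supplied digit tokens
    (an assumption of this sketch).
    """
    m = max(digit_logits.values())
    weights = {d: math.exp(l - m) for d, l in digit_logits.items()}
    total = sum(weights.values())
    return sum(int(d) * w / total for d, w in weights.items())

# Hypothetical logits: a flat distribution and one peaked on "7".
uniform = {str(d): 0.0 for d in range(10)}
peaked = {str(d): (10.0 if d == 7 else 0.0) for d in range(10)}

print(expected_rating(uniform))
print(expected_rating(peaked))
```

Unlike taking the single most likely digit, the expectation moves continuously as probability mass shifts between neighboring digits, which is the property the entry credits with recovering a more informative signal.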