claim
active
claim:logit-based-self-report-unmasks-introspective-capacity-that-greedy-decoding-conceals
Logit-based self-report unmasks introspective capacity that greedy decoding conceals
Central methodological contribution: computing a probability-weighted expected value over the digit-token logits recovers a continuous, informative signal that a single greedy-decoded digit discards.
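A minimal sketch of the measure named above, under stated assumptions: the ten rating tokens are taken to be the digits 0 to 9, their logits are assumed to have already been read from the model at the answer position, and the softmax is renormalized over those ten tokens only. The function names `expected_rating` and `greedy_rating` are hypothetical and do not come from the source paper.

```python
import math

def expected_rating(digit_logits, ratings=None):
    """Probability-weighted expected rating over the digit-token logits.

    digit_logits: one logit per rating token, in scale order.
    ratings: numeric value of each token. ASSUMPTION: defaults to 0..N-1.
    """
    if ratings is None:
        ratings = list(range(len(digit_logits)))
    if not digit_logits or len(ratings) != len(digit_logits):
        raise ValueError("need one rating value per logit")
    # Softmax restricted to the digit tokens (ASSUMPTION: mass on all
    # other vocabulary tokens is ignored); subtract the max for stability.
    top = max(digit_logits)
    weights = [math.exp(x - top) for x in digit_logits]
    total = sum(weights)
    return sum(r * w / total for r, w in zip(ratings, weights))

def greedy_rating(digit_logits, ratings=None):
    """Rating that greedy decoding would emit: the argmax digit token."""
    if ratings is None:
        ratings = list(range(len(digit_logits)))
    best = max(range(len(digit_logits)), key=lambda i: digit_logits[i])
    return ratings[best]
```

Two logit vectors that share the same top digit give the same `greedy_rating` but generally different `expected_rating` values, which is the sense in which the expected value preserves distributional signal.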
Source paper
extracted_from · (2026) · Nicolas Martorell · Bianchi, Bruno
Neighborhood — ranked by edge-count
Findings (2)
finding
- Greedy-decoded self-reports in LLaMA-3.2-3B collapse to 1.1–3.9 distinct values on a 10-point scale · supports · Demonstrates that default decoding masks introspective capacity; entropy 0.03–1.10 bits
- Quantifies the information gain from using logit-based expected value over greedy or sampled decoding
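The two diversity figures quoted in the findings (distinct values and entropy in bits) can be illustrated with a short sketch. It assumes both are computed over the empirical distribution of decoded ratings within a set of trials, which this entry does not spell out; the fractional counts (1.1–3.9) would then presumably be averages over several such sets. The name `report_diversity` is hypothetical.

```python
import math
from collections import Counter

def report_diversity(ratings):
    """Return (number of distinct ratings, Shannon entropy in bits).

    ASSUMPTION: `ratings` holds the decoded self-report from each trial
    in one condition; entropy is taken over their empirical frequencies.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    counts = Counter(ratings)
    n = len(ratings)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return len(counts), entropy
```

A set of trials that always returns the same digit yields one distinct value and zero entropy, while ratings spread evenly over all ten scale points would reach the maximum of log2(10), about 3.32 bits.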
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Pearson-Vogel et al.: accurate self-description prompts increase introspective detection from 0.3% to 39.9% · finding · 0.819 · Cited to mechanistically support why the contemplative prompt changes what post-training-shaped final layers allow through
- Primary self-report measure: probability-weighted expected value over all ten digit-token logits, yielding a continuous rating that preserves full distributional signal
- A caveat qualifying the main claim.
- Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge
- Conceptual distinction motivated by entropy analyses showing probe and report entropy can diverge under steering
- Prompt providing model context about own architecture increases introspective detection from 0.3% to 39.9%. · finding · 0.769 · Mechanistic support for prompt-as-gate hypothesis: language frames enable access to latent capacities.
- Introspective capacity may follow a simple monotonic scaling law across all concepts and architectures · hypothesis · 0.768 · The paper treats this as possible but unconfirmed; current evidence shows concept-specific scaling only
- Primary limitation acknowledged by the authors; strongest evidence would require mechanistic activation analysis