claim

active

claim:logit-based-self-report-unmasks-introspective-capacity-that-greedy-decoding-conceals

Logit-based self-report unmasks introspective capacity that greedy decoding conceals

Central methodological contribution: computing a probability-weighted expected value over digit-token logits recovers a continuous, informative signal that greedy decoding discards
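A minimal sketch of the contrast this claim describes, not the authors' implementation. It assumes the ten digit tokens are "0" through "9" and that their raw logits have already been read from the model at the rating position; the function names `expected_rating` and `greedy_rating` and the example logits are hypothetical.

```python
import math

def expected_rating(digit_logits, values=None):
    """Probability-weighted expected value over digit-token logits.

    Assumption: digit_logits[i] is the logit of the token for values[i];
    by default the values are 0..9 (one per digit token).
    """
    if values is None:
        values = list(range(len(digit_logits)))
    # Softmax restricted to the digit tokens (max subtracted for stability).
    peak = max(digit_logits)
    weights = [math.exp(logit - peak) for logit in digit_logits]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, values))

def greedy_rating(digit_logits, values=None):
    """Rating that greedy decoding would emit: the single most likely digit."""
    if values is None:
        values = list(range(len(digit_logits)))
    best = max(range(len(digit_logits)), key=lambda i: digit_logits[i])
    return values[best]

# Two hypothetical logit vectors with the same top digit but different mass above it.
logits_a = [0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0]
logits_b = [0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 4.0, 3.0, 0.0, 0.0]
print(greedy_rating(logits_a), greedy_rating(logits_b))
print(expected_rating(logits_a), expected_rating(logits_b))
```

Both vectors yield the same greedy digit, while the expected value differs between them. That difference is the kind of continuous signal the claim says greedy decoding conceals.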

Source paper

extracted_from

Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation

(2026) · Nicolas Martorell · Bruno Bianchi

Neighborhood — ranked by edge-count

Findings (2)

finding

Greedy-decoded self-reports in LLaMA-3.2-3B collapse to 1.1–3.9 distinct values on a 10-point scale
supports
Demonstrates that default decoding masks introspective capacity; entropy 0.03–1.10 bits
Logit-based self-report achieves 3.1–3.7 bits entropy vs 0.03–1.10 bits greedy and 0.68–2.05 bits sampled in LLaMA-3.2-3B
supports
Quantifies the information gain from using logit-based expected value over greedy or sampled decoding
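The entropy figures above can be illustrated with a short sketch. This card does not state how entropy was computed, so the following is an assumption: entropy is taken over the empirical distribution of ratings collected across many prompts, with continuous logit-based ratings discretized into equal-width bins. The rating range of 0 to 9, the bin count of 20, and both function names are hypothetical choices. Note that entropy over ten discrete values cannot exceed log2(10) ≈ 3.32 bits, so a figure of 3.7 bits implies more than ten bins.

```python
import math
from collections import Counter

def shannon_entropy_bits(labels):
    """Shannon entropy, in bits, of the empirical distribution of labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def bin_ratings(ratings, lo=0.0, hi=9.0, n_bins=20):
    """Map continuous ratings to equal-width bin indices (assumed scheme)."""
    width = (hi - lo) / n_bins
    # Clamp so that a rating equal to hi falls in the last bin.
    return [min(int((r - lo) / width), n_bins - 1) for r in ratings]

# Hypothetical data: greedy reports stuck on one digit versus spread-out
# continuous ratings from the logit-based measure.
greedy_reports = [5, 5, 5, 5, 5, 5, 5, 6]
logit_reports = [2.1, 3.4, 4.0, 4.9, 5.3, 6.2, 6.8, 7.5]

print(shannon_entropy_bits(greedy_reports))
print(shannon_entropy_bits(bin_ratings(logit_reports)))
```

With these made-up inputs the greedy reports concentrate on a single value and carry well under one bit, while the binned continuous ratings spread across several bins.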

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Pearson-Vogel et al.: accurate self-description prompts increase introspective detection from 0.3% to 39.9% · finding · 0.819
Cited to mechanistically support why the contemplative prompt changes what post-training-shaped final layers allow through
Logit-based self-report · method · 0.791
Primary self-report measure: probability-weighted expected value over all ten digit-token logits, yielding a continuous rating that preserves full distributional signal
This introspective capacity is highly unreliable and context-dependent in today's models · claim · 0.780
A caveat qualifying the main claim.
We hypothesize that native self-report, fine-tuned introspection models, and trained activation-to-language systems will show different performance on bias-resistant localization and strength benchmarks · hypothesis · 0.780
Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge
Introspective ability can be decomposed into: (i) information available about internal state and (ii) capacity to transform that signal into precise output reports · claim · 0.775
Conceptual distinction motivated by entropy analyses showing probe and report entropy can diverge under steering
Prompt providing model context about own architecture increases introspective detection from 0.3% to 39.9% · finding · 0.769
Mechanistic support for prompt-as-gate hypothesis: language frames enable access to latent capacities.
Introspective capacity may follow a simple monotonic scaling law across all concepts and architectures · hypothesis · 0.768
The paper treats this as possible but unconfirmed; current evidence shows concept-specific scaling only
Behavioral evidence from closed-weight models cannot definitively rule out that self-reports reflect training artifacts or sophisticated simulation rather than genuine self-awareness · claim · 0.768
Primary limitation acknowledged by the authors; strongest evidence would require mechanistic activation analysis