finding
active
finding:logit-based-self-report-achieves-3-1-3-7-bits-entropy-vs-0-03-1-10-bits-greedy-and-0-68-2-05-bits-sampled-in-llama-3-2-3b
Logit-based self-report achieves 3.1–3.7 bits entropy vs 0.03–1.10 bits greedy and 0.68–2.05 bits sampled in LLaMA-3.2-3B
Quantifies the information gain from using logit-based expected value over greedy or sampled decoding
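As a rough illustration of how figures like these could be computed, the sketch below takes the Shannon entropy (in bits) of the empirical distribution of reported values across prompts. This reading is an assumption: the entry does not state how entropy was estimated, and the `entropy_bits` helper and its `bin_width` parameter are hypothetical names introduced here, with binning used only so that continuous logit-based reports can be counted.

```python
import math
from collections import Counter


def entropy_bits(values, bin_width=None):
    """Shannon entropy (bits) of the empirical distribution of `values`.

    If `bin_width` is given, continuous values are first grouped into
    bins of that width (an assumed discretisation, not the source's).
    """
    if bin_width is not None:
        values = [math.floor(v / bin_width) for v in values]
    counts = Counter(values)
    n = len(values)
    # Sum of -p * log2(p) over the observed values or bins.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical reports: greedy decoding that always answers 7 carries
# no information, while four equally frequent answers carry 2 bits.
collapsed = entropy_bits([7, 7, 7, 7, 7, 7, 7, 7])
spread = entropy_bits([1, 2, 3, 4])
```

Under this reading, a report channel that collapses to one or two distinct values has entropy near zero regardless of how much the underlying state varies.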
Source paper
extracted_from (2026) · Nicolas Martorell · Bruno Bianchi
Neighborhood — ranked by edge-count
Claims (1)
claim
- Central methodological contribution: computing probability-weighted expected value over digit-token logits recovers continuous, informative signal
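The claim above can be sketched in a few lines: softmax the logits restricted to the digit tokens, then take the probability-weighted mean of the digit values. This is a minimal sketch and not the paper's implementation. The function names, the 0–9 rating scale, and the example logits are all assumptions made for illustration; in practice the logits would come from the model's next-token distribution at the answer position.

```python
import math


def expected_rating(digit_logits):
    """Probability-weighted expected rating over digit-token logits.

    `digit_logits` maps each integer rating to the logit of its digit
    token (hypothetical input format). The softmax is taken over the
    digit tokens only, so all other vocabulary entries are ignored.
    """
    m = max(digit_logits.values())  # subtract the max for numerical stability
    weights = {r: math.exp(l - m) for r, l in digit_logits.items()}
    z = sum(weights.values())
    return sum(r * w / z for r, w in weights.items())


def greedy_rating(digit_logits):
    """Rating that greedy decoding would emit: the arg-max digit."""
    return max(digit_logits, key=digit_logits.get)


# Two made-up logit vectors with the same top digit but different
# probability mass on the neighbouring digits.
leans_high = {r: 0.0 for r in range(10)}
leans_high.update({7: 2.0, 8: 1.5})
leans_low = {r: 0.0 for r in range(10)}
leans_low.update({7: 2.0, 6: 1.5})

same_greedy = greedy_rating(leans_high) == greedy_rating(leans_low)
gap = expected_rating(leans_high) - expected_rating(leans_low)
```

Both vectors decode greedily to 7, yet their expected values differ, which is the kind of graded signal that arg-max decoding discards.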
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Greedy-decoded self-reports in LLaMA-3.2-3B collapse to 1.1–3.9 distinct values on a 10-point scale · finding · 0.833 · Demonstrates that default decoding masks introspective capacity; entropy 0.03–1.10 bits
- Unlike probe drift, report drift magnitude does not follow a clean scaling law; size-slope is negative
- Shows behavioral pattern of self-correction is trainable in smaller models
- Strongest pooled introspective coupling across the four emotive concepts in the primary model
- Automated logit weight prediction achieves 74% mean accuracy for features vs 58% for neurons vs 50% chance · finding · 0.765 · Automated interpretability of logit weights confirms feature downstream effects are more interpretable than neuron effects
- Larger models linearly represent more general concepts including truth
- Evidence that improved introspection in focus→wellbeing arises from enriched internal state and report channels simultaneously
- LLaMA-3.1-8B-Instruct wellbeing introspection: ρ=0.93, isotonic R²=0.90 (LMM probe slope p<10⁻¹⁰) · finding · 0.757 · Near-ceiling introspective performance for wellbeing concept in 8B model; nearly deterministic probe-report relationship