finding

active

finding:logit-based-self-report-achieves-3-1-3-7-bits-entropy-vs-0-03-1-10-bits-greedy-and-0-68-2-05-bits-sampled-in-llama-3-2-3b

Logit-based self-report achieves 3.1–3.7 bits entropy vs 0.03–1.10 bits greedy and 0.68–2.05 bits sampled in LLaMA-3.2-3B

Quantifies the information gained by reporting a logit-based expected value instead of a greedy or sampled decode

Source paper

extracted_from

Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation

(2026) · Nicolas Martorell · Bruno Bianchi

Neighborhood — ranked by edge-count

Claims (1)

claim

Logit-based self-report unmasks introspective capacity that greedy decoding conceals
supports
Central methodological contribution: computing a probability-weighted expected value over digit-token logits recovers a continuous, informative signal
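The probability-weighted expected value described in this claim can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `expected_rating` is hypothetical, and the sketch assumes the rating scale is 0–9 with one logit per single-digit token, already extracted from the model's next-token logits.

```python
import math

def expected_rating(digit_logits):
    """Return (expected value, greedy rating, probabilities).

    digit_logits: dict mapping an integer rating to the logit of its
    digit token. ASSUMPTION: ratings are single-digit tokens; the
    paper's exact token set and scale are not given on this page.
    """
    # Softmax restricted to the digit tokens, shifted for numerical stability.
    top = max(digit_logits.values())
    weights = {r: math.exp(l - top) for r, l in digit_logits.items()}
    total = sum(weights.values())
    probs = {r: w / total for r, w in weights.items()}

    # Continuous self-report: probability-weighted mean rating.
    expected = sum(r * p for r, p in probs.items())
    # Greedy self-report: the single most probable rating.
    greedy = max(probs, key=probs.get)
    return expected, greedy, probs

# Illustrative (made-up) logits: mass is split between ratings 3 and 4.
value, argmax_rating, _ = expected_rating({3: 2.0, 4: 1.0})
print(f"expected={value:.3f}, greedy={argmax_rating}")
```

The greedy rating discards how the probability mass is split between neighboring digits, whereas the expected value moves continuously as that split changes, which is the sense in which it carries more signal.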

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Greedy-decoded self-reports in LLaMA-3.2-3B collapse to 1.1–3.9 distinct values on a 10-point scale · finding · 0.833
Demonstrates that default decoding masks introspective capacity; entropy 0.03–1.10 bits
Logit self-report drift positive for all three LLaMA sizes (turn slopes 0.159, 0.038, 0.141; all p<10⁻²⁰) but does not increase monotonically with scale · finding · 0.832
Unlike probe drift, report drift magnitude does not follow a clean scaling law; size-slope is negative
Fine-tuning Llama-3.1-8B on self-correction examples increases multi-attempt rate proportionally with training data ratio · finding · 0.769
Shows behavioral pattern of self-correction is trainable in smaller models
Interest concept: Spearman ρ=0.76, isotonic R²=0.54 between logit self-report and probe score in LLaMA-3.2-3B (n=400) · finding · 0.768
Strongest pooled introspective coupling across the four emotive concepts in the primary model
Automated logit weight prediction achieves 74% mean accuracy for features vs 58% for neurons vs 50% chance · finding · 0.765
Automated interpretability of logit weights confirms feature downstream effects are more interpretable than neuron effects
LLaMA-2-70B and 13B probes generalize better across datasets than 7B probes across all training sets and probe types · finding · 0.765
Larger models linearly represent more general concepts including truth
Focus→wellbeing steering: both probe entropy (1.09→1.67 bits) and report entropy (0.88→1.69 bits) increase monotonically with α · finding · 0.758
Evidence that improved introspection in focus→wellbeing arises from enriched internal state and report channels simultaneously
LLaMA-3.1-8B-Instruct wellbeing introspection: ρ=0.93, isotonic R²=0.90 (LMM probe slope p<10⁻¹⁰) · finding · 0.757
Near-ceiling introspective performance for wellbeing concept in 8B model; nearly deterministic probe-report relationship
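
Several entries above quote entropies in bits for sets of self-reports. As a rough illustration of what such a number measures, the sketch below computes the Shannon entropy of the empirical distribution of reported values. It is an assumption-laden example: the function names are hypothetical, the data are made up, and the binning used for continuous logit-based reports (`lo`, `hi`, `n_bins`) is a placeholder, since this page does not state the estimator the paper used.

```python
import math
from collections import Counter

def empirical_entropy_bits(reports):
    """Shannon entropy (bits) of the empirical distribution of discrete reports."""
    if not reports:
        raise ValueError("reports must be non-empty")
    n = len(reports)
    counts = Counter(reports)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def binned_entropy_bits(values, lo, hi, n_bins):
    """Entropy (bits) of continuous reports after equal-width binning.

    ASSUMPTION: equal-width bins on [lo, hi]; the bin count is illustrative.
    """
    width = (hi - lo) / n_bins
    # Clamp the top edge so that a value equal to hi falls in the last bin.
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    return empirical_entropy_bits(bins)

# Made-up data: a collapsed greedy channel versus a spread-out continuous one.
greedy_reports = [7, 7, 7, 7, 7, 7, 7, 8]
continuous_reports = [1.2, 2.9, 3.4, 4.8, 5.1, 6.6, 7.3, 8.7]
print(empirical_entropy_bits(greedy_reports))
print(binned_entropy_bits(continuous_reports, lo=0.0, hi=9.0, n_bins=18))
```

A report channel that collapses onto one or two values has entropy near zero regardless of the underlying state, while a channel that spreads across many distinct values has higher entropy; the exact figure depends on the binning chosen for continuous values.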