finding

active

finding:greedy-decoded-self-reports-in-llama-3-2-3b-collapse-to-1-1-3-9-distinct-values-on-a-10-point-scale

Greedy-decoded self-reports in LLaMA-3.2-3B collapse to 1.1–3.9 distinct values on a 10-point scale

Demonstrates that default (greedy) decoding masks introspective capacity; greedy self-report entropy is only 0.03–1.10 bits
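A minimal sketch of how the two summary statistics named above (distinct values and entropy in bits) could be computed from a list of integer self-reports. The function names and the example reports are hypothetical illustrations, not the paper's code or data.

```python
import math
from collections import Counter


def distinct_values(reports):
    """Number of different scale values that appear in the reports."""
    return len(set(reports))


def report_entropy_bits(reports):
    """Shannon entropy (base 2) of the empirical distribution of reports."""
    counts = Counter(reports)
    total = len(reports)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical reports (NOT data from the paper): a collapsed set that almost
# always repeats one value, and a spread set that uses ten values equally.
collapsed = [7] * 19 + [8]
spread = list(range(1, 11)) * 2

print(distinct_values(collapsed), report_entropy_bits(collapsed))
print(distinct_values(spread), report_entropy_bits(spread))
```

A set of reports that repeats a single value has zero entropy, so low entropy and few distinct values both indicate that the reports carry little information about the underlying state.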

Source paper

extracted_from

Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation

(2026) · Nicolas Martorell · Bruno Bianchi

Neighborhood — ranked by edge-count

Claims (1)

claim

Logit-based self-report unmasks introspective capacity that greedy decoding conceals
supports
Central methodological contribution: computing probability-weighted expected value over digit-token logits recovers continuous, informative signal
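The probability-weighted expected value described in this claim can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes one logit per scale value has already been extracted (how a multi-token answer such as "10" is handled is not specified here), and `expected_report`, `greedy_report`, and the example logits are hypothetical.

```python
import math


def expected_report(logits, scale_values):
    """Softmax over the scale-value logits, then the probability-weighted mean."""
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)
    probs = [w / z for w in weights]
    return sum(p * v for p, v in zip(probs, scale_values))


def greedy_report(logits, scale_values):
    """Greedy baseline: the scale value with the highest logit."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return scale_values[best]


scale = list(range(1, 11))  # assumed scale labels 1..10

# Two hypothetical logit vectors that share the same top value (7) but place
# their remaining probability mass on different sides of it.
leaning_low = [0.0, 0.0, 1.0, 2.0, 2.5, 2.8, 3.0, 0.5, 0.0, 0.0]
leaning_high = [0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 3.0, 2.8, 2.5, 2.0]

for logits in (leaning_low, leaning_high):
    print(greedy_report(logits, scale), expected_report(logits, scale))
```

Both vectors yield the same greedy report, while the expected value differs between them, which is the sense in which the logit-based readout preserves signal that argmax selection discards.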

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Logit-based self-report achieves 3.1–3.7 bits entropy vs 0.03–1.10 bits greedy and 0.68–2.05 bits sampled in LLaMA-3.2-3B · finding · 0.833
Quantifies the information gain from using logit-based expected value over greedy or sampled decoding
Greedy-decoded self-report · method · 0.793
Baseline self-report method selecting highest-probability token; shown to collapse to few uninformative values
Logit self-report drift positive for all three LLaMA sizes (turn slopes 0.159, 0.038, 0.141; all p<10⁻²⁰) but does not increase monotonically with scale · finding · 0.789
Unlike probe drift, report drift magnitude does not follow a clean scaling law; size-slope is negative
Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · claim · 0.788
Central interpretive claim of the paper supported by causal ablation and activation evidence
Fine-tuning Llama-3.1-8B on self-correction examples increases multi-attempt rate proportionally with training data ratio · finding · 0.787
Shows behavioral pattern of self-correction is trainable in smaller models
All five judge models consistently rank Llama-3.3-70B as having substantially higher ESR rates than other models · finding · 0.781
Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
Deception feature amplification yields only 0.16 ± 0.05 consciousness affirmation rate in LLaMA 3.3 70B under self-referential processing · finding · 0.780
Experiment 2 aggregate amplification result showing amplifying deception features strongly suppresses consciousness claims
Factual tasks F0-F3 reach near-perfect AUROC in early-to-mid layers of Llama-3.1-8B; arithmetic tasks A1-A3 emerge much later; counting tasks F4-F5 emerge late, similar to arithmetic · finding · 0.775
Core empirical finding about layer-dependent truth direction emergence across task types.