finding
active
finding:greedy-decoded-self-reports-in-llama-3-2-3b-collapse-to-1-1-3-9-distinct-values-on-a-10-point-scale
Greedy-decoded self-reports in LLaMA-3.2-3B collapse to 1.1–3.9 distinct values on a 10-point scale
Demonstrates that default decoding masks introspective capacity; entropy 0.03–1.10 bits
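As a rough illustration of the two summary statistics quoted here, the sketch below counts distinct values and computes Shannon entropy in bits for a list of greedy-decoded ratings. It assumes the entropy is taken over the empirical distribution of reported values, which this entry does not state; the function name `distinct_and_entropy` and the sample ratings are hypothetical, not data from the paper.

```python
import math
from collections import Counter


def distinct_and_entropy(reports):
    """Return (number of distinct values, Shannon entropy in bits).

    `reports` is a list of integer ratings, e.g. greedy-decoded self-reports.
    Entropy is computed over the empirical distribution (an assumption about
    how the paper defines it).
    """
    counts = Counter(reports)
    n = len(reports)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return len(counts), entropy


# Hypothetical ratings: nearly every prompt yields the same greedy answer.
collapsed = [7, 7, 7, 7, 7, 7, 7, 8]
n_distinct, bits = distinct_and_entropy(collapsed)
print(n_distinct, round(bits, 2))
```

A fractional count such as 1.1 or 3.9 would then arise from averaging the per-condition distinct counts, which is again an assumption about the aggregation.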
Source paper
extracted_from · (2026) · Nicolas Martorell · Bruno Bianchi
Neighborhood — ranked by edge-count
Claims (1)
claim
- Central methodological contribution: computing probability-weighted expected value over digit-token logits recovers continuous, informative signal
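The claim above can be sketched in a few lines: instead of taking the single most likely rating token, softmax the logits over the rating tokens only and take the probability-weighted mean. This is a minimal sketch, not the authors' implementation. The function names, the 0–9 rating scale, and the example logits are all assumptions made for illustration; extracting the logits from a model (and mapping ratings to token ids) is tokenizer-specific and left out.

```python
import math


def expected_rating(rating_logits):
    """Probability-weighted expected rating.

    `rating_logits` maps each rating value to the model's logit for the token
    encoding that rating (hypothetical input format). The softmax is taken
    over these tokens only, so all other vocabulary mass is ignored.
    """
    m = max(rating_logits.values())  # subtract the max for numerical stability
    weights = {r: math.exp(l - m) for r, l in rating_logits.items()}
    z = sum(weights.values())
    return sum(r * w for r, w in weights.items()) / z


def greedy_rating(rating_logits):
    """Baseline: the rating whose token has the highest logit."""
    return max(rating_logits, key=rating_logits.get)


# Two made-up logit vectors over ratings 0..9 with the same argmax (5)
# but with the remaining mass leaning low in one and high in the other.
low = {r: 0.0 for r in range(10)}
low[5], low[2] = 3.0, 2.5
high = {r: 0.0 for r in range(10)}
high[5], high[8] = 3.0, 2.5

print(greedy_rating(low), greedy_rating(high))        # identical greedy answers
print(expected_rating(low) < expected_rating(high))   # expected values differ
```

Greedy decoding maps both distributions to the same integer, while the expected value stays continuous and separates them, which is the sense in which the logit-based readout recovers signal that greedy decoding discards.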
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Quantifies the information gain from using logit-based expected value over greedy or sampled decoding
- Baseline self-report method selecting highest-probability token; shown to collapse to few uninformative values
- Unlike probe drift, report drift magnitude does not follow a clean scaling law; size-slope is negative
- Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · claim · 0.788 · Central interpretive claim of the paper supported by causal ablation and activation evidence
- Shows behavioral pattern of self-correction is trainable in smaller models
- Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
- Experiment 2 aggregate amplification result showing amplifying deception features strongly suppresses consciousness claims
- Core empirical finding about layer-dependent truth direction emergence across task types