finding

active

finding:logit-self-report-drift-positive-for-all-three-llama-sizes-turn-slopes-0-159-0-038-0-141-all-p-10-20-but-does-not-increase-monotonically-with-scale

Logit self-report drift positive for all three LLaMA sizes (turn slopes 0.159, 0.038, 0.141; all p<10⁻²⁰) but does not increase monotonically with scale

Unlike probe drift, report drift magnitude does not follow a clean scaling law; size-slope is negative
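
The non-monotonic pattern can be checked directly from the three turn slopes quoted above. The sketch below is illustrative only: it assumes the slopes are listed in 1B, 3B, 8B order and uses nominal parameter counts in billions as the size variable. It fits an ordinary least-squares line through three points, which is a simplification and not the paper's own size-slope analysis.

```python
import math

# Turn slopes quoted in the finding. ASSUMPTION: listed in 1B, 3B, 8B order,
# with nominal sizes in billions of parameters.
turn_slopes = {1.0: 0.159, 3.0: 0.038, 8.0: 0.141}

sizes = sorted(turn_slopes)
slopes = [turn_slopes[s] for s in sizes]

# A monotonic increase would need every slope to exceed the previous one.
is_monotonic = all(a < b for a, b in zip(slopes, slopes[1:]))

# Ordinary least-squares slope of turn slope against log(model size).
xs = [math.log(s) for s in sizes]
x_mean = sum(xs) / len(xs)
y_mean = sum(slopes) / len(slopes)
size_slope = (
    sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, slopes))
    / sum((x - x_mean) ** 2 for x in xs)
)

print(f"monotonic increase: {is_monotonic}")
print(f"slope vs log(size): {size_slope:.4f}")
```

Under these assumptions the 3B dip breaks monotonicity, and the fitted slope against log(size) comes out slightly negative, in line with the description above.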

Source paper

extracted_from

Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation

(2026) · Nicolas Martorell · Bruno Bianchi

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Wellbeing probe-score drift across turns significant at all three LLaMA scales (slopes=0.006, 0.005, 0.013 for 1B, 3B, 8B; all p<10⁻¹⁰); drift magnitude increases with scale · finding · 0.844
Internal-state drift generalizes across scales; normalized drift also increases significantly with log(model size)
Logit-based self-report achieves 3.1–3.7 bits entropy vs 0.03–1.10 bits greedy and 0.68–2.05 bits sampled in LLaMA-3.2-3B · finding · 0.832
Quantifies the information gain from using logit-based expected value over greedy or sampled decoding
Interest probe score drifts positively across turns: LMM slope=0.005, p=4.12×10⁻¹⁴ in LLaMA-3.2-3B · finding · 0.815
Demonstrates genuine internal-state dynamics in LLMs during multi-turn conversation
The difficulty boundary for truth directions replicates across all four tested models (Llama-3.2-3B, Llama-3.1-8B, Gemma-2-2b, Gemma-2-9b); generalization to F3-F5 remains consistently low regardless of model size or family. · finding · 0.803
Establishes generalizability of the core difficulty-boundary finding across model families.
Same-concept steering shifts self-report monotonically for all four concepts: LMM alpha slopes 0.067–0.40, all p<10⁻¹² · finding · 0.799
Causal confirmation that coupling between self-report and internal state is genuine; steering toward positive pole increases report
Llama-3.3-70B corrected response scores 75/100 rather than 100 due to residual steering effects (Snell's law reference) · finding · 0.798
Illustrative finding that ESR mitigates but does not fully eliminate steering influence
Fine-tuning Llama-3.1-8B on self-correction examples increases multi-attempt rate proportionally with training data ratio · finding · 0.796
Shows behavioral pattern of self-correction is trainable in smaller models
Same-concept steering LMM alpha slopes in LLaMA-3.2-3B: wellbeing=0.19, focus=0.40, interest=0.25, impulsivity=0.067 · finding · 0.796
Quantifies per-concept effect size of same-concept steering on self-report