finding
active
Logit self-report drift positive for all three LLaMA sizes (turn slopes 0.159, 0.038, 0.141; all p<10⁻²⁰) but does not increase monotonically with scale
Unlike probe drift, report drift magnitude does not follow a clean scaling law; size-slope is negative
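The non-monotonic pattern can be checked directly from the three reported turn slopes. The following is a minimal sketch, not the source paper's analysis: it assumes the slopes are listed from the smallest to the largest model (the card does not state the order), and it uses a least-squares trend over rank order as a rough stand-in for the reported size-slope. The helper names `is_monotonic_increasing` and `rank_trend` are hypothetical.

```python
# Turn slopes as reported in the finding. ASSUMPTION: the order runs from the
# smallest to the largest LLaMA model; the card itself does not say so.
turn_slopes = [0.159, 0.038, 0.141]


def is_monotonic_increasing(values):
    """Return True if every value is strictly larger than the one before it."""
    return all(b > a for a, b in zip(values, values[1:]))


def rank_trend(values):
    """Least-squares slope of the values against their rank index 0, 1, 2, ...

    This is only a rank-based proxy; the paper's size-slope presumably uses
    actual model sizes, which are not given here.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


all_positive = all(s > 0 for s in turn_slopes)
monotonic = is_monotonic_increasing(turn_slopes)
trend = rank_trend(turn_slopes)

print(f"all positive: {all_positive}")
print(f"monotonic increase with scale: {monotonic}")
print(f"rank-order trend negative: {trend < 0}")
```

Under the assumed ordering, every slope is positive, the sequence dips at the middle model, and the rank-order trend is slightly negative, which matches the direction of the size-slope described above.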
Source paper
extracted_from · (2026) · Nicolas Martorell · Bruno Bianchi
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Internal-state drift generalizes across scales; normalized drift also increases significantly with log(model size)
- Quantifies the information gain from using logit-based expected value over greedy or sampled decoding
- Interest probe score drifts positively across turns: LMM slope=0.005, p=4.12×10⁻¹⁴ in LLaMA-3.2-3B (finding · cosine 0.815). Demonstrates genuine internal-state dynamics in LLMs during multi-turn conversation
- Establishes generalizability of the core difficulty-boundary finding across model families.
- Causal confirmation that coupling between self-report and internal state is genuine; steering toward positive pole increases report
- Illustrative finding that ESR mitigates but does not fully eliminate steering influence
- Shows behavioral pattern of self-correction is trainable in smaller models
- Quantifies per-concept effect size of same-concept steering on self-report