Bianchi, Bruno
Authored papers (1)
Quantitative introspection—the causal coupling between an instruction-tuned LLM's numeric self-report and a probe-defined internal emotive direction—is demonstrably present in models as small as LLaMA-3.2-3B-Instruct and scales toward near-perfect fidelity in LLaMA-3.1-8B-Instruct. Greedy-decoded self-reports collapse to 1.1–3.9 distinct values across a 0–9 scale and carry Shannon entropy of only 0.03–1.10 bits, masking genuine internal-state variation; the paper's core instrument, logit-based self-report, computes a probability-weighted expected value over digit-token logits and recovers 3.1–3.7 bits of entropy, yielding Spearman ρ = 0.40–0.76 and isotonic R² = 0.12–0.54 against concept-matched linear probe scores across four emotive concept pairs (wellbeing, interest, focus, impulsivity) in 40 ten-turn conversations. Activation steering along probe-defined directions shifts self-reports monotonically (LMM alpha slopes 0.067–0.40, all p < 10⁻¹²), confirming the coupling is causal rather than correlational. Cross-concept steering reveals that introspective fidelity is modulable and concept-specific: steering along the focus direction while measuring wellbeing introspection raises isotonic R² from 0.34 to 0.75 (ΔR² = 0.30, p < 0.001), while the wellbeing and interest concepts in LLaMA-3.1-8B approach R² ≈ 0.93. The paper argues this positions logit-based numeric self-report as a viable, scalable, black-box complement to white-box probing for monitoring evolving internal states in conversational AI—one that leverages the model's own learned representational compression rather than externally trained projections.
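The logit-based self-report described above can be sketched as follows. This is a minimal illustration, not the paper's implementation. It assumes the caller has already extracted the ten next-token logits for the digit tokens "0" through "9" at the answer position, and that the softmax is renormalized over those ten tokens only. The function names `logit_self_report` and `greedy_self_report` are hypothetical. The entropy computed here is that of a single digit distribution, which may differ from how the paper aggregates entropy across responses.

```python
import math


def logit_self_report(digit_logits):
    """Return (expected_rating, entropy_bits) for one self-report.

    digit_logits: sequence of 10 floats, the logits of the tokens
    "0".."9" (assumed to be extracted by the caller).
    """
    if len(digit_logits) != 10:
        raise ValueError("expected exactly 10 digit logits")

    # Numerically stable softmax restricted to the digit tokens
    # (an assumption: all other vocabulary mass is ignored).
    peak = max(digit_logits)
    exps = [math.exp(x - peak) for x in digit_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Probability-weighted expected value over the 0-9 scale.
    expected = sum(digit * p for digit, p in enumerate(probs))

    # Shannon entropy of the digit distribution, in bits.
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return expected, entropy


def greedy_self_report(digit_logits):
    """Return the single digit a greedy decoder would emit (argmax)."""
    return max(range(10), key=lambda digit: digit_logits[digit])


if __name__ == "__main__":
    # Illustrative logits only; these are not values from the paper.
    logits = [0.1, 0.3, 0.9, 1.4, 2.0, 2.2, 1.8, 1.0, 0.4, 0.1]
    rating, bits = logit_self_report(logits)
    print(f"greedy digit:   {greedy_self_report(logits)}")
    print(f"expected value: {rating:.3f}")
    print(f"entropy (bits): {bits:.3f}")
```

The greedy digit discards everything except the argmax, so small shifts in the underlying distribution leave it unchanged. The expected value moves continuously with the logits, which is why it can track finer variation in a probe score.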
Co-authors (2)
- Nicolas Martorell (6 shared)
- Bruno Bianchi (2 shared)