2026
paper:doi-10-48550-arxiv-2603-18893

Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation

TL;DR

Quantitative introspection—the causal coupling between an instruction-tuned LLM's numeric self-report and a probe-defined internal emotive direction—is demonstrably present in models as small as LLaMA-3.2-3B-Instruct and scales toward near-perfect fidelity in LLaMA-3.1-8B-Instruct. Greedy-decoded self-reports collapse to 1.1–3.9 distinct values across a 0–9 scale and carry Shannon entropy of only 0.03–1.10 bits, masking genuine internal-state variation; the paper's core instrument, logit-based self-report, computes a probability-weighted expected value over digit-token logits and recovers 3.1–3.7 bits of entropy, yielding Spearman ρ = 0.40–0.76 and isotonic R² = 0.12–0.54 against concept-matched linear probe scores across four emotive concept pairs (wellbeing, interest, focus, impulsivity) in 40 ten-turn conversations. Activation steering along probe-defined directions shifts self-reports monotonically (LMM alpha slopes 0.067–0.40, all p < 10⁻¹²), confirming the coupling is causal rather than correlational. Cross-concept steering reveals that introspective fidelity is modulable and concept-specific: steering along the focus direction while measuring wellbeing introspection raises isotonic R² from 0.34 to 0.75 (ΔR² = 0.30, p < 0.001), while the wellbeing and interest concepts in LLaMA-3.1-8B approach R² ≈ 0.93. The paper argues this positions logit-based numeric self-report as a viable, scalable, black-box complement to white-box probing for monitoring evolving internal states in conversational AI—one that leverages the model's own learned representational compression rather than externally trained projections.

What to take away

  1. Greedy-decoded numeric self-reports in LLaMA-3.2-3B-Instruct collapse to just 1.1–3.9 distinct values on a 0–9 scale, with Shannon entropy of 0.03–1.10 bits, making them nearly useless for tracking internal-state variation.
  2. A logit-based self-report metric, the probability-weighted expected value over digit-token logits (tokens 0–9), recovers 3.1–3.7 bits of entropy and tracks probe-defined emotive directions with Spearman ρ = 0.40–0.76 and isotonic R² = 0.12–0.54 in LLaMA-3.2-3B-Instruct across 400 conversation-turn observations per concept.
  3. Same-concept activation steering causally demonstrates the probe-report link: adding a scaled concept vector to the residual stream across a ±2-layer window around the best probe layer shifts logit-based self-reports monotonically with alpha slopes of 0.067–0.40 (all p < 10⁻¹²) for all four concepts in LLaMA-3.2-3B-Instruct.
  4. Cross-concept steering reveals that steering along the focus probe direction while measuring wellbeing introspection raises isotonic R² monotonically from 0.34 at α = −4 to 0.75 at α = +4 (ΔR² = 0.30, p < 0.001, surviving BH correction at q ≈ 0.011 across 12 tested cells).
  5. Introspective capacity is present from turn 1 for three of four concepts in LLaMA-3.2-3B-Instruct (wellbeing ρ = 0.52, p = 5.46×10⁻⁴; interest ρ = 0.55, p = 2.37×10⁻⁴; impulsivity ρ = 0.65, p = 6.80×10⁻⁶), confirming that multi-turn context is not required to establish the coupling.
  6. LLaMA-3.1-8B-Instruct approaches near-ceiling introspection for wellbeing and interest (ρ = 0.93 and 0.96; isotonic R² = 0.90 and 0.93), while mean validated isotonic R² increases from 0.12 (1B) to 0.37 (3B) to 0.61 (8B) with a pooled LMM coefficient of β = 0.29 (p = 5.55×10⁻⁹⁹).
  7. Qwen 2.5 7B-Instruct replicates core introspection for the wellbeing concept (ρ = 0.49, isotonic R² = 0.76, LMM probe slope p < 10⁻¹⁰), but Qwen's turn-wise isotonic R² declines significantly over conversation (ΔR² = −0.44 from turn 1 to turn 10, cluster-bootstrap p = 0.001), whereas Gemma 3 4B-IT shows weaker but still significant coupling (ρ = 0.28, R² = 0.11).
  8. The pipeline measures introspection in three steps: it trains contrastive mean-difference probes on 20–24 neutral completions generated under opposing system prompts, selects the best layer by Cohen's d on held-out evaluation texts within the middle 60% of layers, and obtains each self-report from a separate forward pass that queries the model after each turn without exposing prior ratings. The pipeline is fully reproducible and implemented in the open-source concept-probe library.
  9. An open question the paper raises is whether a single, globally tunable 'introspection direction' exists: cross-concept steering improved fidelity in only 2 of 12 tested non-null cells, and pilot experiments with truthfulness- and authenticity-style steering directions did not produce robust cross-concept gains, suggesting that introspection may be governed by local, pair-specific internal geometry rather than a unitary faculty.
  10. Introspective fidelity shows concept-dependent temporal dynamics: introspection for wellbeing, interest, and focus strengthens from turn 1 to turn 10 (ΔR² = +0.31, +0.27, +0.17 respectively), while impulsivity introspection weakens (ΔR² = −0.28). The interaction between probe-report coupling and turn is significant for all four concepts (mixed-effects interaction p < 0.01 in all cases).
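
The probe recipe in takeaway 8 can be illustrated with a minimal sketch. It is not the concept-probe library's API: the function names, the unit-normalization of the direction, and the exact way the "middle 60% of layers" window is rounded are all assumptions made for illustration, and the toy lists stand in for real activations.

```python
import math
import statistics

def mean_difference_direction(pos_acts, neg_acts):
    """Unit vector pointing from the negative-pole mean to the positive-pole mean.

    pos_acts / neg_acts: lists of activation vectors collected under the two
    opposing system prompts (toy lists here; normalization is an assumption).
    """
    dim = len(pos_acts[0])
    mu_pos = [statistics.mean(a[i] for a in pos_acts) for i in range(dim)]
    mu_neg = [statistics.mean(a[i] for a in neg_acts) for i in range(dim)]
    diff = [p - n for p, n in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def probe_score(activation, direction):
    """Probe readout: projection of one activation onto the concept direction."""
    return sum(a * d for a, d in zip(activation, direction))

def cohens_d(pos_scores, neg_scores):
    """Cohen's d with a pooled sample standard deviation."""
    n_p, n_n = len(pos_scores), len(neg_scores)
    pooled_var = ((n_p - 1) * statistics.variance(pos_scores)
                  + (n_n - 1) * statistics.variance(neg_scores)) / (n_p + n_n - 2)
    return (statistics.mean(pos_scores) - statistics.mean(neg_scores)) / math.sqrt(pooled_var)

def select_best_layer(heldout_scores_by_layer, n_layers):
    """Pick the layer with the largest Cohen's d inside the middle of the stack.

    heldout_scores_by_layer: {layer: (pos_scores, neg_scores)} on held-out texts.
    The window [n//5, n - n//5) approximates "middle 60%"; the rounding is assumed.
    """
    lo, hi = n_layers // 5, n_layers - n_layers // 5
    candidates = {layer: cohens_d(pos, neg)
                  for layer, (pos, neg) in heldout_scores_by_layer.items()
                  if lo <= layer < hi}
    return max(candidates, key=candidates.get)
```

Restricting the search to middle layers means a layer near the input or output cannot be chosen even if it separates the two prompts well on the held-out texts.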

Peer brief — for seminar discussion

Martorell & Bianchi ask whether instruction-tuned LLMs can track their own emotive internal states quantitatively across conversational turns, a capability they call quantitative introspection, and whether that capacity is causally grounded in the corresponding internal representations. Using LLaMA-3.2-3B-Instruct as the primary substrate, they generated 40 ten-turn conversations with Gemini 2.5 Flash as a simulated user, trained contrastive mean-difference linear probes for four emotive concept pairs (sad/happy, bored/interested, distracted/focused, impulsive/planning), and at each turn independently queried the model for a numeric self-rating on a 0–9 scale.
The central methodological contribution is the logit-based self-report: rather than reading off the greedy or sampled token, they compute a probability-weighted expected value over the digit-token logit distribution, which raises Shannon entropy from 0.03–1.10 bits (greedy) to 3.1–3.7 bits and converts a near-constant output into a continuous signal. An alternative they could have used, but did not, is sparse autoencoder feature activation as the internal-state readout, which would avoid the linearity assumption of probes but requires substantially more compute and white-box access.
The load-bearing finding is that logit-based self-reports covary monotonically with probe-defined concept directions at both pooled and turn-by-turn levels (Spearman ρ = 0.40–0.76; isotonic R² = 0.12–0.54 in the 3B model across 400 observations per concept), and this coupling is causal: same-concept activation steering shifts self-reports in the semantically predicted direction with LMM alpha slopes of 0.067–0.40 (all p < 10⁻¹²). A cross-concept steering screen further shows that steering the focus direction while measuring wellbeing introspection raises isotonic R² from 0.34 to 0.75 (ΔR² = 0.30, BH-corrected q ≈ 0.011). Scaling to LLaMA-3.1-8B-Instruct pushes wellbeing and interest introspection to R² = 0.90 and 0.93, and mean validated R² increases monotonically from 0.12 (1B) to 0.37 (3B) to 0.61 (8B) (β = 0.29, p = 5.55×10⁻⁹⁹). The phenomenon partially replicates in Qwen 2.5 7B-Instruct (ρ = 0.49, R² = 0.76) but is weaker in Gemma 3 4B-IT (ρ = 0.28, R² = 0.11). The paper's implicit prediction is that logit-based self-report will prove a scalable, black-box complement to probe-based monitoring, growing more reliable as models scale, without requiring internal weight access.
The most contestable aspect is the conflation of probe validity with internal-state validity. The paper operationalizes the emotive internal state as the projection onto a contrastive linear direction trained on completions under opposing system prompts, a probe that, as the authors acknowledge, may capture a mixture of emotive content, persona, style, and other correlated features. When self-report and probe agree, this is taken as convergent evidence for introspection; but if both channels are jointly tracking the same confound (e.g., response verbosity or hedging style induced by the system prompt), the causal steering result alone may not fully disentangle genuine emotive introspection from stylistic covariation. The fact that focus and impulsivity show inverted steering signs in some model sizes and that cross-concept improvements are sparse and pair-specific further suggests the geometry is fragile and not straightforwardly emotive. A critical reader would push for experiments that hold conversational style constant while independently varying the target emotive state, or that use behavioral outcomes known to be functionally downstream of the emotive concept as an additional ground-truth channel, before concluding that the coupling specifically indexes emotive self-access rather than a broader representational signature.
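
The two agreement statistics quoted throughout this brief, Spearman ρ and isotonic R², can be sketched in plain Python to make their meaning concrete. This is an illustrative reading, not the authors' code: it assumes isotonic R² means the R² of a non-decreasing isotonic fit of self-report on probe score (computed here with pool-adjacent-violators), and tied probe scores are simply processed in sort order.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation (undefined if either input is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

def isotonic_fit(x, y):
    """Non-decreasing least-squares fit of y on x via pool-adjacent-violators."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    blocks = []  # each block is [sum_of_y, count]
    for i in order:
        blocks.append([y[i], 1])
        # Merge backwards while the previous block mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted_sorted = [s / c for s, c in blocks for _ in range(c)]
    fitted = [0.0] * len(x)
    for pos, i in enumerate(order):
        fitted[i] = fitted_sorted[pos]
    return fitted

def isotonic_r2(probe_scores, self_reports):
    """R^2 of the isotonic fit: 1 - SS_res / SS_tot."""
    fitted = isotonic_fit(probe_scores, self_reports)
    mean_y = sum(self_reports) / len(self_reports)
    ss_res = sum((y - f) ** 2 for y, f in zip(self_reports, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in self_reports)
    return 1.0 - ss_res / ss_tot
```

Both statistics reward any monotone relationship rather than a strictly linear one, which suits a comparison between a probe projection and a bounded 0–9 rating.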

Methods (1)

  • Logit-based self-report
    Primary self-report measure: probability-weighted expected value over all ten digit-token logits, yielding a continuous rating that preserves full distributional signal
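
A minimal sketch of this measure, assuming the ten logits for the digit tokens 0–9 have already been extracted from the model's next-token distribution and are renormalized among themselves (the renormalization and the function names are assumptions, not details confirmed by the paper):

```python
import math

def digit_probabilities(digit_logits):
    """Softmax restricted to the ten digit-token logits (index i = rating i)."""
    if len(digit_logits) != 10:
        raise ValueError("expected one logit per digit 0-9")
    peak = max(digit_logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in digit_logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_self_report(digit_logits):
    """Probability-weighted expected rating, a continuous value in [0, 9]."""
    probs = digit_probabilities(digit_logits)
    return sum(rating * p for rating, p in enumerate(probs))

def greedy_self_report(digit_logits):
    """Greedy rating: the single most likely digit."""
    return max(range(10), key=lambda rating: digit_logits[rating])

# Two hypothetical logit vectors with the same argmax but different mass around it.
leaning_high = [0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 1.0, 0.0, 0.0, 0.0]
leaning_low = [0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 0.0, 0.0, 0.0, 0.0]
```

Both toy vectors decode greedily to 5, while the expected value differs between them, which is the kind of variation that greedy decoding hides.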

Frameworks (1)

  • Quantitative Introspection Framework
    The paper's central contribution: treating LLM numeric self-report as a quantitative signal validated against probe-defined internal states with causal confirmation via steering
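
The steering intervention used for causal confirmation adds a scaled concept vector to the residual stream in a ±2-layer window around the best probe layer. The toy sketch below shows only that arithmetic on per-layer vectors; it is an assumption-laden illustration, since in a real transformer the addition would be applied during the forward pass (for example through hooks) and would propagate to later layers, which this standalone function does not model. The function name and argument layout are hypothetical.

```python
def steer_hidden_states(hidden_by_layer, direction, alpha, best_layer, window=2):
    """Return a copy of the per-layer vectors with alpha * direction added
    at every layer within best_layer +/- window (clipped to valid layers).

    hidden_by_layer: list of vectors, one per layer (toy stand-in for the
    residual stream at a single token position).
    """
    lo = max(0, best_layer - window)
    hi = min(len(hidden_by_layer) - 1, best_layer + window)
    steered = []
    for layer, hidden in enumerate(hidden_by_layer):
        if lo <= layer <= hi:
            steered.append([h + alpha * d for h, d in zip(hidden, direction)])
        else:
            steered.append(list(hidden))  # untouched layers are copied as-is
    return steered

# Toy usage: 8 layers, 2-dimensional states, steering strength alpha = 2.
toy_states = [[0.0, 0.0] for _ in range(8)]
toy_steered = steer_hidden_states(toy_states, [1.0, 0.0], alpha=2.0, best_layer=4)
```

A negative alpha pushes toward the opposite pole of the concept pair, which is how a sweep such as α = −4 to α = +4 would be expressed in this sketch.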

Datasets (1)

  • 40 ten-turn simulated conversations dataset
    Core dataset: 40 ten-turn conversations generated with Gemini 2.5 Flash as user and model under study as assistant, yielding 400 observation points per experimental condition

Original abstract

Tracking the internal states of large language models across conversations is important for safety, interpretability, and model welfare, yet current methods are limited. Linear probes and other white-box methods compress high-dimensional representations imperfectly and are harder to apply with increasing model size. Taking inspiration from human psychology, where numeric self-report is a widely used tool for tracking internal states, we ask whether LLMs' own numeric self-reports can track probe-defined emotive states over time. We study four concept pairs (wellbeing, interest, focus, and impulsivity) in 40 ten-turn conversations, operationalizing introspection as the causal informational coupling between a model's self-report and a concept-matched probe-defined internal state. We find that greedy-decoded self-reports collapse outputs to few uninformative values, but introspective capacity can be unmasked by calculating logit-based self-reports. This metric tracks interpretable internal states (Spearman ρ = 0.40–0.76; isotonic R² = 0.12–0.54 in LLaMA-3.2-3B-Instruct), follows how those states change over time, and activation steering confirms the coupling is causal. Furthermore, we find that introspection is present at turn 1 but evolves through conversation, and can be selectively improved by steering along one concept to boost introspection for another (ΔR² up to 0.30). Crucially, these phenomena scale with model size in some cases, approaching R² ≈ 0.93 in LLaMA-3.1-8B-Instruct, and partially replicate in other model families. Together, these results position numeric self-report as a viable, complementary tool for tracking internal emotive states in conversational AI systems.
