paper:doi-10-48550-arxiv-2603-18893
Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
TL;DR
Quantitative introspection—the causal coupling between an instruction-tuned LLM's numeric self-report and a probe-defined internal emotive direction—is demonstrably present in models as small as LLaMA-3.2-3B-Instruct and scales toward near-perfect fidelity in LLaMA-3.1-8B-Instruct. Greedy-decoded self-reports collapse to 1.1–3.9 distinct values across a 0–9 scale and carry Shannon entropy of only 0.03–1.10 bits, masking genuine internal-state variation; the paper's core instrument, logit-based self-report, computes a probability-weighted expected value over digit-token logits and recovers 3.1–3.7 bits of entropy, yielding Spearman ρ = 0.40–0.76 and isotonic R² = 0.12–0.54 against concept-matched linear probe scores across four emotive concept pairs (wellbeing, interest, focus, impulsivity) in 40 ten-turn conversations. Activation steering along probe-defined directions shifts self-reports monotonically (LMM alpha slopes 0.067–0.40, all p < 10⁻¹²), confirming the coupling is causal rather than correlational. Cross-concept steering reveals that introspective fidelity is modulable and concept-specific: steering along the focus direction while measuring wellbeing introspection raises isotonic R² from 0.34 to 0.75 (ΔR² = 0.30, p < 0.001), while the wellbeing and interest concepts in LLaMA-3.1-8B approach R² ≈ 0.93. The paper argues this positions logit-based numeric self-report as a viable, scalable, black-box complement to white-box probing for monitoring evolving internal states in conversational AI—one that leverages the model's own learned representational compression rather than externally trained projections.
What to take away
- 1. Greedy-decoded numeric self-reports in LLaMA-3.2-3B-Instruct collapse to just 1.1–3.9 distinct values on a 0–9 scale, with Shannon entropy of 0.03–1.10 bits, making them nearly useless for tracking internal-state variation.
- 2. A logit-based self-report metric—the probability-weighted expected value over digit-token logits (tokens 0–9)—recovers 3.1–3.7 bits of entropy and tracks probe-defined emotive directions with Spearman ρ = 0.40–0.76 and isotonic R² = 0.12–0.54 in LLaMA-3.2-3B-Instruct across 400 conversation-turn observations per concept.
- 3. Same-concept activation steering causally demonstrates the probe-report link: adding a scaled concept vector to the residual stream across a ±2-layer window around the best probe layer shifts logit-based self-reports monotonically with alpha slopes of 0.067–0.40 (all p < 10⁻¹²) for all four concepts in LLaMA-3.2-3B-Instruct.
- 4. Cross-concept steering reveals that steering along the focus probe direction while measuring wellbeing introspection raises isotonic R² monotonically from 0.34 at α = −4 to 0.75 at α = +4 (ΔR² = 0.30, p < 0.001, surviving BH correction at q ≈ 0.011 across 12 tested cells).
- 5. Introspective capacity is present from turn 1 for three of four concepts in LLaMA-3.2-3B-Instruct (wellbeing ρ = 0.52, p = 5.46×10⁻⁴; interest ρ = 0.55, p = 2.37×10⁻⁴; impulsivity ρ = 0.65, p = 6.80×10⁻⁶), confirming that multi-turn context is not required to establish the coupling.
- 6. LLaMA-3.1-8B-Instruct approaches near-ceiling introspection for wellbeing and interest (ρ = 0.93 and 0.96; isotonic R² = 0.90 and 0.93), while mean validated isotonic R² increases from 0.12 (1B) to 0.37 (3B) to 0.61 (8B) with a pooled LMM coefficient of β = 0.29 (p = 5.55×10⁻⁹⁹).
- 7. Qwen 2.5 7B-Instruct replicates core introspection for the wellbeing concept (ρ = 0.49, isotonic R² = 0.76, LMM probe slope p < 10⁻¹⁰), but Qwen's turn-wise isotonic R² declines significantly over conversation (ΔR² = −0.44 from turn 1 to turn 10, cluster-bootstrap p = 0.001), whereas Gemma 3 4B-IT shows weaker but still significant coupling (ρ = 0.28, R² = 0.11).
- 8. The pipeline trains contrastive mean-difference probes on 20–24 neutral completions generated under opposing system prompts, selects the best layer by Cohen's d on held-out evaluation texts within the middle 60% of layers, and measures self-report via a separate forward pass that queries the model after each turn without exposing prior ratings. This reproducible pipeline is implemented in the open-source concept-probe library.
- 9. An open question the paper raises is whether a single, globally tunable 'introspection direction' exists: cross-concept steering improved fidelity in only 2 of the 12 tested off-diagonal (cross-concept) cells, and pilot experiments with truthfulness- and authenticity-style steering directions did not produce robust cross-concept gains, suggesting that introspection may be governed by local, pair-specific internal geometry rather than a unitary faculty.
- 10. Introspective fidelity shows concept-dependent temporal dynamics: wellbeing, interest, and focus introspection increases from turn 1 to turn 10 (ΔR² = +0.31, +0.27, +0.17 respectively), while impulsivity introspection weakens (ΔR² = −0.28), with probe-report coupling interaction terms significant for all four concepts (mixed-effects interaction p < 0.01 in all cases).
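The probe-training and layer-selection steps described in item 8 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the concept-probe library's API: the function names, the unit-normalization of the direction, and the data layout (a dict mapping layer index to a pair of activation lists) are all hypothetical.

```python
import math
from statistics import mean, variance


def mean_difference_direction(pos, neg):
    """Unit vector pointing from the negative-class mean to the positive-class mean.

    Normalizing to unit length is an assumption made for this sketch.
    """
    dim = len(pos[0])
    diff = [mean(v[i] for v in pos) - mean(v[i] for v in neg) for i in range(dim)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def project(vec, direction):
    """Probe score: dot product of an activation with the probe direction."""
    return sum(a * b for a, b in zip(vec, direction))


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


def select_best_layer(train, heldout, n_layers):
    """Pick the layer whose probe best separates held-out texts.

    train / heldout: dict layer -> (pos_vectors, neg_vectors)  (hypothetical layout).
    Only the middle 60% of layers is searched (20% skipped at each end).
    """
    skip = round(n_layers * 0.2)
    best = None
    for layer in range(skip, n_layers - skip):
        direction = mean_difference_direction(*train[layer])
        pos = [project(v, direction) for v in heldout[layer][0]]
        neg = [project(v, direction) for v in heldout[layer][1]]
        d = cohens_d(pos, neg)
        if best is None or d > best[1]:
            best = (layer, d, direction)
    return best  # (layer index, Cohen's d, probe direction)
```

Fitting the direction on one set of completions and scoring Cohen's d on held-out texts keeps layer selection from rewarding a direction that merely memorizes the training completions.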
Peer brief — for seminar discussion
Martorell & Bianchi ask whether instruction-tuned LLMs can track their own emotive internal states quantitatively across conversational turns—a capability they call quantitative introspection—and whether that capacity is causally grounded in the corresponding internal representations. Using LLaMA-3.2-3B-Instruct as the primary substrate, they generated 40 ten-turn conversations with Gemini 2.5 Flash as a simulated user, trained contrastive mean-difference linear probes for four emotive concept pairs (sad/happy, bored/interested, distracted/focused, impulsive/planning), and at each turn independently queried the model for a numeric self-rating on a 0–9 scale. The central methodological contribution is the logit-based self-report: rather than reading off the greedy or sampled token, they compute a probability-weighted expected value over the digit-token logit distribution, which raises Shannon entropy from 0.03–1.10 bits (greedy) to 3.1–3.7 bits and converts a near-constant output into a continuous signal. An alternative they could have used—but did not—is sparse autoencoder feature activation as the internal-state readout, which would avoid the linearity assumption of probes but requires substantially more compute and white-box access. The load-bearing finding is that logit-based self-reports covary monotonically with probe-defined concept directions at both pooled and turn-by-turn levels (Spearman ρ = 0.40–0.76; isotonic R² = 0.12–0.54 in the 3B model across 400 observations per concept), and this coupling is causal: same-concept activation steering shifts self-reports in the semantically predicted direction with LMM alpha slopes of 0.067–0.40 (all p < 10⁻¹²). A cross-concept steering screen further shows that steering the focus direction while measuring wellbeing introspection raises isotonic R² from 0.34 to 0.75 (ΔR² = 0.30, BH-corrected q ≈ 0.011). 
Scaling to LLaMA-3.1-8B-Instruct pushes wellbeing and interest introspection to R² = 0.90 and 0.93, and mean validated R² increases monotonically from 0.12 (1B) to 0.37 (3B) to 0.61 (8B) (β = 0.29, p = 5.55×10⁻⁹⁹). The phenomenon partially replicates in Qwen 2.5 7B-Instruct (ρ = 0.49, R² = 0.76) but is weaker in Gemma 3 4B-IT (ρ = 0.28, R² = 0.11). The paper's implicit prediction is that logit-based self-report will prove a scalable, black-box complement to probe-based monitoring, growing more reliable as models scale, without requiring internal weight access.
The most contestable aspect is the conflation of probe validity with internal-state validity. The paper operationalizes the emotive internal state as the projection onto a contrastive linear direction trained on completions under opposing system prompts, a probe that, as the authors acknowledge, may capture a mixture of emotive content, persona, style, and other correlated features. When self-report and probe agree, this is taken as convergent evidence for introspection; but if both channels are jointly tracking the same confound (e.g., response verbosity or hedging style induced by the system prompt), the causal steering result alone may not fully disentangle genuine emotive introspection from stylistic covariation. The fact that focus and impulsivity show inverted steering signs in some model sizes and that cross-concept improvements are sparse and pair-specific further suggests the geometry is fragile and not straightforwardly emotive.
A critical reader would push for experiments that hold conversational style constant while independently varying the target emotive state, or that use behavioral outcomes known to be functionally downstream of the emotive concept as an additional ground-truth channel, before concluding that the coupling specifically indexes emotive self-access rather than a broader representational signature.
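The two coupling statistics quoted throughout the brief, Spearman ρ and isotonic R², can be illustrated with a small standard-library sketch. The definitions here are assumptions: ρ is computed as the Pearson correlation of average ranks, and isotonic R² as 1 − SS_res/SS_tot for a non-decreasing pool-adjacent-violators fit of self-report on probe score. The paper's exact estimator (for example its handling of tied probe scores) may differ, and all function names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman correlation as the Pearson correlation of ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def isotonic_fit(x, y):
    """Non-decreasing least-squares fit of y against x (pool adjacent violators)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    blocks = []  # each block holds [sum of y, count]
    for i in order:
        blocks.append([y[i], 1])
        # Merge backwards while the previous block mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted_sorted = [s / c for s, c in blocks for _ in range(c)]
    fitted = [0.0] * len(x)
    for pos, i in enumerate(order):
        fitted[i] = fitted_sorted[pos]
    return fitted


def isotonic_r2(probe_scores, reports):
    """Share of report variance explained by a monotone function of the probe score."""
    fitted = isotonic_fit(probe_scores, reports)
    my = mean(reports)
    ss_res = sum((r - f) ** 2 for r, f in zip(reports, fitted))
    ss_tot = sum((r - my) ** 2 for r in reports)
    return 1 - ss_res / ss_tot
```

Both statistics reward any monotone relationship rather than a strictly linear one, which suits a setting where the mapping from probe score to a 0–9 rating has no reason to be linear.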
Methods (1)
- Logit-based self-report
Primary self-report measure: probability-weighted expected value over all ten digit-token logits, yielding a continuous rating that preserves full distributional signal
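A minimal sketch of the logit-based self-report, assuming the ten logits for the tokens "0" through "9" have already been read off at the rating position. Renormalizing the softmax over only those ten tokens is an assumption of this sketch, and the function name is hypothetical.

```python
import math


def logit_self_report(digit_logits):
    """Expected rating from the logits of digit tokens '0'..'9'.

    Returns (expected_value, greedy_digit, probabilities).
    """
    if len(digit_logits) != 10:
        raise ValueError("expected exactly ten digit-token logits")
    top = max(digit_logits)
    exps = [math.exp(l - top) for l in digit_logits]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    expected = sum(digit * p for digit, p in enumerate(probs))
    greedy = max(range(10), key=lambda d: digit_logits[d])
    return expected, greedy, probs


# Two distributions with the same greedy answer but different expected ratings.
leans_high = [0.0] * 10
leans_high[7], leans_high[8] = 2.0, 1.5
leans_low = [0.0] * 10
leans_low[7], leans_low[2] = 2.0, 1.5
print(logit_self_report(leans_high)[:2])
print(logit_self_report(leans_low)[:2])
```

The two toy inputs both decode greedily to 7, yet their expected values differ because the remaining probability mass sits on different digits. That is the kind of variation greedy decoding hides.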
Frameworks (1)
- Quantitative Introspection Framework
The paper's central contribution: treating LLM numeric self-report as a quantitative signal validated against probe-defined internal states with causal confirmation via steering
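The steering step of the framework (adding a scaled concept vector to the residual stream in a ±2-layer window around the best probe layer) can be illustrated on a toy residual network. This is a sketch only: the "layers" below are stand-in linear functions rather than transformer blocks, and every name, the direction vector, and the starting activation are hypothetical. In a real model the same addition would typically be applied through forward hooks.

```python
def steered_forward(hidden, layers, direction, alpha, best_layer, window=2):
    """Run a toy residual stack, adding alpha * direction inside the layer window."""
    for idx, layer in enumerate(layers):
        update = layer(hidden)
        hidden = [h + u for h, u in zip(hidden, update)]  # residual update
        if abs(idx - best_layer) <= window:
            # Steering intervention: shift the residual stream along the concept direction.
            hidden = [h + alpha * d for h, d in zip(hidden, direction)]
    return hidden


def toy_layer(hidden):
    """Stand-in for a transformer block (assumption: a simple linear map)."""
    return [0.1 * h for h in hidden]


direction = [0.6, 0.8]       # hypothetical unit-norm concept direction
layers = [toy_layer] * 8     # hypothetical 8-layer stack
for alpha in (-4, -2, 0, 2, 4):
    out = steered_forward([1.0, -0.5], layers, direction, alpha, best_layer=4)
    readout = sum(o * d for o, d in zip(out, direction))
    print(alpha, round(readout, 3))
```

In this linear toy the readout along the direction rises monotonically with α by construction. The paper's empirical question is whether the model's own self-report moves the same way when the intervention is applied to a real network.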
Datasets (1)
- 40 ten-turn simulated conversations dataset
Core dataset: 40 ten-turn conversations generated with Gemini 2.5 Flash as user and model under study as assistant, yielding 400 observation points per experimental condition
Findings (34)
- Wellbeing probe-score drift across turns significant at all three LLaMA scales (slopes=0.006, 0.005, 0.013 for 1B, 3B, 8B; all p<10⁻¹⁰); drift magnitude increases with scale
Internal-state drift generalizes across scales; normalized drift also increases significantly with log(model size)
- Logit self-report drift positive for all three LLaMA sizes (turn slopes 0.159, 0.038, 0.141; all p<10⁻²⁰) but does not increase monotonically with scale
Unlike probe drift, report drift magnitude does not follow a clean scaling law; size-slope is negative
- Random direction controls show weak non-significant coupling (ρ=-0.11 to 0.17; R²=0.03–0.11) compared to true probes (∆ρ=0.23–0.79, all p<0.05)
Controls for probe artifacts; demonstrates self-reports carry information specifically about probe-defined concept directions
- Mean validated introspective fidelity across concept-model pairs: R²=0.12 (1B), 0.37 (3B), 0.61 (8B); pooled LMM β=0.29, p=5.55×10⁻⁹⁹
Strong scaling trend for introspective fidelity when excluding invalid steering-sign pairs
- Qwen 2.5 7B turn-wise introspective fidelity: strong at turn 1 (R²≈0.90) but declines significantly to turn 10 (∆R²=-0.44, p=0.001)
Introspective fidelity erodes in Qwen as conversations progress; contrasts with LLaMA-3B trend
- Focus→wellbeing steering: both probe entropy (1.09→1.67 bits) and report entropy (0.88→1.69 bits) increase monotonically with α
Evidence that improved introspection in focus→wellbeing arises from enriched internal state and report channels simultaneously
- Cross-concept steering: impulsivity→interest R² increases from 0.55 (α=-4) to 0.72 (α=+4), ∆R²=0.10, p=0.012 in LLaMA-3.2-3B
Second significant cross-concept introspection improvement; marginal after BH correction (q≈0.066)
- Impulsivity→interest steering: probe entropy increases (LMM slope=0.024, p=2.30×10⁻⁴) but report entropy does not (p=0.11)
Evidence of a bottleneck between richer internal variation and final report distribution in impulsivity→interest condition
- Logit-based self-report achieves 3.1–3.7 bits entropy vs 0.03–1.10 bits greedy and 0.68–2.05 bits sampled in LLaMA-3.2-3B
Quantifies the information gain from using logit-based expected value over greedy or sampled decoding
- Cross-concept steering: focus→wellbeing R² increases from 0.30 (α=-4) to 0.76 (α=+4), ∆R²=0.30, p<0.001 in LLaMA-3.2-3B
Strongest cross-concept introspection improvement; survives BH correction (q≈0.011)
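Several findings above report Benjamini-Hochberg (BH) corrected q-values across the 12 cross-concept cells. A minimal sketch of the standard BH adjustment follows. The function name is hypothetical and the p-values in the usage lines are illustrative placeholders, not the paper's data.

```python
def benjamini_hochberg(p_values):
    """BH-adjusted q-values, returned in the original order of p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q is the minimum over larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Illustrative placeholder p-values (NOT taken from the paper).
example_p = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(example_p))
```

Because the smallest p-value is multiplied by the full number of tests, a result screened across 12 cells needs a raw p-value well below 0.05 to keep q under that threshold.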
Claims (16)
- Basal introspective performance is not always maximal and some failure cases are solvable by representational intervention rather than reflecting complete absence of introspective capacity
Supported by cross-concept steering finding that focus→wellbeing steering dramatically improves introspection
- The paper does not claim these models have conscious felt experience; introspection is defined operationally as causal informational coupling agnostic about consciousness
Explicit scope limitation following Comsa & Shanahan 2025 and McClelland 2024
- Introspective ability can be decomposed into: (i) information available about internal state and (ii) capacity to transform that signal into precise output reports
Conceptual distinction motivated by entropy analyses showing probe and report entropy can diverge under steering
- Models are not merely tracking dialogue context features; same-concept steering shows privileged internal access is necessary to explain self-report shifts
Addresses skeptical alternative that reports reflect only conversational content
- LLM personality self-reports are illusory: post-training alignment creates stable human-like reports dissociated from actual behavior (Han et al. 2025)
Skeptical prior work motivating the need to validate self-reports against internal states rather than taking them at face value
- Introspective ability is concept-specific: quality differs across emotive concepts and the same intervention helps some concepts but not others
Cross-concept steering results; only 2 of 12 non-diagonal cells show significant introspection improvement
- Numeric self-report is a viable, complementary black-box tool for monitoring LLM internal emotive states alongside white-box probe methods
Central practical conclusion; both methods partially track the same latent state but with different failure modes
- When probe and self-report agree and move together causally, confidence in both increases as evidence they track the same underlying state
Convergent validity logic applied to LLM interpretability; probes validate self-reports and vice versa
- The steering-sign test functions as a practical probe-validation criterion: when steering along a probe direction moves self-reports opposite to the semantically predicted direction, the probe's quality is suspect
Methodological contribution: used to exclude focus-1B and impulsivity-8B from scaling analysis
- Even validated probes may capture distributed representations mixing emotive states with correlated features like persona or style
Caveat on probe interpretation; does not negate the introspection result but affects interpretation of the target variable
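The steering-sign criterion among the claims above can be sketched as a simple slope check. This simplification uses an ordinary least-squares slope of self-report on steering strength α, whereas the paper reports mixed-effects (LMM) slopes. The function names and example numbers are hypothetical.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var


def steering_sign_ok(alphas, reports, expected_sign=1):
    """True if self-reports move in the predicted direction as alpha increases.

    expected_sign is +1 when positive alpha should raise the report, -1 otherwise.
    """
    slope = ols_slope(alphas, reports)
    return slope * expected_sign > 0, slope


# Hypothetical mean self-reports at each steering strength.
alphas = [-4, -2, 0, 2, 4]
print(steering_sign_ok(alphas, [3.1, 3.9, 4.6, 5.2, 6.0]))  # predicted direction
print(steering_sign_ok(alphas, [6.0, 5.2, 4.6, 3.9, 3.1]))  # inverted direction
```

A concept-model pair that fails this check would be excluded from downstream fidelity comparisons, as the claim describes for focus-1B and impulsivity-8B.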
Hypotheses (2)
- There may exist a global introspective faculty or steering direction that improves introspection uniformly across all concepts
Framed as an open problem; current evidence only points to local pair-specific improvement
- Introspective capacity may follow a simple monotonic scaling law across all concepts and architectures
The paper treats this as possible but unconfirmed; current evidence shows concept-specific scaling only
Questions (4)
- When self-report changes significantly while a linear probe stays flat, is the probe misspecified or the self-report spurious?
Key interpretive question the framework helps address through convergent validation logic
- Why does introspective capacity vary concept-by-concept and what mechanisms could stabilize it over time?
Open question identified by the paper as direction for future work
- Can instruction-tuned LLMs perform quantitative introspection of emotive states in conversation?
Central research question motivating the entire paper
- If introspective ability exists, can it be improved?
Secondary research question addressed through cross-concept steering experiments
Original abstract
Tracking the internal states of large language models across conversations is important for safety, interpretability, and model welfare, yet current methods are limited. Linear probes and other white-box methods compress high-dimensional representations imperfectly and are harder to apply with increasing model size. Taking inspiration from human psychology, where numeric self-report is a widely used tool for tracking internal states, we ask whether LLMs' own numeric self-reports can track probe-defined emotive states over time. We study four concept pairs (wellbeing, interest, focus, and impulsivity) in 40 ten-turn conversations, operationalizing introspection as the causal informational coupling between a model's self-report and a concept-matched probe-defined internal state. We find that greedy-decoded self-reports collapse outputs to few uninformative values, but introspective capacity can be unmasked by calculating logit-based self-reports. This metric tracks interpretable internal states (Spearman $ρ= 0.40$-$0.76$; isotonic $R^2 = 0.12$-$0.54$ in LLaMA-3.2-3B-Instruct), follows how those states change over time, and activation steering confirms the coupling is causal. Furthermore, we find that introspection is present at turn 1 but evolves through conversation, and can be selectively improved by steering along one concept to boost introspection for another ($ΔR^2$ up to $0.30$). Crucially, these phenomena scale with model size in some cases, approaching $R^2 \approx 0.93$ in LLaMA-3.1-8B-Instruct, and partially replicate in other model families. Together, these results position numeric self-report as a viable, complementary tool for tracking internal emotive states in conversational AI systems.