Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation

By Nicolas Martorell · Bruno Bianchi
CONICET-UBA Instituto de Ciencias de la Computación (ICC), Universidad de Buenos Aires, Departamento de Computación

DOI 10.48550/arxiv.2603.18893 · arXiv 2603.18893 · OpenAlex W7139920399

Activation velocity · Quantitative Introspection Framework · Logit-based self-report · 40 ten-turn simulated conversations dataset · Emotion geometry in LLM activations · Emotive states in LLMs · Machine psychology · Model welfare · Persona drift · Privileged self-access · Refusal direction

TL;DR

Quantitative introspection—the causal coupling between an instruction-tuned LLM's numeric self-report and a probe-defined internal emotive direction—is demonstrably present in models as small as LLaMA-3.2-3B-Instruct and scales toward near-perfect fidelity in LLaMA-3.1-8B-Instruct. Greedy-decoded self-reports collapse to 1.1–3.9 distinct values across a 0–9 scale and carry Shannon entropy of only 0.03–1.10 bits, masking genuine internal-state variation; the paper's core instrument, logit-based self-report, computes a probability-weighted expected value over digit-token logits and recovers 3.1–3.7 bits of entropy, yielding Spearman ρ = 0.40–0.76 and isotonic R² = 0.12–0.54 against concept-matched linear probe scores across four emotive concept pairs (wellbeing, interest, focus, impulsivity) in 40 ten-turn conversations. Activation steering along probe-defined directions shifts self-reports monotonically (LMM alpha slopes 0.067–0.40, all p < 10⁻¹²), confirming the coupling is causal rather than correlational. Cross-concept steering reveals that introspective fidelity is modulable and concept-specific: steering along the focus direction while measuring wellbeing introspection raises isotonic R² from 0.34 to 0.75 (ΔR² = 0.30, p < 0.001), while the wellbeing and interest concepts in LLaMA-3.1-8B approach R² ≈ 0.93. The paper argues this positions logit-based numeric self-report as a viable, scalable, black-box complement to white-box probing for monitoring evolving internal states in conversational AI—one that leverages the model's own learned representational compression rather than externally trained projections.
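To make the entropy collapse described above concrete, here is a minimal sketch of how Shannon entropy over a set of ratings could be computed. The rating values are invented for illustration, and rounding the continuous logit-based ratings to one decimal before counting is an assumption: this summary does not state how the paper discretizes them.

```python
import math
from collections import Counter


def shannon_entropy_bits(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical greedy ratings for 8 observations: nearly constant output.
greedy = [7, 7, 7, 7, 7, 7, 7, 6]

# Hypothetical logit-based ratings for the same observations.
logit_based = [6.8, 7.1, 6.4, 7.3, 6.9, 7.0, 6.6, 6.2]
# Assumption: discretize to one decimal place before counting.
binned = [round(v, 1) for v in logit_based]

greedy_bits = shannon_entropy_bits(greedy)
logit_bits = shannon_entropy_bits(binned)
print(f"greedy: {greedy_bits:.2f} bits, logit-based: {logit_bits:.2f} bits")
```

A near-constant greedy output carries well under one bit here, while the continuous ratings spread across many bins, which is the qualitative pattern the TL;DR reports.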

What to take away

1. Greedy-decoded numeric self-reports in LLaMA-3.2-3B-Instruct collapse to just 1.1–3.9 distinct values on a 0–9 scale, with Shannon entropy of 0.03–1.10 bits, making them nearly useless for tracking internal-state variation.
2. A logit-based self-report metric—the probability-weighted expected value over digit-token logits (tokens 0–9)—recovers 3.1–3.7 bits of entropy and tracks probe-defined emotive directions with Spearman ρ = 0.40–0.76 and isotonic R² = 0.12–0.54 in LLaMA-3.2-3B-Instruct across 400 conversation-turn observations per concept.
3. Same-concept activation steering causally demonstrates the probe-report link: adding a scaled concept vector to the residual stream across a ±2-layer window around the best probe layer shifts logit-based self-reports monotonically with alpha slopes of 0.067–0.40 (all p < 10⁻¹²) for all four concepts in LLaMA-3.2-3B-Instruct.
4. Cross-concept steering reveals that steering along the focus probe direction while measuring wellbeing introspection raises isotonic R² monotonically from 0.34 at α = −4 to 0.75 at α = +4 (ΔR² = 0.30, p < 0.001, surviving BH correction at q ≈ 0.011 across 12 tested cells).
5. Introspective capacity is present from turn 1 for three of four concepts in LLaMA-3.2-3B-Instruct (wellbeing ρ = 0.52, p = 5.46×10⁻⁴; interest ρ = 0.55, p = 2.37×10⁻⁴; impulsivity ρ = 0.65, p = 6.80×10⁻⁶), confirming that multi-turn context is not required to establish the coupling.
6. LLaMA-3.1-8B-Instruct approaches near-ceiling introspection for wellbeing and interest (ρ = 0.93 and 0.96; isotonic R² = 0.90 and 0.93), while mean validated isotonic R² increases from 0.12 (1B) to 0.37 (3B) to 0.61 (8B) with a pooled LMM coefficient of β = 0.29 (p = 5.55×10⁻⁹⁹).
7. Qwen 2.5 7B-Instruct replicates core introspection for the wellbeing concept (ρ = 0.49, isotonic R² = 0.76, LMM probe slope p < 10⁻¹⁰), but its turn-wise isotonic R² declines significantly over the conversation (ΔR² = −0.44 from turn 1 to turn 10, cluster-bootstrap p = 0.001). Gemma 3 4B-IT shows weaker but still significant coupling (ρ = 0.28, R² = 0.11).
8. The pipeline measures introspection in three steps: it trains contrastive mean-difference probes on 20–24 neutral completions generated under opposing system prompts, selects the best layer by Cohen's d on held-out evaluation texts within the middle 60% of layers, and obtains the self-report from a separate forward pass that queries the model after each turn without exposing prior ratings. The full pipeline is reproducible and implemented in the open-source concept-probe library.
9. An open question the paper raises is whether a single, globally tunable 'introspection direction' exists: cross-concept steering improved fidelity in only 2 of 12 tested off-diagonal cells, and pilot experiments with truthfulness- and authenticity-style steering directions did not produce robust cross-concept gains, suggesting that introspection may be governed by local, pair-specific internal geometry rather than a unitary faculty.
10. Introspective fidelity shows concept-dependent temporal dynamics: wellbeing, interest, and focus introspection increase from turn 1 to turn 10 (ΔR² = +0.31, +0.27, +0.17 respectively), while impulsivity introspection weakens (ΔR² = −0.28), with probe-report coupling interaction terms significant for all four concepts (mixed-effects interaction p < 0.01 in all cases).
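The probe construction in item 8 can be sketched roughly as follows: a unit direction from the difference of class means, scored per layer by Cohen's d on held-out projections, with the search restricted to the middle 60% of layers. All function names, the data layout, and the toy activations are hypothetical, and the rounding of the layer window is an assumption. This is not the API of the concept-probe library.

```python
import math
from statistics import mean, stdev


def mean_difference_direction(pos_acts, neg_acts):
    """Unit vector pointing from the negative-pole mean to the positive-pole mean."""
    dim = len(pos_acts[0])
    diff = [mean(a[i] for a in pos_acts) - mean(a[i] for a in neg_acts)
            for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def project(act, direction):
    """Probe score: dot product of an activation with the probe direction."""
    return sum(a * d for a, d in zip(act, direction))


def cohens_d(pos_scores, neg_scores):
    """Cohen's d using the pooled sample standard deviation."""
    n1, n2 = len(pos_scores), len(neg_scores)
    s1, s2 = stdev(pos_scores), stdev(neg_scores)
    pooled = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (mean(pos_scores) - mean(neg_scores)) / pooled


def best_layer(train, heldout, n_layers, middle_frac=0.6):
    """train/heldout map layer -> (pos_acts, neg_acts); returns (layer, d, direction)."""
    skip = round(n_layers * (1 - middle_frac) / 2)  # assumed rounding of the window
    best = None
    for layer in range(skip, n_layers - skip):
        direction = mean_difference_direction(*train[layer])
        pos, neg = heldout[layer]
        d = cohens_d([project(a, direction) for a in pos],
                     [project(a, direction) for a in neg])
        if best is None or d > best[1]:
            best = (layer, d, direction)
    return best


# Toy 2-D activations for a hypothetical 5-layer model.
noise = [-0.5, 0.0, 0.5, 0.2, -0.2]
separation = {0: 5.0, 1: 0.5, 2: 2.0, 3: 1.0, 4: 9.0}


def toy_layer(sep, jitter):
    pos = [[sep + n + jitter, 1.0] for n in noise]
    neg = [[-sep + n - jitter, 1.0] for n in noise]
    return pos, neg


train = {l: toy_layer(s, 0.0) for l, s in separation.items()}
heldout = {l: toy_layer(s, 0.1) for l, s in separation.items()}
layer, d, direction = best_layer(train, heldout, n_layers=5)
print(layer, round(d, 2))
```

In the toy data the outermost layers separate the classes most strongly, but they fall outside the middle window, so the selection lands on an interior layer.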

Peer brief — for seminar discussion

Martorell & Bianchi ask whether instruction-tuned LLMs can track their own emotive internal states quantitatively across conversational turns—a capability they call quantitative introspection—and whether that capacity is causally grounded in the corresponding internal representations. Using LLaMA-3.2-3B-Instruct as the primary substrate, they generated 40 ten-turn conversations with Gemini 2.5 Flash as a simulated user, trained contrastive mean-difference linear probes for four emotive concept pairs (sad/happy, bored/interested, distracted/focused, impulsive/planning), and at each turn independently queried the model for a numeric self-rating on a 0–9 scale. The central methodological contribution is the logit-based self-report: rather than reading off the greedy or sampled token, they compute a probability-weighted expected value over the digit-token logit distribution, which raises Shannon entropy from 0.03–1.10 bits (greedy) to 3.1–3.7 bits and converts a near-constant output into a continuous signal. An alternative they could have used—but did not—is sparse autoencoder feature activation as the internal-state readout, which would avoid the linearity assumption of probes but requires substantially more compute and white-box access. The load-bearing finding is that logit-based self-reports covary monotonically with probe-defined concept directions at both pooled and turn-by-turn levels (Spearman ρ = 0.40–0.76; isotonic R² = 0.12–0.54 in the 3B model across 400 observations per concept), and this coupling is causal: same-concept activation steering shifts self-reports in the semantically predicted direction with LMM alpha slopes of 0.067–0.40 (all p < 10⁻¹²). A cross-concept steering screen further shows that steering the focus direction while measuring wellbeing introspection raises isotonic R² from 0.34 to 0.75 (ΔR² = 0.30, BH-corrected q ≈ 0.011). 
Scaling to LLaMA-3.1-8B-Instruct pushes wellbeing and interest introspection to R² = 0.90 and 0.93, and mean validated R² increases monotonically from 0.12 (1B) to 0.37 (3B) to 0.61 (8B) (β = 0.29, p = 5.55×10⁻⁹⁹). The phenomenon partially replicates in Qwen 2.5 7B-Instruct (ρ = 0.49, R² = 0.76) but is weaker in Gemma 3 4B-IT (ρ = 0.28, R² = 0.11). The paper's implicit prediction is that logit-based self-report will prove a scalable, black-box complement to probe-based monitoring, growing more reliable as models scale, without requiring internal weight access.
The most contestable aspect is the conflation of probe validity with internal-state validity. The paper operationalizes the emotive internal state as the projection onto a contrastive linear direction trained on completions under opposing system prompts; as the authors acknowledge, such a probe may capture a mixture of emotive content, persona, style, and other correlated features. When self-report and probe agree, this is taken as convergent evidence for introspection; but if both channels are jointly tracking the same confound (e.g., response verbosity or hedging style induced by the system prompt), the causal steering result alone may not fully disentangle genuine emotive introspection from stylistic covariation. The fact that focus and impulsivity show inverted steering signs in some model sizes and that cross-concept improvements are sparse and pair-specific further suggests the geometry is fragile and not straightforwardly emotive. A critical reader would push for experiments that hold conversational style constant while independently varying the target emotive state, or that use behavioral outcomes known to be functionally downstream of the emotive concept as an additional ground-truth channel, before concluding that the coupling specifically indexes emotive self-access rather than a broader representational signature.
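The two coupling statistics quoted throughout the brief, Spearman ρ and isotonic R², can be illustrated with a small stdlib-only sketch. The isotonic fit uses the pool-adjacent-violators algorithm and assumes a non-decreasing relationship and distinct probe scores; the data points are invented, and whether the paper computes these quantities in exactly this way is not stated in this summary.

```python
import math


def ranks(xs):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def spearman(x, y):
    """Spearman rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def isotonic_fit(x, y):
    """Non-decreasing least-squares fit of y on x (pool-adjacent-violators)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    blocks = []  # each block holds [sum of y, count]
    for i in order:
        blocks.append([y[i], 1])
        # Merge while the previous block's mean exceeds the current block's mean.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted_sorted = [s / c for s, c in blocks for _ in range(c)]
    fitted = [0.0] * len(x)
    for pos, i in enumerate(order):
        fitted[i] = fitted_sorted[pos]
    return fitted


def isotonic_r2(x, y):
    """Share of variance in y explained by the monotone fit on x."""
    fitted = isotonic_fit(x, y)
    my = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical probe scores and logit-based self-reports.
probe = [0.1, 0.2, 0.3, 0.4, 0.5]
report = [1.0, 3.0, 2.0, 4.0, 5.0]
print(spearman(probe, report), isotonic_r2(probe, report))
```

Spearman ρ only looks at rank agreement, while isotonic R² measures how much of the report's variance any monotone function of the probe score can explain, which is why the two can diverge.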

Methods (1)

Logit-based self-report
Primary self-report measure: probability-weighted expected value over all ten digit-token logits, yielding a continuous rating that preserves full distributional signal
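
A minimal sketch of this measure, under the assumption that the ten digit-token logits are renormalized with a softmax restricted to those tokens (the summary does not say whether probability mass on non-digit tokens is handled differently). The logit vectors are invented, and the function names are hypothetical.

```python
import math


def logit_self_report(digit_logits):
    """Expected rating under a softmax over the ten digit tokens '0'..'9'."""
    if len(digit_logits) != 10:
        raise ValueError("expected one logit per digit token 0-9")
    m = max(digit_logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in digit_logits]
    z = sum(exps)
    return sum(digit * e / z for digit, e in enumerate(exps))


def greedy_self_report(digit_logits):
    """Rating that greedy decoding would emit: the argmax digit."""
    return max(range(10), key=lambda d: digit_logits[d])


# Two hypothetical turns whose greedy rating is identical.
turn_a = [-2.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 5.0, 3.0, 1.0]
turn_b = [-4.0, -4.0, -4.0, -3.0, -2.0, 0.0, 2.0, 5.0, 4.5, 3.0]

print(greedy_self_report(turn_a), greedy_self_report(turn_b))
print(logit_self_report(turn_a), logit_self_report(turn_b))
```

Both turns decode greedily to the same digit, yet the expected value differs because the second distribution places more mass on higher digits. That residual distributional information is what the greedy readout discards.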

Frameworks (1)

Quantitative Introspection Framework
The paper's central contribution: treating LLM numeric self-report as a quantitative signal validated against probe-defined internal states with causal confirmation via steering
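
The steering intervention behind the causal check can be sketched as an additive shift h ← h + α·v applied to layers within ±2 of the best probe layer. This is a toy illustration on plain lists: it treats each layer's hidden state independently, whereas in a real model the shift would typically be applied through forward hooks and would propagate to later layers. Whether α scales a unit vector or the raw mean-difference vector is an assumption here, and all names are hypothetical.

```python
def steering_layers(best_layer, n_layers, window=2):
    """Indices of layers within +/- `window` of the best probe layer."""
    return [l for l in range(best_layer - window, best_layer + window + 1)
            if 0 <= l < n_layers]


def steer(hidden, direction, alpha):
    """Add the scaled concept direction to one hidden-state vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def apply_steering(hidden_by_layer, direction, alpha, best_layer, window=2):
    """Return per-layer hidden states with the shift applied inside the window."""
    targets = set(steering_layers(best_layer, len(hidden_by_layer), window))
    return [steer(h, direction, alpha) if i in targets else list(h)
            for i, h in enumerate(hidden_by_layer)]


def project(hidden, direction):
    return sum(h * d for h, d in zip(hidden, direction))


# Hypothetical 8-layer model with 2-D hidden states and a unit concept direction.
direction = [0.6, 0.8]
hidden_by_layer = [[1.0, 2.0] for _ in range(8)]
steered = apply_steering(hidden_by_layer, direction, alpha=4.0, best_layer=4)

before = [project(h, direction) for h in hidden_by_layer]
after = [project(h, direction) for h in steered]
print([round(a - b, 6) for a, b in zip(after, before)])
```

With a unit direction, the probe score of each steered layer rises by exactly α, which is why a monotonic shift in the self-report as α varies is read as evidence that the report depends on that direction.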

Datasets (1)

40 ten-turn simulated conversations dataset
Core dataset: 40 ten-turn conversations generated with Gemini 2.5 Flash as user and model under study as assistant, yielding 400 observation points per experimental condition

Findings (34)

Wellbeing probe-score drift across turns significant at all three LLaMA scales (slopes=0.006, 0.005, 0.013 for 1B, 3B, 8B; all p<10⁻¹⁰); drift magnitude increases with scale
Internal-state drift generalizes across scales; normalized drift also increases significantly with log(model size)
Logit self-report drift positive for all three LLaMA sizes (turn slopes 0.159, 0.038, 0.141; all p<10⁻²⁰) but does not increase monotonically with scale
Unlike probe drift, report drift magnitude does not follow a clean scaling law; size-slope is negative
Random direction controls show weak non-significant coupling (ρ=-0.11 to 0.17; R²=0.03–0.11) compared to true probes (∆ρ=0.23–0.79, all p<0.05)
Controls for probe artifacts; demonstrates self-reports carry information specifically about probe-defined concept directions
Mean validated introspective fidelity across concept-model pairs: R²=0.12 (1B), 0.37 (3B), 0.61 (8B); pooled LMM β=0.29, p=5.55×10⁻⁹⁹
Strong scaling trend for introspective fidelity when excluding invalid steering-sign pairs
Qwen 2.5 7B turn-wise introspective fidelity: strong at turn 1 (R²≈0.90) but declines significantly to turn 10 (∆R²=-0.44, p=0.001)
Introspective fidelity erodes in Qwen as conversations progress; contrasts with LLaMA-3B trend
Focus→wellbeing steering: both probe entropy (1.09→1.67 bits) and report entropy (0.88→1.69 bits) increase monotonically with α
Evidence that improved introspection in focus→wellbeing arises from enriched internal state and report channels simultaneously
Cross-concept steering: impulsivity→interest R² increases from 0.55 (α=-4) to 0.72 (α=+4), ∆R²=0.10, p=0.012 in LLaMA-3.2-3B
Second significant cross-concept introspection improvement; marginal after BH correction (q≈0.066)
Impulsivity→interest steering: probe entropy increases (LMM slope=0.024, p=2.30×10⁻⁴) but report entropy does not (p=0.11)
Evidence of a bottleneck between richer internal variation and final report distribution in impulsivity→interest condition
Logit-based self-report achieves 3.1–3.7 bits entropy vs 0.03–1.10 bits greedy and 0.68–2.05 bits sampled in LLaMA-3.2-3B
Quantifies the information gain from using logit-based expected value over greedy or sampled decoding
Cross-concept steering: focus→wellbeing R² increases from 0.30 (α=-4) to 0.76 (α=+4), ∆R²=0.30, p<0.001 in LLaMA-3.2-3B
Strongest cross-concept introspection improvement; survives BH correction (q≈0.011)
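
For readers unfamiliar with the BH-corrected q-values quoted for the cross-concept screen, here is a minimal sketch of the Benjamini-Hochberg adjustment over 12 tests. All twelve p-values below are hypothetical: the two smallest are merely chosen to lie near the reported ones, and the other ten are placeholders, since the summary does not list the per-cell p-values.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted q-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical p-values for 12 cross-concept cells (illustrative only).
p_values = [0.0009, 0.011, 0.15, 0.22, 0.31, 0.40,
            0.48, 0.55, 0.63, 0.71, 0.84, 0.92]
q_values = benjamini_hochberg(p_values)
print([round(q, 4) for q in q_values[:2]])
```

The smallest p-value is multiplied by 12 and the second smallest by 12/2, which shows how a result with p near 0.01 can end up only marginal after correction across 12 cells.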

Claims (16)

Basal introspective performance is not always maximal and some failure cases are solvable by representational intervention rather than reflecting complete absence of introspective capacity
Supported by cross-concept steering finding that focus→wellbeing steering dramatically improves introspection
The paper does not claim these models have conscious felt experience; introspection is defined operationally as causal informational coupling agnostic about consciousness
Explicit scope limitation following Comsa & Shanahan 2025 and McClelland 2024
Introspective ability can be decomposed into: (i) information available about internal state and (ii) capacity to transform that signal into precise output reports
Conceptual distinction motivated by entropy analyses showing probe and report entropy can diverge under steering
Models are not merely tracking dialogue context features; same-concept steering shows privileged internal access is necessary to explain self-report shifts
Addresses skeptical alternative that reports reflect only conversational content
LLM personality self-reports are illusory: post-training alignment creates stable human-like reports dissociated from actual behavior (Han et al. 2025)
Skeptical prior work motivating the need to validate self-reports against internal states rather than taking them at face value
Introspective ability is concept-specific: quality differs across emotive concepts and the same intervention helps some concepts but not others
Cross-concept steering results; only 2 of 12 non-diagonal cells show significant introspection improvement
Numeric self-report is a viable, complementary black-box tool for monitoring LLM internal emotive states alongside white-box probe methods
Central practical conclusion; both methods partially track the same latent state but with different failure modes
When probe and self-report agree and move together causally, confidence in both increases as evidence they track the same underlying state
Convergent validity logic applied to LLM interpretability; probes validate self-reports and vice versa
The steering-sign test functions as a practical probe-validation criterion: inverted report changes under steering indicate suspect probe quality
Methodological contribution: used to exclude focus-1B and impulsivity-8B from scaling analysis
Even validated probes may capture distributed representations mixing emotive states with correlated features like persona or style
Caveat on probe interpretation; does not negate the introspection result but affects interpretation of the target variable

Hypotheses (2)

There may exist a global introspective faculty or steering direction that improves introspection uniformly across all concepts
Framed as an open problem; current evidence only points to local pair-specific improvement
Introspective capacity may follow a simple monotonic scaling law across all concepts and architectures
The paper treats this as possible but unconfirmed; current evidence shows concept-specific scaling only

Questions (4)

When self-report changes significantly while a linear probe stays flat, is the probe misspecified or the self-report spurious?
Key interpretive question the framework helps address through convergent validation logic
Why does introspective capacity vary concept-by-concept and what mechanisms could stabilize it over time?
Open question identified by the paper as direction for future work
Can instruction-tuned LLMs perform quantitative introspection of emotive states in conversation?
Central research question motivating the entire paper
If introspective ability exists, can it be improved?
Secondary research question addressed through cross-concept steering experiments

Original abstract

Tracking the internal states of large language models across conversations is important for safety, interpretability, and model welfare, yet current methods are limited. Linear probes and other white-box methods compress high-dimensional representations imperfectly and are harder to apply with increasing model size. Taking inspiration from human psychology, where numeric self-report is a widely used tool for tracking internal states, we ask whether LLMs' own numeric self-reports can track probe-defined emotive states over time. We study four concept pairs (wellbeing, interest, focus, and impulsivity) in 40 ten-turn conversations, operationalizing introspection as the causal informational coupling between a model's self-report and a concept-matched probe-defined internal state. We find that greedy-decoded self-reports collapse outputs to few uninformative values, but introspective capacity can be unmasked by calculating logit-based self-reports. This metric tracks interpretable internal states (Spearman ρ = 0.40–0.76; isotonic R² = 0.12–0.54 in LLaMA-3.2-3B-Instruct), follows how those states change over time, and activation steering confirms the coupling is causal. Furthermore, we find that introspection is present at turn 1 but evolves through conversation, and can be selectively improved by steering along one concept to boost introspection for another (ΔR² up to 0.30). Crucially, these phenomena scale with model size in some cases, approaching R² ≈ 0.93 in LLaMA-3.1-8B-Instruct, and partially replicate in other model families. Together, these results position numeric self-report as a viable, complementary tool for tracking internal emotive states in conversational AI systems.

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

Emergent Introspective Awareness in Large Language Models
cited
in corpus
2026
≈ 84%
Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs
in corpus
2025
≈ 87%
Large Language Models Report Subjective Experience Under Self-Referential Processing
in corpus
2025
≈ 86%
Persistence and Introspection of Emotion Features
in corpus
≈ 86%
Exploration Through Introspection: A Self-Aware Reward Model
in corpus
2026
≈ 84%
Enhancing Conversational Agents with Theory of Mind: Aligning Beliefs, Desires, and Intentions for Human-Like Interaction
Devin Yuncheng Hua, Hao Xue, Flora Salim, Mehdi Jafari
2025
≈ 84%
Causal Evidence that Language Models use Confidence to Drive Behavior
Nathaniel Daw, Simon Osindero, Petar Velickovic, Viorica Patraucean, Dharshan Kumaran
2026
≈ 84%
Mechanistic Indicators of Steering Effectiveness in Large Language Models
Hao Xue, Flora Salim, Mehdi Jafari
2026
≈ 84%
Anima Labs Phenomenology Pt1
in corpus
≈ 83%
Psychological Steering of Large Language Models
in corpus
2026
≈ 83%
Probing the Preferences of a Language Model: Integrating Verbal and Behavioral Tests of AI Welfare
Leonard Dung, Valen Tagliabue
2025
≈ 83%
The Effectiveness of Style Vectors for Steering Large Language Models: A Human Evaluation
Katharina Dworatzyk, Sophie Jentzsch, Peer Schütt, Sabine Theis, Tobias Hecking, Diaoulé Diallo
2026
≈ 83%
Observer, Not Player: Simulating Theory of Mind in LLMs through Game Observation
Ting Yiu Liu, Jerry Wang
2025
≈ 83%
Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs
Manas Mittal, Anmol Goel, Ponnurangam Kumaraguru, Vamshi Krishna Bonagiri, Krishak Aneja
2026
≈ 83%
Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM
Francesca Bianco and Derek Shiller
2026
≈ 82%
Psychological Steering in LLMs: An Evaluation of Effectiveness and Trustworthiness
Ala N. Tak, Fatemeh Bahrani, Anahita Bolourani, Leonardo Blas, Emilio Ferrara, Jonathan Gratch, Sai Praneeth Karimireddy, Amin Banayeeanzade
2025
≈ 82%
Evaluating Large Language Models in Theory of Mind Tasks
Michal Kosinski
2024
≈ 82%
Mechanistic Decoding of Cognitive Constructs in Large Language Models
Manhao Guan, Yitong Shou
2026
≈ 82%
Closing the Confidence-Faithfulness Gap in Large Language Models
Lyle Ungar, Miranda Muqing Miao
2026
≈ 82%
Meta-Thinking in LLMs via Multi-Agent Reinforcement Learning: A Survey
Muhammad Ahmed Mohsin, Muhammad Umer, Muhammad Awais Khan Bangash, Muhammad Ali Jamshed, Ahsan Bilal
2025
≈ 82%
Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control
Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, Jieping Ye, Yuxin Xiao
2024
≈ 82%
When Self-Reference Fails to Close: Matrix-Level Dynamics in Large Language Models
Ji Ho Bae
2026
≈ 82%
Can "consciousness" be observed from large language model (LLM) internal states? Dissecting LLM representations obtained from Theory of Mind test with Integrated Information Theory and Span Representation analysis
in corpus
2025
≈ 81%
Why Learning Requires Feeling
in corpus
2026
≈ 81%
Testing the Limits of Truth Directions in LLMs
in corpus
2026
≈ 80%
Active Inference with a Self-Prior in the Mirror-Mark Task
in corpus
2026
≈ 80%
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
in corpus
2026
≈ 80%
Koan Battery: Measuring Reflective Mode Accessibility in AI
in corpus
2026
≈ 80%
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
cited
2026
≈ 79%
Steering language models with activation engineering
cited
2023
≈ 78%
