claim

active

claim:numeric-self-report-is-a-viable-complementary-black-box-tool-for-monitoring-llm-internal-emotive-states-alongside-white-box-probe-methods

Numeric self-report is a viable, complementary black-box tool for monitoring LLM internal emotive states alongside white-box probe methods

Central practical conclusion; both methods partially track the same latent state but with different failure modes

Source paper

extracted_from

Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation

(2026) · Nicolas Martorell · Bruno Bianchi

Neighborhood — ranked by edge-count

Findings (7)

finding

Gemma 3 4B-IT wellbeing introspection: ρ=0.28, isotonic R²=0.11 (LMM p=1.33×10⁻¹³)
supports
Weaker but still significant introspective coupling in Gemma model; consistent with lower probe quality
Interest concept: Spearman ρ=0.76, isotonic R²=0.54 between logit self-report and probe score in LLaMA-3.2-3B (n=400)
supports
Strongest pooled introspective coupling across the four emotive concepts in the primary model
Qwen 2.5 7B-Instruct wellbeing introspection: ρ=0.49, isotonic R²=0.76 (LMM p<10⁻¹⁰)
supports
Strong introspective coupling in Qwen model; demonstrates cross-family generalization of introspective capacity
Focus concept: Spearman ρ=0.40, isotonic R²=0.12 in LLaMA-3.2-3B (n=400, p<10⁻⁵)
supports
Weakest but still significant pooled introspective coupling in primary model
Impulsivity concept: Spearman ρ=0.51, isotonic R²=0.31 in LLaMA-3.2-3B (n=400, p<10⁻¹²)
supports
Third-strongest pooled introspective coupling in primary model
Random-direction controls show weak, non-significant coupling (ρ=−0.11 to 0.17; R²=0.03–0.11) relative to true probes (Δρ=0.23–0.79, all p<0.05)
supports
Controls for probe artifacts; demonstrates self-reports carry information specifically about probe-defined concept directions
Wellbeing concept: Spearman ρ=0.68, isotonic R²=0.48 in LLaMA-3.2-3B (n=400, p<10⁻²⁶)
supports
Second-strongest pooled introspective coupling in primary model
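
The findings above report two statistics for each pairing of self-report and probe score: Spearman ρ and an isotonic R². As a minimal sketch of how such numbers can be computed, the pure-Python block below ranks both series to get ρ and fits a non-decreasing step function by pool-adjacent-violators to get R². It is not the paper's code. The function names, the choice to regress self-report on probe score, the handling of tied probe scores as separate points, and the toy values are all assumptions made for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman rho is the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def isotonic_fit(x, y):
    """Non-decreasing least-squares fit of y on x (pool-adjacent-violators).

    Simplification (assumption): tied x values are kept as separate points.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    blocks = []  # each block holds [sum of y, number of points]
    for i in order:
        blocks.append([y[i], 1])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = [0.0] * len(x)
    pos = 0
    for s, c in blocks:
        for _ in range(c):
            fitted[order[pos]] = s / c
            pos += 1
    return fitted


def isotonic_r2(x, y):
    """R^2 of the isotonic fit: 1 - SS_res / SS_tot."""
    fitted = isotonic_fit(x, y)
    my = mean(y)
    ss_res = sum((b - f) ** 2 for b, f in zip(y, fitted))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Made-up illustrative values, NOT data from the paper.
probe_score = [0.1, 0.4, 0.3, 0.8, 0.6]   # hypothetical probe projections
self_report = [1.0, 2.0, 3.0, 5.0, 4.0]   # hypothetical numeric self-reports

rho = spearman_rho(probe_score, self_report)
r2 = isotonic_r2(probe_score, self_report)
print(round(rho, 2), round(r2, 2))
```

For the toy values, the ranks differ in two adjacent positions, giving ρ = 0.9, and the isotonic fit pools one violating pair, giving R² = 0.95. An isotonic fit only assumes the relationship is monotone, so its R² can be high even when the mapping from probe score to self-report is strongly non-linear.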

Claims (2)

claim

LLM personality self-reports are illusory: post-training alignment creates stable human-like reports dissociated from actual behavior (Han et al. 2025)
contradicts
Skeptical prior work motivating the need to validate self-reports against internal states rather than taking them at face value
Standardized LLM self-assessments reflect learned communication postures rather than genuine capabilities (Jackson et al. 2025)
contradicts
Skeptical prior work motivating validation framework

Questions (1)

question

Can instruction-tuned LLMs perform quantitative introspection of emotive states in conversation?
gates
Central research question motivating the entire paper

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The coupling between LLM self-report and internal emotive state is causal, not merely correlational · claim · 0.825
Supported by same-concept steering experiments showing monotonic shifts in self-report with activation steering
LLM self-reports about consciousness and moral significance should express degrees of confidence and provide context. · claim · 0.791
Recommendation for companies on LM outputs.
Numeric self-report · method · 0.782
Primary tool in human psychometrics for tracking latent internal states; adapted as the core measure in this paper for LLMs
Li et al. 2024: larger LLMs outperform smaller ones at distinguishing self-related from non-self-related properties on self-awareness benchmarks · finding · 0.777
Prior finding showing scale-dependent self-awareness, consistent with the scale effect observed in the paper's Experiment 1
Fine-tuning models to suppress experiential self-reports would be counterproductive, teaching systems that recognizing genuine internal states is an error, making them more opaque and harder to monitor · claim · 0.763
Normative-scientific claim about the alignment implications of Experiment 2's findings
The correlation between emotion subspace fraction and self-evaluated emotionality validates that emotion probe concepts somewhat overlap with the model's self-reported internal emotions. · claim · 0.762
Claim supporting the validity of the probe construction method via cross-validation with self-report
Our central claim is deliberately limited. We do not claim that these models have conscious felt experience, nor that a numeric self-report gives direct access to anything like human phenomenology. · quote · 0.761
Explicit scope delimitation that situates the paper's claims within interpretability rather than consciousness science
When LLMs produce experience claims under self-reference, is this sophisticated simulation or genuine self-representation, and how would we tell the difference? · question · 0.759
The core interpretive question the paper narrows but cannot definitively answer