claim

active

claim:when-probe-and-self-report-agree-and-move-together-causally-confidence-in-both-increases-as-evidence-they-track-the-same-underlying-state

When probe and self-report agree and move together causally, confidence in both increases as evidence they track the same underlying state

Convergent validity logic applied to LLM interpretability; probes validate self-reports and vice versa
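
The convergent-validity check described above can be sketched in a few lines. This is an illustrative sketch only, not the source paper's procedure: the function names, the toy numbers, and the `min_r` threshold are all hypothetical. It assumes we already have per-turn probe readouts, matching numeric self-reports, and self-reports collected under several steering strengths along the probe direction.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant (a sketch choice)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # e.g. a flat probe while the self-report moves
    return cov / sqrt(vx * vy)


def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((x - mx) ** 2 for x in xs)
    if vx == 0:
        raise ValueError("steering strengths must vary")
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / vx


def convergent_evidence(probe, report, steer_strengths, steered_reports, min_r=0.5):
    """Combine agreement (correlation) with causal co-movement (steering slope).

    min_r is a hypothetical cutoff chosen for illustration only.
    """
    r = pearson(probe, report)
    s = slope(steer_strengths, steered_reports)
    return {"agreement_r": r, "steering_slope": s, "convergent": r >= min_r and s > 0}


# Toy data, invented for illustration.
result = convergent_evidence(
    probe=[0.1, 0.4, 0.5, 0.9],
    report=[1, 3, 4, 8],
    steer_strengths=[-2, -1, 0, 1, 2],
    steered_reports=[1, 2, 4, 6, 7],
)
```

Requiring both conditions mirrors the claim: agreement alone could come from a shared confound, while a self-report that shifts in the expected direction under steering adds causal evidence that both measures track the same state.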

Source paper

extracted_from

Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation

(2026) · Nicolas Martorell · Bruno Bianchi

Neighborhood — ranked by edge-count

Concepts (1)

concept

Convergent validity logic
implements
Framework borrowed from human metacognition research: when probe and self-report agree, confidence in both increases as they partially track the same underlying state

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

When self-report changes significantly while a linear probe stays flat, is the probe misspecified or the self-report spurious? · question · 0.819
Key interpretive question the framework helps address through convergent validation logic
If steering in a purported concept direction does not shift self-report in the expected direction, probe quality becomes suspect, especially when conventional probe metrics alone looked acceptable. · quote · 0.819
Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics
The relationship between persistence and self-evaluated emotionality serves as a replication of probe-based findings without shared confounds from probe construction · claim · 0.815
Claims that agentic self-evaluation provides independent convergent evidence for the emotion-persistence link
A probe may achieve high performance even on representations that are not causally relevant for the task · claim · 0.814
Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
Even validated probes may capture distributed representations mixing emotive states with correlated features like persona or style · claim · 0.792
Caveat on probe interpretation; does not negate the introspection result but affects interpretation of the target variable
Performing the mirror-of-the-self test gradually brings the observer closer to contact with the original mind. · claim · 0.787
The reciprocal effect: doing the test deepens self-knowledge and judgment.
Models are not merely tracking dialogue context features; same-concept steering shows privileged internal access is necessary to explain self-report shifts · claim · 0.785
Addresses skeptical alternative that reports reflect only conversational content
Are high-accuracy probe representations also causally relevant for the task? · question · 0.784
Question raised by the discrepancy between DAS IIA and linear probe accuracy in Case Study II