claim
active
claim:when-probe-and-self-report-agree-and-move-together-causally-confidence-in-both-increases-as-evidence-they-track-the-same-underlying-state
When probe and self-report agree and move together causally, confidence in both increases as evidence they track the same underlying state
Convergent validity logic applied to LLM interpretability; probes validate self-reports and vice versa
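The convergent-validity check can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and does not come from the source paper: the function names `pearson` and `convergent_check`, the 0.8 threshold, and the toy numbers. It assumes `steering` holds the strengths of an intervention applied to the model, so that co-movement with `steering` stands in for the "move together causally" part of the claim.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def convergent_check(steering, probe, report, threshold=0.8):
    """Hypothetical check: do probe readout and self-report agree with each
    other, and do both shift with the intervention strength?"""
    agreement = pearson(probe, report)        # probe vs. self-report
    probe_moves = pearson(steering, probe)    # probe follows the intervention
    report_moves = pearson(steering, report)  # self-report follows it too
    return {
        "agreement": agreement,
        "probe_moves": probe_moves,
        "report_moves": report_moves,
        # Assumed decision rule: all three correlations must clear the threshold.
        "convergent": min(agreement, probe_moves, report_moves) >= threshold,
    }

# Toy data (made up): five intervention strengths, probe readouts,
# and self-reported ratings on a 1-5 scale.
steering = [-2.0, -1.0, 0.0, 1.0, 2.0]
probe = [-1.9, -1.2, 0.1, 0.8, 2.1]
report = [1.0, 2.0, 3.0, 4.0, 5.0]

result = convergent_check(steering, probe, report)
print(result)
```

A passing check raises confidence in both measures without proving either one. Both could still track a shared confound, which is why the claim is framed as increased confidence rather than validation.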
Source paper
extracted_from · (2026) · Nicolas Martorell · Bianchi, Bruno
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Convergent validity logic · implements · Framework borrowed from human metacognition research: when probe and self-report agree, confidence in both increases as they partially track the same underlying state
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Key interpretive question the framework helps address through convergent validation logic
- Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics
- Claims that agentic self-evaluation provides independent convergent evidence for emotion-persistence link
- Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
- Caveat on probe interpretation; does not negate the introspection result but affects interpretation of the target variable
- The reciprocal effect: doing the test deepens self-knowledge and judgment.
- Addresses skeptical alternative that reports reflect only conversational content
- Question raised by the discrepancy between DAS IIA and linear probe accuracy in Case Study II