question
active
question:when-self-report-changes-significantly-while-a-linear-probe-stays-flat-is-the-probe-misspecified-or-the-self-report-spurious
When self-report changes significantly while a linear probe stays flat, is the probe misspecified or the self-report spurious?
Key interpretive question the framework helps address through convergent validation logic
Source paper
extracted_from (2026) · Nicolas Martorell · Bruno Bianchi
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics
- Convergent validity logic applied to LLM interpretability; probes validate self-reports and vice versa
- Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
- Skeptical prior work motivating the need to validate self-reports against internal states rather than taking them at face value
- Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
- Geometric evidence for convergence to stable truth directions only for simpler tasks
- The paper's critique of the standard linear steering baseline, supported by the days-of-week demo
- Claims that agentic self-evaluation provides independent convergent evidence for emotion-persistence link