question

active

When self-report changes significantly while a linear probe stays flat, is the probe misspecified or the self-report spurious?

Key interpretive question the framework helps address through convergent validation logic

Source paper

extracted_from

Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation

(2026) · Nicolas Martorell · Bruno Bianchi

Neighborhood — ranked by edge-count

Papers (1)

paper

Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
associated_with

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

If steering in a purported concept direction does not shift self-report in the expected direction, probe quality becomes suspect, especially when conventional probe metrics alone looked acceptable. [quote · 0.824]
Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics
When probe and self-report agree and move together causally, confidence in both increases, since this is evidence that they track the same underlying state. [claim · 0.819]
Convergent validity logic applied to LLM interpretability; probes validate self-reports and vice versa
A probe may achieve high performance even on representations that are not causally relevant for the task. [claim · 0.762]
Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
LLM personality self-reports are illusory: post-training alignment creates stable human-like reports dissociated from actual behavior (Han et al. 2025). [claim · 0.762]
Skeptical prior work motivating the need to validate self-reports against internal states rather than taking them at face value
Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions that are more causally implicated in model outputs. [claim · 0.758]
Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
For simple factual tasks F0-F3, probe directions show a sharp geometric transition in middle layers, with late-layer probes converging to high cosine similarity; A3 and F4-F5 show no clear transition. [finding · 0.757]
Geometric evidence for convergence to stable truth directions only for simpler tasks.
Linear steering is often mismatched with a model's internal representation geometry, producing noisy, off-target effects. [claim · 0.757]
The paper's critique of the standard linear steering baseline, supported by the days-of-week demo.
The relationship between persistence and self-evaluated emotionality serves as a replication of probe-based findings without shared confounds from probe construction. [claim · 0.756]
Claims that agentic self-evaluation provides independent convergent evidence for the emotion-persistence link
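
The situation this question describes can be sketched on synthetic data. The example below is a minimal illustration, not the source paper's method: it fits a difference-in-mean probe, then compares the per-turn slope of the probe readout with the slope of a simulated self-report. All names (diff_in_means_direction, slope_across_turns), the dimensions, the noise level, and the data itself are assumptions made for illustration. A probe pointed along the wrong direction reads flat even though the underlying state and the self-report both move, which is what a misspecified probe would look like. A flat readout alone cannot distinguish that case from a spurious self-report; a causal check such as steering is still needed.

```python
import numpy as np


def diff_in_means_direction(acts_pos, acts_neg):
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    direction = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return direction / np.linalg.norm(direction)


def slope_across_turns(values):
    """Least-squares slope of a per-turn series against the turn index."""
    turns = np.arange(len(values))
    return float(np.polyfit(turns, values, 1)[0])


# Synthetic setup (all values hypothetical): a latent state rises over 20 turns
# and is written into activations along a single axis.
rng = np.random.default_rng(0)
dim, n_turns, noise = 16, 20, 0.05
true_dir = np.zeros(dim)
true_dir[0] = 1.0
state = np.linspace(-1.0, 1.0, n_turns)
acts = state[:, None] * true_dir + noise * rng.normal(size=(n_turns, dim))
self_report = state + noise * rng.normal(size=n_turns)

# Well-specified probe: fitted on labeled examples that differ along true_dir.
pos = true_dir + noise * rng.normal(size=(50, dim))
neg = -true_dir + noise * rng.normal(size=(50, dim))
good_dir = diff_in_means_direction(pos, neg)

# Misspecified probe: a unit direction orthogonal to the one carrying the state.
bad_dir = np.zeros(dim)
bad_dir[1] = 1.0

slope_report = slope_across_turns(self_report)
slope_good = slope_across_turns(acts @ good_dir)
slope_bad = slope_across_turns(acts @ bad_dir)

print(f"self-report slope per turn:    {slope_report:.3f}")
print(f"well-specified probe slope:    {slope_good:.3f}")
print(f"misspecified probe slope:      {slope_bad:.3f}")
```

In this toy setting the self-report is faithful by construction, so the flat readout is entirely the probe's fault. Reversing the construction (holding the state fixed while the self-report drifts) would give the same flat-probe, moving-report pattern, which is why the convergent-validation entries above pair agreement with a causal intervention.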