claim
active
Even validated probes may capture distributed representations mixing emotive states with correlated features like persona or style
Caveat on probe interpretation: it does not negate the introspection result, but it affects how the target variable should be interpreted
Source paper
extracted_from · (2026) · Nicolas Martorell · Bruno Bianchi
Neighborhood — ranked by edge-count
Methods (1)
method
- Probe construction method: the concept vector at each layer is the L2-normalized difference between the mean positive and mean negative representations elicited by contrastive system prompts
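The probe construction described above can be sketched as follows. This is a minimal illustration, not the source paper's code: the array layout (prompts × layers × hidden size), the function names `concept_vectors` and `probe_scores`, and the synthetic activations are all assumptions made for the example.

```python
import numpy as np

def concept_vectors(pos, neg, eps=1e-12):
    """Per-layer concept vectors from contrastive activations.

    pos, neg: arrays of shape (n_prompts, n_layers, d_model) holding
    representations collected under the positive and negative system
    prompts (hypothetical layout, assumed for this sketch).
    Returns an array of shape (n_layers, d_model) with unit L2 norm per layer.
    """
    diff = pos.mean(axis=0) - neg.mean(axis=0)             # (n_layers, d_model)
    norms = np.linalg.norm(diff, axis=-1, keepdims=True)   # (n_layers, 1)
    return diff / np.maximum(norms, eps)                   # guard against zero norm

def probe_scores(acts, vectors):
    """Project one set of activations (n_layers, d_model) onto each layer's vector."""
    return np.einsum("ld,ld->l", acts, vectors)            # (n_layers,)

# Synthetic stand-in data: 8 positive and 6 negative prompts, 4 layers, 16 dims.
rng = np.random.default_rng(0)
pos = rng.normal(loc=0.5, size=(8, 4, 16))
neg = rng.normal(loc=-0.5, size=(6, 4, 16))

vecs = concept_vectors(pos, neg)
scores = probe_scores(pos[0], vecs)
```

Because the vector is a plain difference of means, any feature that co-varies with the contrastive prompts (for example persona or style) contributes to it as well, which is the interpretive caveat this claim records.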
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Core empirical claim distinguishing emotion persistence from generic high-variance probe persistence
- Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
- Claim supporting the validity of the probe construction method via cross-validation with self-report
- Convergent validity logic applied to LLM interpretability; probes validate self-reports and vice versa
- Claims that agentic self-evaluation provides independent convergent evidence for emotion-persistence link
- Load-bearing description of the core pernicious divergence mechanism illustrated in Figure 1
- Rules out measurement artifact explanation for the persistence finding
- The double standard pointed out by S&C and endorsed by the authors.