claim
active
claim:the-correlation-between-emotion-subspace-fraction-and-self-evaluated-emotionality-validates-that-emotion-probe-concepts-somewhat-overlap-with-the-model-s-self-reported-internal-emotions
The correlation between emotion subspace fraction and self-evaluated emotionality validates that emotion probe concepts somewhat overlap with the model's self-reported internal emotions.
Claim supporting the validity of the probe construction method via cross-validation with self-report
Source paper
extracted_from: Scott Sauers · Imago · Janus · Antra Tessera
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (1)
finding
- Demonstrates partial but reliable validity of self-evaluation for measuring probe emotionality
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Claims that agentic self-evaluation provides independent convergent evidence for emotion-persistence link
- Validates that agentic self-evaluation captures genuine emotional content of probes
- Finding that the two evaluation modalities frequently diverge in their interpretation of the same SAE feature
- Core empirical claim distinguishing emotion persistence from generic high-variance probe persistence
- Explains why variance correction is needed to see the self-evaluation–persistence relationship
- Correlation between self-evaluation and textual evaluation of SAE feature emotionality: rho=+0.051 (n.s.) · finding · 0.800 · Shows that the two evaluation methods for emotionality are largely uncorrelated, indicating they capture different signals
- Question addressed by testing whether self-evaluation transcripts mentioning emotion words have higher cosine similarity to corresponding probes
- Caveat on probe interpretation; does not negate the introspection result but affects interpretation of the target variable