claim

active

claim:the-correlation-between-emotion-subspace-fraction-and-self-evaluated-emotionality-validates-that-emotion-probe-concepts-somewhat-overlap-with-the-model-s-self-reported-internal-emotions

The correlation between emotion subspace fraction and self-evaluated emotionality validates that emotion probe concepts somewhat overlap with the model's self-reported internal emotions.

Claim supporting the validity of the probe construction method by cross-checking probe results against the model's self-report

Source paper

extracted_from

Persistence and Introspection of Emotion Features

Scott Sauers · Imago · Janus · Antra Tessera

Neighborhood — ranked by edge-count

Papers (1)

paper

Persistence and Introspection of Emotion Features
introduces

Findings (1)

finding

For 17 of the 83 emotions tested, SAE features whose self-evaluation transcripts mention the emotion word have significantly higher cosine similarity to that emotion's probe; the association is positive for 67 of 83.
supports
Demonstrates partial but reliable validity of self-evaluation for measuring probe emotionality
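
One way to check an association of this kind for a single emotion is a permutation test on the difference in mean cosine similarity between features whose transcripts mention the emotion word and features whose transcripts do not. The sketch below is illustrative only: this page does not state which statistical test the paper used, and the helper name `mention_association`, the one-sided test, and the toy data are all assumptions.

```python
import random
from statistics import mean


def mention_association(mentions, cosines, n_perm=2000, seed=0):
    """Permutation test for one emotion (hypothetical helper, not the paper's code).

    mentions: list of bool, True if a feature's self-evaluation transcript
              mentions the emotion word.
    cosines:  list of float, cosine similarity of each feature to the probe.
    Returns (observed mean difference, one-sided p-value).
    """
    if len(mentions) != len(cosines):
        raise ValueError("mentions and cosines must have the same length")
    n_mention = sum(mentions)
    if n_mention == 0 or n_mention == len(mentions):
        raise ValueError("need at least one feature in each group")

    def mean_diff(values, flags):
        hit = [v for v, f in zip(values, flags) if f]
        miss = [v for v, f in zip(values, flags) if not f]
        return mean(hit) - mean(miss)

    observed = mean_diff(cosines, mentions)
    rng = random.Random(seed)
    shuffled = list(mentions)
    at_least_as_large = 0
    for _ in range(n_perm):
        # Shuffling the labels breaks any real link between mention and cosine.
        rng.shuffle(shuffled)
        if mean_diff(cosines, shuffled) >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the p-value strictly positive.
    p_value = (at_least_as_large + 1) / (n_perm + 1)
    return observed, p_value


# Toy data, invented for illustration only.
toy_mentions = [True, True, True, True, False, False, False, False, False, False]
toy_cosines = [0.31, 0.27, 0.35, 0.29, 0.05, 0.08, 0.02, 0.11, 0.04, 0.07]
diff, p = mention_association(toy_mentions, toy_cosines)
print(f"mean difference = {diff:.3f}, one-sided p = {p:.4f}")
```

Repeating such a test across all 83 emotions would yield per-emotion p-values; any multiple-comparison correction applied in the paper is not described on this page.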

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The relationship between persistence and self-evaluated emotionality serves as a replication of probe-based findings without shared confounds from probe construction
claim · 0.820
Claims that agentic self-evaluation provides independent convergent evidence for emotion-persistence link

17 of 83 tested emotions show significant association between self-eval transcript word mention and cosine similarity to emotion probe
finding · 0.815
Validates that agentic self-evaluation captures genuine emotional content of probes

Text-based and self-steered emotionality ratings are only weakly correlated (ρ = +0.051, n.s.), suggesting they measure different aspects of feature emotionality.
claim · 0.813
Finding that the two evaluation modalities frequently diverge in their interpretation of the same SAE feature

Emotion probes are more persistent than variance-matched random probes, indicating emotion-specific persistence beyond autoregressive dynamics.
claim · 0.810
Core empirical claim distinguishing emotion persistence from generic high-variance probe persistence

Self-evaluated emotionality of SAE features negatively correlates with activation variance explained (ρ = -0.184, p = 4.6e-09), requiring variance correction to reveal the persistence signal.
finding · 0.800
Explains why variance correction is needed to see the self-evaluation–persistence relationship

Correlation between self-evaluation and textual evaluation of SAE feature emotionality: ρ = +0.051 (n.s.)
finding · 0.800
Shows that the two evaluation methods for emotionality are largely uncorrelated, indicating they capture different signals

Is self-evaluation successful in measuring emotion?
question · 0.795
Question addressed by testing whether self-evaluation transcripts mentioning emotion words have higher cosine similarity to corresponding probes

Even validated probes may capture distributed representations mixing emotive states with correlated features like persona or style
claim · 0.794
Caveat on probe interpretation; does not negate the introspection result but affects interpretation of the target variable
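
The selection rule for this list (cosine similarity of at least 0.65 and no typed edge to the current entity) can be sketched as follows. This is a minimal illustration, not the site's implementation: the embedding vectors, the entity ids, and the `related_by_similarity` helper are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above threshold that lack a typed edge.

    embeddings:  dict mapping entity id -> embedding vector (hypothetical).
    typed_edges: set of frozenset({a, b}) pairs that already share a typed edge.
    """
    target = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked, so not a candidate for a new edge
        score = cosine(target, vec)
        if score >= threshold:
            results.append((other_id, score))
    # Highest similarity first, matching the descending scores in the list.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy vectors, invented for illustration only.
toy_embeddings = {
    "claim:a": [1.0, 0.0, 0.0],
    "claim:b": [0.9, 0.1, 0.0],    # close to claim:a, no typed edge
    "finding:c": [0.8, 0.6, 0.0],  # close to claim:a, but already linked
    "claim:d": [0.0, 1.0, 0.0],    # orthogonal to claim:a
}
toy_edges = {frozenset(("claim:a", "finding:c"))}
neighbors = related_by_similarity("claim:a", toy_embeddings, toy_edges)
print(neighbors)
```

Entities that pass the threshold but already carry a typed edge are skipped, which is why a finding linked by `supports` would not reappear here under the same id.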