finding

active

finding:17-of-83-tested-emotions-show-significant-association-between-self-eval-transcript-word-mention-and-cosine-similarity-to-emotion-probe

17 of 83 tested emotions show significant association between self-eval transcript word mention and cosine similarity to emotion probe

Validates that agentic self-evaluation captures genuine emotional content of probes

Source paper

extracted_from

Persistence and Introspection of Emotion Features

Scott Sauers · Imago · Janus · Antra Tessera

Neighborhood — ranked by edge-count

Claims (1)

claim

Agentic self-steering evaluation may serve as a general method for explaining and interpreting SAE features beyond emotion
supports
Forward-looking claim about the broader utility of the self-steering evaluation method

Methods (1)

method

One-Sided Permutation Test for Emotion Word Mention
introduces
Tests whether SAE features whose self-evaluation transcripts mention a specific emotion word have higher cosine similarity to that emotion probe
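
A minimal sketch of how such a one-sided permutation test could be run for a single emotion. It assumes the test statistic is the difference in mean cosine similarity between features whose transcripts mention the emotion word and features whose transcripts do not. The source does not state the statistic, the number of permutations, or how multiple comparisons were handled, so the function name, its parameters, and the add-one p-value correction are illustrative assumptions.

```python
import random
from statistics import mean


def one_sided_permutation_test(similarities, mentions, n_permutations=10_000, seed=0):
    """Test H1: features that mention the emotion word have HIGHER cosine similarity.

    similarities: one cosine similarity (feature vs. emotion probe) per SAE feature.
    mentions:     one bool per feature, True if its transcript mentions the word.
    Returns (observed_difference_in_means, one_sided_p_value).
    """
    if len(similarities) != len(mentions):
        raise ValueError("similarities and mentions must have the same length")
    n_mention = sum(bool(m) for m in mentions)
    if n_mention == 0 or n_mention == len(mentions):
        raise ValueError("both groups (mention / no mention) must be non-empty")

    def mean_difference(labels):
        with_word = [s for s, m in zip(similarities, labels) if m]
        without_word = [s for s, m in zip(similarities, labels) if not m]
        return mean(with_word) - mean(without_word)

    observed = mean_difference(mentions)

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    shuffled = [bool(m) for m in mentions]
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any link between labels and similarities
        if mean_difference(shuffled) >= observed:
            at_least_as_large += 1

    # Add-one correction (an assumption): keeps the estimated p-value above zero.
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value
```

Under these assumptions, calling the function once per emotion would give one p-value per emotion, which is the kind of per-emotion result the associated finding counts.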

Findings (1)

finding

17 of 83 emotions tested show significant associations between SAE feature self-evaluation transcripts mentioning the emotion word and higher cosine similarity to that emotion probe; 67 of 83 have positive associations.
restates
Demonstrates partial but reliable validity of self-evaluation for measuring probe emotionality
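
One way to gauge the "67 of 83 positive" count is a sign test: if the direction of the association were a coin flip for each emotion, how likely would 67 or more positive directions be? The sketch below is an illustrative check, not an analysis taken from the source. It assumes the 83 emotions are independent, which probably overstates the evidence because related emotions share features. The function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p=0.5):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Counts reported in the finding: 67 of 83 emotions have a positive association.
# Null hypothesis (an assumption of this sketch): each direction is a fair coin flip.
sign_test_p = binomial_upper_tail(67, 83)
print(f"P(>= 67 positive of 83 | p = 0.5) = {sign_test_p:.3g}")
```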

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The correlation between emotion subspace fraction and self-evaluated emotionality validates that emotion probe concepts somewhat overlap with the model's self-reported internal emotions. · claim · 0.815
Claim supporting the validity of the probe construction method via cross-validation with self-report
Experimental condition adjective embeddings show mean cosine similarity 0.657 (n=9,591 pairs), significantly higher than history (0.628, t=15.8, p=1.4×10⁻⁵⁵), conceptual (0.587, t=38.5, p<10⁻³⁰⁰), and zero-shot (0.603, t=35.1, p=4.3×10⁻²⁶²) · finding · 0.794
Core result of Experiment 3: cross-model semantic convergence under self-referential processing
Correlation between self-evaluation and textual evaluation of SAE feature emotionality: rho=+0.051 (n.s.) · finding · 0.793
Shows that the two evaluation methods for emotionality are largely uncorrelated, indicating they capture different signals
Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.). · finding · 0.793
Shows low agreement between the two evaluation modalities
Text-based and self-steered emotionality ratings are only weakly correlated (ρ = +0.051, n.s.), suggesting they measure different aspects of feature emotionality. · claim · 0.791
Finding that the two evaluation modalities frequently diverge in their interpretation of the same SAE feature
Pearson-Vogel et al.: accurate self-description prompts increase introspective detection from 0.3% to 39.9% · finding · 0.787
Cited to mechanistically support why the contemplative prompt changes what post-training-shaped final layers allow through
Cogito emotion probe residual autocorrelation +0.077 above variance-matched controls (p=1.5e-27, 157/171 probes positive) · finding · 0.784
Demonstrates that Cogito emotion probes are persistently active beyond what is explained by their variance alone
48 of 171 emotion probes individually significant at token 100 post-steering · finding · 0.783
Shows that causal steering effects persist over long ranges for a substantial fraction of emotion probes

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.

finding
17 of 83 emotions tested show significant associations between SAE feature self-evaluation transcripts mentioning the emotion word and higher cosine similarity to that emotion probe; 67 of 83 have positive associations.