finding
active
finding:17-of-83-tested-emotions-show-significant-association-between-self-eval-transcript-word-mention-and-cosine-similarity-to-emotion-probe
17 of 83 tested emotions show significant association between self-eval transcript word mention and cosine similarity to emotion probe
Validates that agentic self-evaluation captures genuine emotional content of probes
Source paper
extracted_from: Scott Sauers · Imago · Janus · Antra Tessera
Neighborhood — ranked by edge-count
Claims (1)
claim
- Forward-looking claim about the broader utility of the self-steering evaluation method
Methods (1)
method
- Tests whether SAE features whose self-evaluation transcripts mention a specific emotion word have higher cosine similarity to that emotion probe
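The comparison described in this method can be sketched as follows. This is a minimal illustration, not the paper's implementation: the source does not say which statistical test was used, so a one-sided permutation test on the difference of mean cosine similarity stands in for it. The function name `mention_association`, the `(direction_vector, transcript_text)` input layout, and the substring-based mention check are all assumptions made for the example.

```python
import random
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def mention_association(features, probe, word, n_perm=2000, seed=0):
    """Hypothetical helper: compare probe similarity of mentioning vs. non-mentioning features.

    features: list of (direction_vector, transcript_text) pairs (assumed layout).
    probe:    direction vector of the emotion probe.
    word:     emotion word to look for in each self-evaluation transcript.
    Returns (observed mean difference, one-sided permutation p-value).
    """
    sims = [cosine(vec, probe) for vec, _ in features]
    # Crude case-insensitive substring match; a real pipeline would
    # likely tokenize (e.g. "joy" also matches "enjoy" here).
    mentions = [word.lower() in text.lower() for _, text in features]
    if not any(mentions) or all(mentions):
        raise ValueError("need both mentioning and non-mentioning features")

    def gap(labels):
        hit = [s for s, m in zip(sims, labels) if m]
        miss = [s for s, m in zip(sims, labels) if not m]
        return mean(hit) - mean(miss)

    observed = gap(mentions)
    rng = random.Random(seed)
    shuffled = list(mentions)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between mention and similarity
        if gap(shuffled) >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (at_least_as_large + 1) / (n_perm + 1)
    return observed, p_value
```

Running a test of this kind once per emotion and counting the emotions that clear a chosen significance threshold would yield a tally of the same form as the "17 of 83" headline. The threshold and any multiple-comparison correction are not stated in this entry.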
Findings (1)
finding
- Demonstrates partial but reliable validity of self-evaluation for measuring probe emotionality
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Claim supporting the validity of the probe construction method via cross-validation with self-report
- Core result of Experiment 3: cross-model semantic convergence under self-referential processing
- Correlation between self-evaluation and textual evaluation of SAE feature emotionality: rho=+0.051 (n.s.) · finding · 0.793 · Shows that the two evaluation methods for emotionality are largely uncorrelated, indicating they capture different signals
- Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.). · finding · 0.793 · Shows low agreement between the two evaluation modalities
- Finding that the two evaluation modalities frequently diverge in their interpretation of the same SAE feature
- Pearson-Vogel et al.: accurate self-description prompts increase introspective detection from 0.3% to 39.9% · finding · 0.787 · Cited to mechanistically support why the contemplative prompt changes what post-training-shaped final layers allow through
- Demonstrates that Cogito emotion probes are persistently active beyond what is explained by their variance alone
- Shows that causal steering effects persist over long ranges for a substantial fraction of emotion probes
Restated by (1)
cosine ≥ 0.90
Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.