method
active
method:one-sided-permutation-test-for-emotion-word-mention
One-Sided Permutation Test for Emotion Word Mention
Tests whether SAE features whose self-evaluation transcripts mention a specific emotion word have higher cosine similarity to that emotion's probe than features whose transcripts do not mention it.
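A minimal sketch of such a test, under stated assumptions: the test statistic is taken to be the difference in mean cosine similarity between mentioning and non-mentioning features, which this entry does not confirm. The function name `one_sided_permutation_test`, its parameters, and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def one_sided_permutation_test(similarities, mentions, n_permutations=10_000, seed=0):
    """One-sided permutation test (hypothetical helper, not the source's code).

    similarities: cosine similarity of each SAE feature to one emotion probe.
    mentions: True where the feature's self-evaluation transcript mentions
              the emotion word.
    Returns (observed difference in means, one-sided p-value).
    """
    similarities = np.asarray(similarities, dtype=float)
    mentions = np.asarray(mentions, dtype=bool)
    if mentions.all() or not mentions.any():
        raise ValueError("both groups must contain at least one feature")

    rng = np.random.default_rng(seed)

    # Assumed statistic: mean(mentioning) - mean(non-mentioning).
    observed = similarities[mentions].mean() - similarities[~mentions].mean()

    # Shuffle the group labels and count how often the shuffled statistic
    # is at least as large as the observed one (upper tail only).
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(mentions)
        stat = similarities[shuffled].mean() - similarities[~shuffled].mean()
        if stat >= observed:
            count += 1

    # The +1 terms count the observed labelling itself, so p is never zero.
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value


# Toy data, invented for illustration only.
sims = [0.90, 0.80, 0.85, 0.10, 0.20, 0.15, 0.05, 0.12]
flags = [True, True, True, False, False, False, False, False]
diff, p = one_sided_permutation_test(sims, flags, n_permutations=2000)
print(f"difference in means = {diff:.3f}, one-sided p = {p:.4f}")
```

Because the test is one-sided, only permutations with a statistic at least as large as the observed one count against the null; a strongly negative difference yields a p-value near 1 rather than a small one. Running the test for several emotion words would call for a multiple-testing correction on the resulting p-values.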
Neighborhood — ranked by edge-count
Findings (1)
finding
- Validates that agentic self-evaluation captures genuine emotional content of probes
Methods (1)
method
- Multiple testing correction applied to significance tests of emotion persistence and self-evaluation word associations
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Statistical test used to evaluate whether SAE features mentioning an emotion word have higher cosine similarity to that emotion probe
- Statistical test used to assess significance of introspective agent improvement over no-pain baseline
- Suggestive evidence for language-independent truth representation in LLMs
- The iterative method Alexander uses to make design decisions: compare two versions and ask which is more a picture of one's own eternal self, repeating until convergence.
- Cross-model consistency of the condition ordering in Experiment 4
- Proposed mechanistic explanation for why emotion features are more persistent
- Patching experiments localize truth representations to these specific hidden states in LLaMA-2 models
- Surprising finding that the two evaluation methods diverge in their relationship with persistence