One-Sided Permutation Test for Emotion Word Mention

Tests whether SAE features whose self-evaluation transcripts mention a specific emotion word have higher cosine similarity to that emotion's probe than features whose transcripts do not mention it
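
The test described above can be sketched as follows. This is a minimal illustration, not the original analysis code: it assumes the test statistic is the difference in mean cosine similarity between features that mention the word and those that do not, and the function name, the number of permutations, and the seed are all hypothetical choices.

```python
import random


def one_sided_permutation_test(similarities, mentions, n_permutations=10_000, seed=0):
    """Return (observed_diff, p_value) for the one-sided alternative
    mean(similarity | mention) > mean(similarity | no mention).

    similarities: one cosine similarity to the emotion probe per feature.
    mentions: one bool per feature, True if its transcript mentions the word.
    The statistic (difference in means) is an assumption of this sketch.
    """
    if len(similarities) != len(mentions):
        raise ValueError("similarities and mentions must have the same length")
    n_mention = sum(bool(flag) for flag in mentions)
    if n_mention == 0 or n_mention == len(mentions):
        raise ValueError("both groups must contain at least one feature")

    def mean_diff(flags):
        mention = [s for s, f in zip(similarities, flags) if f]
        other = [s for s, f in zip(similarities, flags) if not f]
        return sum(mention) / len(mention) - sum(other) / len(other)

    observed = mean_diff(mentions)

    # Shuffle the mention labels to build the null distribution: under the
    # null hypothesis the labels are exchangeable across features.
    rng = random.Random(seed)
    shuffled = list(mentions)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if mean_diff(shuffled) >= observed:
            at_least_as_large += 1

    # The +1 terms count the observed labelling itself, so p is never 0.
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value
```

Only permutations whose statistic is at least as large as the observed one count toward the p-value, which is what makes the test one-sided: a feature group with lower similarity than the rest yields a p-value near 1 rather than a small one.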

Neighborhood — ranked by edge-count

Findings (1)

finding

17 of 83 tested emotions show significant association between self-eval transcript word mention and cosine similarity to emotion probe
introduces
Validates that agentic self-evaluation captures genuine emotional content of probes

Methods (1)

method

Benjamini-Hochberg FDR correction
uses
Multiple testing correction applied to significance tests of emotion persistence and self-evaluation word associations
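
With one test per emotion, the resulting p-values need a joint correction. The standard Benjamini-Hochberg step-up procedure can be sketched as below; the function name and the 0.05 level are assumptions of this sketch, since the source does not state the FDR level used.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of bools, True where the hypothesis is rejected
    under the Benjamini-Hochberg step-up procedure at FDR level alpha.

    alpha=0.05 is a placeholder; the level used in the analysis is not given.
    """
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank

    # Reject every hypothesis ranked at or below max_rank, including ones
    # that individually missed their own threshold (the step-up property).
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected
```

Applied to a list of per-emotion p-values, `sum(benjamini_hochberg(p_values))` gives the number of emotions that remain significant after correction.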

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

One-sided permutation test · method · 0.846
Statistical test used to evaluate whether SAE features mentioning an emotion word have higher cosine similarity to that emotion probe
One-Sided Paired-Samples t-test · method · 0.750
Statistical test used to assess significance of introspective agent improvement over no-pain baseline
With unrestricted vocabulary, models occasionally respond in non-English Yes/No equivalents (e.g., Sí, Nein) after truth-direction interventions · finding · 0.722
Suggestive evidence for language-independent truth representation in LLMs
Pair Comparison / Which-is-more-like-my-eternal-self Test · method · 0.715
The iterative method Alexander uses to make design decisions: compare two versions and ask which is more a picture of one's own eternal self, repeating until convergence.
Self-awareness score ordering in Experiment 4: History < Conceptual < Zero-Shot < Experimental, consistent across model families · finding · 0.708
Cross-model consistency of the condition ordering in Experiment 4
Emotion may refer to a state, and more stateful concepts in general tend to be more persistent across tokens than non-stateful ones · claim · 0.708
Proposed mechanistic explanation for why emotion features are more persistent
A small group of hidden states (group b) over end-of-sentence punctuation tokens is highly causally implicated in truth judgments · finding · 0.707
Patching experiments localize truth representations to these specific hidden states in LLaMA-2 models
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions · claim · 0.707
Surprising finding that the two evaluation methods diverge in their relationship with persistence