claim
active
claim:sae-based-persistence-replication-of-probe-based-findings-no-shared-probe-confounds
SAE-based persistence replication of probe-based findings (no shared probe confounds)
The SAE-based self-evaluation persistence finding replicates the probe-based results while sharing none of the potential probe-construction confounds.
Source paper
extracted_from: Scott Sauers · Imago · Janus · Antra Tessera
Neighborhood — ranked by edge-count
Papers (1)
paper
Claims (1)
claim
- Core empirical claim distinguishing emotion persistence from generic high-variance probe persistence
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Claims that agentic self-evaluation provides independent convergent evidence for emotion-persistence link
- Extension of mechanistic interpretability findings to the metacognitive domain
- Surprising finding that the two evaluation methods diverge in their relationship with persistence
- Claim that feature grounding enables interpretability metrics.
- P(feature fires at t+100 | fired at t) minus P(feature fires at t+100 | did not fire at t), used because SAE features are binary unlike probe activations
- Rules out measurement artifact explanation for the persistence finding
- SAE feature emotion subspace overlap correlates with persistence in Cogito: Spearman +0.413, p=4.4e-196 · finding · 0.749 · Demonstrates that SAE features more aligned with the emotion subspace are more persistent in Cogito after variance control
- Strong positive relationship between emotion alignment and SAE feature persistence in Cogito
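
The persistence measure described in the list above, P(feature fires at t+100 | fired at t) minus P(feature fires at t+100 | did not fire at t), can be sketched for a single binary SAE feature as follows. This is a minimal illustration, not the source paper's code: the function name `persistence_delta`, the 1-D boolean input layout, and the choice to return NaN when one of the two conditions never occurs are all assumptions made here.

```python
import numpy as np


def persistence_delta(fires, lag=100):
    """Difference of conditional firing rates for one binary feature.

    fires: 1-D sequence of 0/1 (or bool) values, one per token position
           (hypothetical layout, assumed for this sketch).
    lag:   offset in positions; the entry above uses 100.

    Returns P(fires at t+lag | fired at t) - P(fires at t+lag | did not fire at t),
    or NaN if the feature always or never fires in the conditioning window.
    """
    fires = np.asarray(fires, dtype=bool)
    if fires.ndim != 1 or fires.size <= lag:
        raise ValueError("need a 1-D sequence longer than the lag")

    now = fires[:-lag]    # firing state at position t
    later = fires[lag:]   # firing state at position t + lag

    # Both conditional probabilities must be defined.
    if now.all() or not now.any():
        return float("nan")

    p_after_fire = later[now].mean()      # P(fires at t+lag | fired at t)
    p_after_silent = later[~now].mean()   # P(fires at t+lag | did not fire at t)
    return float(p_after_fire - p_after_silent)
```

A value near +1 means firing at t strongly predicts firing at t+lag, a value near 0 means the feature's later state is unrelated to its earlier state, and negative values mean firing at t makes later firing less likely. The difference form suits binary features because it subtracts out each feature's base firing rate.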