claim

active

claim:sae-based-persistence-replication-of-probe-based-findings-no-shared-probe-confounds

SAE-based persistence replication of probe-based findings (no shared probe confounds)

The SAE self-evaluation persistence finding replicates the probe-based results without sharing any confound that probe construction could introduce.

Source paper

extracted_from

Persistence and Introspection of Emotion Features

Scott Sauers · Imago · Janus · Antra Tessera

Neighborhood — ranked by edge-count

Papers (1)

paper

Persistence and Introspection of Emotion Features
introduces

Claims (1)

claim

Emotion probes are more persistent than variance-matched random probes, indicating emotion-specific persistence beyond autoregressive dynamics.
supports
Core empirical claim distinguishing emotion persistence from generic high-variance probe persistence

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The relationship between persistence and self-evaluated emotionality serves as a replication of probe-based findings without shared confounds from probe construction · claim · 0.807
Claims that agentic self-evaluation provides independent convergent evidence for the emotion-persistence link
SAEs can surface features relevant to meta-cognitive monitoring, not just object-level content representation · claim · 0.779
Extension of mechanistic interpretability findings to the metacognitive domain
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions. · claim · 0.777
Surprising finding that the two evaluation methods diverge in their relationship with persistence
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. · claim · 0.774
Claim that feature grounding enables interpretability metrics.
SAE Feature Conditional Firing Persistence Metric · method · 0.759
P(feature fires at t+100 | fired at t) minus P(feature fires at t+100 | did not fire at t), used because SAE features are binary unlike probe activations
Persistence is not an artifact of probe construction because lower (more central) emotion PCs are more persistent than noisier high-rank PCs · claim · 0.752
Rules out measurement artifact explanation for the persistence finding
SAE feature emotion subspace overlap correlates with persistence in Cogito: Spearman +0.413, p=4.4e-196 · finding · 0.749
Demonstrates that SAE features more aligned with the emotion subspace are more persistent in Cogito after variance control
SAE emotion subspace overlap correlates with variance-residualized persistence in Cogito: Spearman +0.413, p = 4.4e-196. · finding · 0.747
Strong positive relationship between emotion alignment and SAE feature persistence in Cogito
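
The "SAE Feature Conditional Firing Persistence Metric" entry above defines persistence as P(feature fires at t+100 | fired at t) minus P(feature fires at t+100 | did not fire at t). A minimal sketch of that difference for a single feature follows. It assumes the firing record is available as a one-dimensional boolean array over token positions. The function name, the input layout, and the choice to return NaN when either condition has no samples are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def conditional_firing_persistence(fired, lag=100):
    """Return P(fires at t+lag | fired at t) - P(fires at t+lag | did not fire at t).

    `fired` is a hypothetical 1-D boolean sequence: True where the SAE
    feature is active at a token position. `lag` defaults to 100 to match
    the t+100 offset in the metric's description.
    """
    fired = np.asarray(fired, dtype=bool)
    if fired.ndim != 1:
        raise ValueError("fired must be a 1-D sequence")
    if lag <= 0 or fired.size <= lag:
        raise ValueError("lag must be positive and shorter than the sequence")

    now = fired[:-lag]    # firing state at position t
    later = fired[lag:]   # firing state at position t + lag

    # Both conditional probabilities need at least one sample to condition on.
    if now.all() or not now.any():
        return float("nan")

    p_after_fire = later[now].mean()      # P(fires at t+lag | fired at t)
    p_after_quiet = later[~now].mean()    # P(fires at t+lag | did not fire at t)
    return float(p_after_fire - p_after_quiet)
```

A value near 0 means that firing at t carries no information about firing at t+lag, while positive values indicate persistence. Subtracting the "did not fire" term removes the feature's base firing rate, which matters because SAE features are binary and differ widely in how often they fire.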