SAE Feature Conditional Firing Persistence Metric

P(feature fires at t+100 | fired at t) minus P(feature fires at t+100 | did not fire at t). This difference of conditional firing probabilities is used because SAE features are binary (fired or did not fire), unlike probe activations.
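The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the original analysis code: the function name `firing_persistence`, the single-sequence input, and the idea that "fired" means a boolean per-token indicator (for example, activation > 0) are all assumptions made here for clarity.

```python
import numpy as np

def firing_persistence(fired, lag=100):
    """Difference of conditional firing probabilities for one feature.

    fired: 1-D boolean array, one entry per token position in a single
           sequence (hypothetical input format; True = feature fired).
    lag:   offset between the conditioning position t and the position t+lag.
    Returns P(fires at t+lag | fired at t) - P(fires at t+lag | did not fire at t),
    or NaN when either conditional probability is undefined.
    """
    fired = np.asarray(fired, dtype=bool)
    if fired.ndim != 1 or fired.size <= lag:
        return float("nan")

    now = fired[:-lag]    # firing indicator at position t
    later = fired[lag:]   # firing indicator at position t + lag

    n_on = int(now.sum())
    n_off = now.size - n_on
    if n_on == 0 or n_off == 0:
        # One of the two conditioning events never occurs.
        return float("nan")

    p_given_fired = later[now].mean()
    p_given_silent = later[~now].mean()
    return float(p_given_fired - p_given_silent)
```

Subtracting the "did not fire" term means a feature that simply fires often does not get a high score from its base rate alone. Pooling counts across many sequences, rather than scoring a single sequence, would be a natural extension but is not shown here.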

Neighborhood — ranked by edge-count

Findings (1)

finding

SAE feature emotion subspace overlap correlates with persistence in Cogito: Spearman +0.413, p=4.4e-196
introduces
Demonstrates that SAE features more aligned with the emotion subspace are more persistent in Cogito after variance control
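For readers unfamiliar with the statistic reported in this finding, a Spearman correlation is the Pearson correlation of rank-transformed values. The sketch below is only an illustration of that computation. The function names are hypothetical, and it does not reproduce the variance control or the p-value from the finding.

```python
import numpy as np

def average_ranks(x):
    """1-based ranks, with tied values sharing the average of their ranks."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    sorted_x = x[order]
    # For sorted input, return_index gives the start of each tie group.
    _, start, counts = np.unique(sorted_x, return_index=True, return_counts=True)
    mean_rank = start + (counts - 1) / 2.0 + 1.0
    ranks = np.empty(x.size)
    ranks[order] = np.repeat(mean_rank, counts)
    return ranks

def spearman_rho(a, b):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    return float(np.corrcoef(ra, rb)[0, 1])
```

In the setting of this finding, `a` would hold each feature's emotion-subspace overlap and `b` its persistence score, one entry per SAE feature.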

Methods (1)

method

SAE feature firing probability persistence metric
related_to
Persistence metric for SAE features: P(fires at t+100 | fired at t) minus P(fires at t+100 | did not fire at t)

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions. · claim · 0.761
Surprising finding that the two evaluation methods diverge in their relationship with persistence
SAE-based persistence replication of probe-based findings (no shared probe confounds) · claim · 0.759
The SAE self-evaluation persistence finding replicates the probe-based results while sharing none of their potential probe-construction confounds
SAE feature steering in history, conceptual, and zero-shot control conditions produces zero experience reports under either suppression or amplification · finding · 0.744
Shows that the gating effect is specific to the self-referential computational regime, not a general feature effect
SAE features · concept · 0.739
The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
SAE feature #92372 (fires 666,235 times in corpus) modulates a dimension related to urgency/pressure vs. patience/spaciousness in Kimi K2.5. · finding · 0.738
Highly active SAE feature with broad emotional modulation and large corpus presence
emotion feature persistence · concept · 0.735
The phenomenon that emotion feature activations remain elevated above baseline beyond local token bursts, measurable as long-range correlation
Sparse Autoencoders (SAE) activation-based paradigm · framework · 0.734
Standard interpretability approach that VPD critiques and to which it proposes an alternative.
Sequential SAE Activation Analysis · method · 0.733
Token-level analysis of OTD and backtracking latent activations aligned at correction points across episodes