method
active
method:sae-feature-conditional-firing-persistence-metric
SAE Feature Conditional Firing Persistence Metric
Difference of conditional firing probabilities: P(feature fires at t+100 | fired at t) minus P(feature fires at t+100 | did not fire at t). Used because SAE features are binary (they either fire or do not), unlike probe activations.
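The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation. The function names `binarize` and `firing_persistence`, the firing threshold of 0, and the choice to return NaN when one of the conditions never occurs are all assumptions made here. The source does not say what unit t counts (tokens are a plausible reading) or whether counts are pooled across sequences, so the sketch handles a single sequence.

```python
from typing import Sequence


def binarize(activations: Sequence[float], threshold: float = 0.0) -> list[bool]:
    """Turn raw SAE activations into fire / no-fire flags.

    Assumption: a feature "fires" when its activation exceeds `threshold`.
    """
    return [a > threshold for a in activations]


def firing_persistence(fired: Sequence[bool], lag: int = 100) -> float:
    """P(fires at t+lag | fired at t) - P(fires at t+lag | did not fire at t)."""
    if lag <= 0:
        raise ValueError("lag must be positive")

    # Pair each position t with position t+lag; the last `lag` positions
    # have no partner and are dropped.
    pairs = list(zip(fired[:-lag], fired[lag:]))

    later_given_fired = [bool(later) for now, later in pairs if now]
    later_given_silent = [bool(later) for now, later in pairs if not now]

    # If the feature always or never fires at t, one conditional
    # probability is undefined (assumption: report NaN in that case).
    if not later_given_fired or not later_given_silent:
        return float("nan")

    p_fired = sum(later_given_fired) / len(later_given_fired)
    p_silent = sum(later_given_silent) / len(later_given_silent)
    return p_fired - p_silent
```

The result lies in [-1, 1]. A value near 0 means that firing at t says nothing about firing at t+lag, and a positive value means that firing at t makes later firing more likely. Because the second term subtracts the rate of later firing when the feature was silent at t, a feature that simply fires often does not score high on that basis alone.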
Neighborhood — ranked by edge-count
Findings (1)
finding
- Demonstrates that SAE features more aligned with the emotion subspace are more persistent in Cogito after variance control
Methods (1)
method
- Persistence metric for SAE features: P(fires at t+100 | fired at t) minus P(fires at t+100 | did not fire at t)
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Surprising finding that the two evaluation methods diverge in their relationship with persistence
- The SAE self-evaluation persistence finding serves as a replication of probe-based results that shares no potential probe construction confounds
- Shows gating effect is specific to the self-referential computational regime, not a general feature effect
- The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
- Highly active SAE feature with broad emotional modulation and large corpus presence
- The phenomenon that emotion feature activations remain elevated above baseline beyond local token bursts, measurable as long-range correlation
- Standard interpretability approach that VPD critiques and proposes an alternative to.
- Token-level analysis of OTD and backtracking latent activations aligned at correction points across episodes