method
active
method:sae-feature-firing-probability-persistence-metric
SAE feature firing probability persistence metric
Persistence metric for SAE features: P(fires at t+100 | fired at t) minus P(fires at t+100 | did not fire at t)
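The difference of conditional probabilities above can be sketched in a few lines of NumPy. This is a minimal illustration, not the source's implementation. The function name `firing_persistence`, the `lag` parameter, and the input format (one boolean per token marking whether the feature fired) are assumptions made for this example. The sketch also assumes the tokens form a single contiguous sequence, so pairs (t, t+lag) never cross a document boundary.

```python
import numpy as np


def firing_persistence(fired, lag=100):
    """Return P(fires at t+lag | fired at t) - P(fires at t+lag | did not fire at t).

    fired: 1-D sequence of booleans or 0/1 values, one per token
           (hypothetical input format, assumed for this sketch).
    lag:   token offset between the conditioning and target positions.
    """
    fired = np.asarray(fired, dtype=bool)
    if fired.ndim != 1 or fired.size <= lag:
        raise ValueError("need a 1-D sequence longer than the lag")

    now = fired[:-lag]    # firing state at position t
    later = fired[lag:]   # firing state at position t + lag

    # Both conditional probabilities need at least one conditioning event.
    if now.all() or not now.any():
        return float("nan")

    p_given_fired = later[now].mean()
    p_given_silent = later[~now].mean()
    return float(p_given_fired - p_given_silent)
```

A value near 0 means firing at t carries no information about firing at t+lag. Positive values indicate persistence, and the function returns NaN when the feature always or never fires in the conditioning window, since one of the two conditionals is then undefined.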
Neighborhood — ranked by edge-count
Papers (1)
paper
Methods (1)
method
- P(feature fires at t+100 | fired at t) minus P(feature fires at t+100 | did not fire at t), used because SAE features are binary, unlike probe activations
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
- SAE Feature #77278 fires 195,040 times in corpus, associated with satisfaction vs. emptiness dimension · finding · 0.754 · High-frequency SAE feature reported as controlling fundamental positive vs. negative affect dimension
- The SAE self-evaluation persistence finding serves as a replication of probe-based results that shares no potential probe construction confounds
- Fraction of an SAE feature's length lying inside the 171-dimensional subspace spanned by emotion probes, computed via SVD orthogonalization
- Surprising finding that the two evaluation methods diverge in their relationship with persistence
- Shows gating effect is specific to the self-referential computational regime, not a general feature effect
- Measures emotion feature persistence as correlation between z-scored activation at token 0 and token 100 across all eligible target model tokens
- Baseline persistence of any probe direction arising from the autoregressive nature of LLMs, not specific to emotion content