SAE feature firing probability persistence metric

Persistence metric for SAE features: P(fires at t+100 | fired at t) minus P(fires at t+100 | did not fire at t)

Neighborhood — ranked by edge-count

Papers (1)

paper

Persistence and Introspection of Emotion Features
introduces

Methods (1)

method

SAE Feature Conditional Firing Persistence Metric
related_to
P(feature fires at t+100 | fired at t) minus P(feature fires at t+100 | did not fire at t), used because SAE feature firing is binary (a feature either fires or does not), unlike probe activations
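
The difference of conditional probabilities above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name `firing_persistence`, the single-sequence input, and the way a "firing" is decided (for example, SAE activation > 0) are all assumptions, since the entry does not specify how firings are thresholded or how sequences are pooled.

```python
import numpy as np


def firing_persistence(fired, lag=100):
    """P(fires at t+lag | fired at t) - P(fires at t+lag | did not fire at t).

    `fired` is a 1-D boolean array with one entry per token position of a
    single sequence (hypothetical input format). Returns NaN when one of
    the two conditions never occurs, because that conditional probability
    is then undefined.
    """
    fired = np.asarray(fired, dtype=bool)
    if fired.ndim != 1 or fired.size <= lag:
        raise ValueError("need a 1-D array longer than the lag")

    now = fired[:-lag]   # firing indicator at position t
    later = fired[lag:]  # firing indicator at position t + lag

    if now.all() or not now.any():
        return float("nan")

    p_given_fired = later[now].mean()    # P(later | fired at t)
    p_given_silent = later[~now].mean()  # P(later | did not fire at t)
    return float(p_given_fired - p_given_silent)


# Toy usage with a short lag so the arithmetic is easy to follow.
always_repeats = [1, 0, 1, 0, 1, 0]
never_repeats = [1, 1, 0, 0, 1, 1, 0, 0]
print(firing_persistence(always_repeats, lag=2))
print(firing_persistence(never_repeats, lag=2))
```

The value lies in [-1, 1]. Subtracting the second term removes the feature's base tendency to fire at t+100 regardless of what happened at t, so a feature that simply fires often does not score as persistent.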

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SAE features · concept · 0.755
The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
SAE Feature #77278 fires 195,040 times in corpus, associated with satisfaction vs. emptiness dimension · finding · 0.754
High-frequency SAE feature reported as controlling a fundamental positive vs. negative affect dimension
SAE-based persistence replication of probe-based findings (no shared probe confounds) · claim · 0.745
The SAE self-evaluation persistence finding serves as a replication of probe-based results that shares no potential probe construction confounds
SAE Feature Emotion Subspace Overlap Metric · method · 0.743
Fraction of an SAE feature's length lying inside the 171-dimensional subspace spanned by emotion probes, computed via SVD orthogonalization
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions · claim · 0.741
Surprising finding that the two evaluation methods diverge in their relationship with persistence
SAE feature steering in history, conceptual, and zero-shot control conditions produces zero experience reports under either suppression or amplification · finding · 0.733
Shows that the gating effect is specific to the self-referential computational regime, not a general feature effect
Token-100 correlation persistence metric · method · 0.730
Measures emotion feature persistence as correlation between z-scored activation at token 0 and token 100 across all eligible target model tokens
autoregressive persistence · concept · 0.730
Baseline persistence of any probe direction arising from the autoregressive nature of LLMs, not specific to emotion content
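
For contrast with the binary firing metric, the Token-100 correlation persistence metric listed above works on continuous activations. A rough sketch follows, assuming "token 0 and token 100" means every pair of positions 100 tokens apart within one sequence; the function name `lagged_correlation` and the per-sequence z-scoring are illustrative assumptions, not details given in the entry.

```python
import numpy as np


def lagged_correlation(activations, lag=100):
    """Pearson correlation between z-scored activation at t and at t + lag.

    `activations` is a 1-D array of continuous activations, one per token
    position (hypothetical input format). Returns NaN when either side of
    the comparison is constant, since the correlation is then undefined.
    """
    acts = np.asarray(activations, dtype=float)
    if acts.ndim != 1 or acts.size <= lag + 1:
        raise ValueError("need a 1-D array with at least lag + 2 entries")

    std = acts.std()
    if std == 0:
        return float("nan")
    z = (acts - acts.mean()) / std  # z-score across the whole sequence

    early, late = z[:-lag], z[lag:]
    if early.std() == 0 or late.std() == 0:
        return float("nan")
    return float(np.corrcoef(early, late)[0, 1])


# Toy usage with a short lag.
print(lagged_correlation([1, 2, 3, 4, 5, 6], lag=1))
print(lagged_correlation([1, 2, 1, 2, 1, 2], lag=1))
```

Because any direction in an autoregressive model carries some baseline persistence, a value from either metric is most informative when compared against a control direction or feature rather than read in isolation.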