finding
active
For all three SAEs (1M, 4M, 34M), average active features per token <300, and reconstruction variance explained ≥65%.
Basic SAE performance metrics.
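As a rough illustration of the two metrics named in this finding, the sketch below computes the average number of active features per token and the fraction of variance explained on toy data. The source does not give its exact formulas: the function names, the toy arrays, and the definition of variance explained (one minus residual sum of squares over total sum of squares about the per-dimension mean) are assumptions made for this example.

```python
def average_active_features(feature_acts):
    """Mean count of nonzero feature activations per token (often called L0)."""
    counts = [sum(1 for a in token if a != 0) for token in feature_acts]
    return sum(counts) / len(counts)


def variance_explained(x, x_hat):
    """1 - residual SS / total SS, with total SS taken about the per-dimension mean.

    Assumed definition; the source may normalize differently.
    """
    n, d = len(x), len(x[0])
    mean = [sum(row[j] for row in x) / n for j in range(d)]
    resid = sum((x[i][j] - x_hat[i][j]) ** 2 for i in range(n) for j in range(d))
    total = sum((x[i][j] - mean[j]) ** 2 for i in range(n) for j in range(d))
    return 1.0 - resid / total


# Hypothetical toy data: 3 tokens, 2-dimensional activations, 4 SAE features.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]          # original activations
x_hat = [[1.0, 2.0], [3.0, 4.0], [5.0, 5.0]]      # SAE reconstructions
feats = [[0.0, 0.5, 0.0, 1.2], [0.3, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]

l0 = average_active_features(feats)
fve = variance_explained(x, x_hat)
print(l0, fve)
```

Under these definitions, the finding corresponds to `l0 < 300` and `fve >= 0.65` for each of the three SAEs.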
Source paper
extracted_from
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The specific SAE architecture trained: 100K+ possible features compressed to 64 active per token for layer-40 activations
- Most features dead in largest SAE, indicating room for improvement.
- SAE Feature #77278 fires 195,040 times in corpus, associated with satisfaction vs. emptiness dimension · finding · 0.774 · High-frequency SAE feature reported as controlling fundamental positive vs. negative affect dimension
- SAE features are not simply mirroring individual neurons.
- Explains why variance correction is needed to see the self-evaluation–persistence relationship
- Table 2, row 3, showing equivalence when prior preferences match rewards.
- Arabic feature A/1/3450 and B/1/1334 have activation correlation of 0.91 across 40M tokens · finding · 0.760 · Demonstrates universality of the Arabic script feature across two independently trained transformers
- Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.). · finding · 0.759 · Shows low agreement between the two evaluation modalities