finding

active

finding:for-all-three-saes-1m-4m-34m-average-active-features-per-token-300-and-reconstruction-variance-explained-65

For all three SAEs (1M, 4M, and 34M features), the average number of features active per token was below 300, and the reconstruction explained at least 65% of the variance.

Basic SAE performance metrics.
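As an illustration of how these two metrics are commonly computed, here is a minimal sketch in plain Python. It assumes that "active" means a nonzero feature activation and that variance explained is one minus the ratio of residual to total sum of squares around the per-dimension mean. The source does not spell out either definition, and the function name `sae_basic_metrics` and the toy data are hypothetical.

```python
def sae_basic_metrics(x, x_hat, f):
    """Return (average active features per token, fraction of variance explained).

    x     : list of tokens, each a list of original activations
    x_hat : SAE reconstruction, same shape as x
    f     : list of tokens, each a list of SAE feature activations
    All names and definitions here are illustrative assumptions.
    """
    n_tokens = len(x)
    d = len(x[0])

    # Average count of nonzero feature activations per token.
    avg_active = sum(sum(1 for a in row if a != 0) for row in f) / n_tokens

    # Per-dimension mean of the original activations.
    mean = [sum(row[j] for row in x) / n_tokens for j in range(d)]

    # Residual and total sums of squares over all tokens and dimensions.
    resid = sum((x[i][j] - x_hat[i][j]) ** 2
                for i in range(n_tokens) for j in range(d))
    total = sum((x[i][j] - mean[j]) ** 2
                for i in range(n_tokens) for j in range(d))

    return avg_active, 1.0 - resid / total


# Toy data: 4 tokens, 2 activation dimensions, 3 SAE features.
x = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
x_hat = [[0.5 * v for v in row] for row in x]  # reconstruction at half scale
f = [[1.0, 0.0, 0.0], [0.0, 2.0, 3.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]

avg_active, fve = sae_basic_metrics(x, x_hat, f)
print(avg_active, fve)
```

Under these assumptions, the finding corresponds to `avg_active < 300` and `fve >= 0.65` for each of the three SAEs.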

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SAE sparse features (100K+ features, 64 active per token) · concept · 0.822
The specific SAE architecture trained: 100K+ possible features compressed to 64 active per token for layer-40 activations
34M SAE had roughly 65% dead features. · finding · 0.802
Most features dead in largest SAE, indicating room for improvement.
SAE Feature #77278 fires 195,040 times in corpus, associated with satisfaction vs. emptiness dimension · finding · 0.774
High-frequency SAE feature reported as controlling fundamental positive vs. negative affect dimension
82% of features in 1M SAE had maximum Pearson correlation ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance. · finding · 0.774
SAE features are not simply mirroring individual neurons.
Self-evaluated emotionality of SAE features negatively correlates with activation variance explained (ρ = -0.184, p = 4.6e-09), requiring variance correction to reveal the persistence signal. · finding · 0.772
Explains why variance correction is needed to see the self-evaluation–persistence relationship
Under reward shaping (G=100, H=-100, F=0), Active Inference scored 99.52, Bayesian RL 99.77, Q-learning 95.56, with nearly identical behavior between belief-based agents. · finding · 0.769
Table 2, row 3, showing equivalence when prior preferences match rewards.
Arabic feature A/1/3450 and B/1/1334 have activation correlation of 0.91 across 40M tokens · finding · 0.760
Demonstrates universality of the Arabic script feature across two independently trained transformers
Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.). · finding · 0.759
Shows low agreement between the two evaluation modalities