Scaled SAE training on Claude 3 Sonnet middle residual stream layer

Application of sparse autoencoders (SAEs) to extract features from the middle-layer residual stream of Claude 3 Sonnet, at three dictionary sizes (1M, 4M, and 34M features).

Neighborhood — ranked by edge-count

Methods (1)

method

Sparse Autoencoders (SAE)
extends
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

On SWE-bench, Claude Opus 4.6 and Claude Sonnet 4.6 both achieve 7.4 pp harness-updating gain; Claude Haiku 4.5 achieves 8.0 pp · finding · 0.751
Full evolver-side SWE results showing comparable performance across Claude family tiers
SAE reconstructions on Llama-3-8B layer 25 produce intervened EMD exceeding the natural-natural baseline · finding · 0.747
Empirical demonstration that SAE projections produce divergent representations in a real LLM
SAE training loss (MSE + L1 penalty with decoder norm scaling) · method · 0.744
The objective function combining L2 reconstruction error and L1 penalty scaled by decoder norm, used to train the SAE.
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024) · concept · 0.744
Key paper on scaling SAE-based interpretability to frontier models, cited as precedent
The middle layer residual stream features are causally implicated in multi-step reasoning. · claim · 0.734
Features for Kobe Bryant, California, Lakers participate in computing the capital answer.
SAE features generalize to images despite training only on text, indicating out-of-distribution robustness. · claim · 0.731
A promising property for interpretability analysis off-distribution.
SAE features trained on text activations generalize to image inputs, activating on relevant visual depictions. · finding · 0.727
Out-of-distribution generalization of SAE features.
Claude 3.5 Sonnet shows a higher rate of alignment-faking reasoning than Claude 3 Opus in the helpful-only setting but almost none in the animal welfare setting · finding · 0.719
Linked to Claude 3.5 Sonnet not exhibiting pro-animal-welfare preferences
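
Illustrative note on the "SAE training loss" entry listed above: that entry describes an objective combining L2 reconstruction error with an L1 penalty scaled by decoder norm. The following is a minimal sketch of what such a per-example loss could look like, not the authors' implementation. The ReLU encoder, the names `sae_loss`, `w_enc`, `b_enc`, `w_dec`, `b_dec`, and `l1_coeff`, and the row-per-feature weight layout are all assumptions made for illustration.

```python
import math


def sae_loss(x, w_enc, b_enc, w_dec, b_dec, l1_coeff):
    """Per-example SAE loss sketch (hypothetical names and layout).

    x      : list of d floats (one activation vector)
    w_enc  : n_features rows of d floats (encoder weights)
    b_enc  : n_features floats (encoder bias)
    w_dec  : n_features rows of d floats; row i is feature i's decoder direction
    b_dec  : d floats (decoder bias)
    """
    # Feature activations: assumed ReLU(W_enc x + b_enc).
    f = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]

    # Reconstruction: x_hat = b_dec + sum_i f_i * w_dec[i].
    x_hat = list(b_dec)
    for fi, row in zip(f, w_dec):
        for j, w in enumerate(row):
            x_hat[j] += fi * w

    # Squared L2 reconstruction error.
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))

    # L1 penalty on activations, each scaled by its decoder vector's L2 norm.
    sparsity = sum(
        fi * math.sqrt(sum(w * w for w in row)) for fi, row in zip(f, w_dec)
    )

    return recon + l1_coeff * sparsity
```

Scaling each activation by its decoder norm means the penalty cannot be reduced simply by shrinking activations while growing the corresponding decoder vectors, which is the usual motivation for this form of the sparsity term.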