method
active
method:scaled-sae-training-on-claude-3-sonnet-middle-residual-stream-layer
Scaled SAE training on Claude 3 Sonnet middle residual stream layer
Application of sparse autoencoders (SAEs) to extract features from the middle-layer residual stream of Claude 3 Sonnet, at three dictionary sizes (1M, 4M, and 34M features).
Neighborhood — ranked by edge-count
Methods (1)
method
- Sparse Autoencoders (SAE) · extends · Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Full evolver-side SWE results showing comparable performance across Claude family tiers
- SAE reconstructions on Llama-3-8B layer 25 produce intervened EMD exceeding the natural-natural baseline · finding · 0.747 · Empirical demonstration that SAE projections produce divergent representations in a real LLM
- The objective function combining L2 reconstruction error and L1 penalty scaled by decoder norm, used to train the SAE.
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024) · concept · 0.744 · Key paper on scaling SAE-based interpretability to frontier models, cited as precedent
- The middle layer residual stream features are causally implicated in multi-step reasoning. · claim · 0.734 · Features for Kobe Bryant, California, Lakers participate in computing the capital answer.
- A promising property for interpretability analysis off-distribution.
- Out-of-distribution generalization of SAE features.
- Linked to Claude 3.5 Sonnet not exhibiting pro-animal-welfare preferences
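
One of the related entries above describes the SAE training objective as an L2 reconstruction error plus an L1 penalty scaled by decoder norm. A minimal single-example sketch of that objective in plain Python follows. It assumes a ReLU encoder and a linear decoder, which this page does not state; the names `sae_loss`, `w_enc`, `b_enc`, `w_dec`, `b_dec`, and `lam` are hypothetical and chosen only for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sae_loss(x, w_enc, b_enc, w_dec, b_dec, lam):
    """Illustrative SAE objective for one activation vector x.

    x     : input activation, length d
    w_enc : encoder weights, n_features rows of length d
    w_dec : decoder weights, d rows of length n_features
    lam   : sparsity coefficient (hypothetical name)
    Returns (total, reconstruction_term, sparsity_term).
    """
    # Assumed ReLU encoder: f = max(0, W_enc x + b_enc)
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    f = [max(0.0, p) for p in pre]

    # Linear decoder: x_hat = W_dec f + b_dec
    x_hat = [r + b for r, b in zip(matvec(w_dec, f), b_dec)]

    # Squared L2 reconstruction error
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))

    # L1 penalty on feature activations, each scaled by the L2 norm
    # of the corresponding decoder column
    col_norms = [
        math.sqrt(sum(row[i] ** 2 for row in w_dec)) for i in range(len(f))
    ]
    sparsity = sum(fi * ni for fi, ni in zip(f, col_norms))

    return recon + lam * sparsity, recon, sparsity


# Toy usage with 2 input dimensions and 2 features
total, recon, sparsity = sae_loss(
    x=[1.0, -1.0],
    w_enc=[[1.0, 0.0], [0.0, 1.0]],
    b_enc=[0.0, 0.0],
    w_dec=[[2.0, 0.0], [0.0, 1.0]],
    b_dec=[0.0, 0.0],
    lam=0.5,
)
```

Scaling each activation by its decoder column norm means the penalty cannot be reduced simply by shrinking activations while growing the decoder weights to compensate.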