finding
active
finding:automated-interpretability-claude-3-opus-and-specificity-scoring-show-sae-features-are-significantly-more-interpretable-and-specific-than-mlp-neuronsAutomated interpretability (Claude 3 Opus) and specificity scoring show SAE features are significantly more interpretable and specific than MLP neurons.
Quantitative comparison supporting SAE utility.
Source paper
extracted_fromNeighborhood — ranked by edge-count
Claims (1)
claim
- Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- SAE features are not simply mirroring individual neurons.
- Claim that feature grounding enables interpretability metrics.
- Claude 3 Opus ratings aligned with human judgment of feature descriptions.
- Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
- Out-of-distribution generalization of SAE features.
- Surprising finding that the two evaluation methods diverge in their relationship with persistence
- Extension of mechanistic interpretability findings to the metacognitive domain