claim
active
claim:sae-features-generalize-to-images-despite-training-only-on-text-indicating-out-of-distribution-robustness
SAE features generalize to images despite training only on text, indicating out-of-distribution robustness.
A promising property for interpretability analysis off-distribution.
Source paper
extracted_from
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Out-of-distribution generalization of SAE features.
- Extension of mechanistic interpretability findings to the metacognitive domain.
- Computing attribution as the dot product of the output logit gradient with the SAE decoder weight, multiplied by feature activation.
- Surprising finding that the two evaluation methods diverge in their relationship with persistence.
- Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.
- Claim that feature grounding enables interpretability metrics.
- Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
- SAE training loss decreases as a power law with compute budget when using compute-optimal hyperparameters. finding · 0.774 · From scaling laws sweep.
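
The attribution entry in the list above (output logit gradient dotted with the SAE decoder weight, then multiplied by the feature activation) can be sketched as follows. This is a minimal illustration under assumed shapes, not the source paper's implementation: the function name `feature_attribution`, the array layout (one decoder row per feature), and the toy values are all hypothetical.

```python
import numpy as np

def feature_attribution(logit_grad, decoder_weights, activations):
    """Per-feature attribution: a_i * (grad . W_dec[i]).

    logit_grad:      (d_model,) gradient of the output logit w.r.t. the
                     activations the SAE reconstructs (assumed layout)
    decoder_weights: (n_features, d_model), one decoder direction per row
    activations:     (n_features,) SAE feature activations on this input
    """
    # Dot each decoder direction with the gradient, then scale by activation.
    return activations * (decoder_weights @ logit_grad)

# Toy values, chosen only to make the arithmetic easy to follow.
logit_grad = np.array([1.0, 0.0, -1.0])
decoder_weights = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.5, 0.0, 0.5],
])
activations = np.array([2.0, 3.0, 4.0])

attributions = feature_attribution(logit_grad, decoder_weights, activations)
print(attributions)
```

In this toy case only the first feature receives nonzero attribution: the second decoder direction is orthogonal to the gradient, and the third has components that cancel against it, so a large activation alone does not imply a large attribution.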