claim

active

claim:sae-features-generalize-to-images-despite-training-only-on-text-indicating-out-of-distribution-robustness

SAE features generalize to images despite training only on text, indicating out-of-distribution robustness.

A promising property for interpretability analysis off-distribution.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SAE features trained on text activations generalize to image inputs, activating on relevant visual depictions. · finding · 0.888
Out-of-distribution generalization of SAE features.
SAEs can surface features relevant to meta-cognitive monitoring, not just object-level content representation · claim · 0.809
Extension of mechanistic interpretability findings to the metacognitive domain
Feature attribution via gradient dot product with SAE decoder · method · 0.798
Computing attribution as the dot product of the output logit gradient with the SAE decoder weight, multiplied by feature activation.
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions. · claim · 0.798
Surprising finding that the two evaluation methods diverge in their relationship with persistence
SAE features tend to shatter manifolds into many small and apparently-unrelated pieces, obscuring the overarching semantic structure. · claim · 0.793
Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. · claim · 0.786
Claim that feature grounding enables interpretability metrics.
Sparse Autoencoders (SAE) · method · 0.781
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
SAE training loss decreases as a power law with compute budget when using compute-optimal hyperparameters. · finding · 0.774
From scaling laws sweep.
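
The attribution method listed among the related entities (dot product of the output logit gradient with the SAE decoder weight, multiplied by feature activation) can be illustrated with a minimal NumPy sketch. The function name, array shapes, and toy values below are assumptions made for illustration only; they are not taken from the source paper or from any particular library.

```python
import numpy as np

def feature_attributions(activations, decoder, logit_grad):
    """Per-feature attribution: activation_i * (logit_grad . decoder_i).

    Assumed shapes (hypothetical, for illustration):
      activations: (n_features,)          SAE feature activations at one position
      decoder:     (n_features, d_model)  SAE decoder weights, one row per feature
      logit_grad:  (d_model,)             gradient of the output logit with respect
                                          to the activations the SAE reconstructs
    """
    # decoder @ logit_grad gives one dot product per feature direction.
    return activations * (decoder @ logit_grad)

# Toy values, not real model data.
activations = np.array([2.0, 0.0, 0.5])
decoder = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])
logit_grad = np.array([0.5, -1.0])

attributions = feature_attributions(activations, decoder, logit_grad)
print(attributions)
```

A feature with zero activation receives zero attribution regardless of how well its decoder direction aligns with the gradient, which is why the activation factor is part of the product.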