finding
active
finding:sae-features-trained-on-text-activations-generalize-to-image-inputs-activating-on-relevant-visual-depictions
SAE features trained on text activations generalize to image inputs, activating on relevant visual depictions.
Out-of-distribution generalization of SAE features.
Source paper
extracted_from
Neighborhood (ranked by edge count)
Claims (1)
claim
- Features respond to concepts across languages and in images, not just text.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- A promising property for interpretability analysis off-distribution.
- Standard interpretability approach that VPD critiques and proposes an alternative to.
- Extension of mechanistic interpretability findings to the metacognitive domain.
- Computing attribution as the dot product of the output logit gradient with the SAE decoder weight, multiplied by feature activation.
- Claim that feature grounding enables interpretability metrics.
- Surprising finding that the two evaluation methods diverge in their relationship with persistence.
- Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
- Quantitative comparison supporting SAE utility.
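
One entry in the list above describes attribution as the dot product of the output logit gradient with the SAE decoder weight, multiplied by the feature activation. A minimal sketch of that arithmetic follows, assuming one decoder row per feature and a gradient taken with respect to the activations the SAE reconstructs. The function name `feature_attributions` and all numeric values are hypothetical, chosen only for illustration; they do not come from the source paper.

```python
from typing import List


def feature_attributions(
    logit_grad: List[float],
    decoder: List[List[float]],
    activations: List[float],
) -> List[float]:
    """Per-feature attribution: activation_i * (logit_grad . decoder_i).

    logit_grad  -- gradient of the output logit w.r.t. the model activations
    decoder     -- one decoder weight vector per SAE feature (assumed layout)
    activations -- SAE feature activations on the current input
    """
    if len(decoder) != len(activations):
        raise ValueError("need one decoder row per feature activation")
    result = []
    for weights, act in zip(decoder, activations):
        if len(weights) != len(logit_grad):
            raise ValueError("decoder row and gradient must have equal length")
        # Dot product of the logit gradient with this feature's decoder vector.
        dot = sum(g * w for g, w in zip(logit_grad, weights))
        # Scale by how strongly the feature fired on this input.
        result.append(act * dot)
    return result


# Toy values (hypothetical): 3 features in a 3-dimensional activation space.
logit_grad = [0.5, -1.0, 2.0]
decoder = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [2.0, 2.0, 0.0],
]
activations = [3.0, 0.5, 0.0]

attributions = feature_attributions(logit_grad, decoder, activations)
```

Under this formulation a feature that does not fire (activation 0) receives zero attribution regardless of how well its decoder direction aligns with the gradient.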