finding

active

finding:sae-features-trained-on-text-activations-generalize-to-image-inputs-activating-on-relevant-visual-depictions

SAE features trained on text activations generalize to image inputs, activating on relevant visual depictions.

Out-of-distribution generalization of SAE features.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Neighborhood — ranked by edge-count

Claims (1)

claim

The resulting features are highly abstract: multilingual, multimodal, and generalizing between concrete and abstract references.
supports
Features respond to concepts across languages and in images, not just text.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SAE features generalize to images despite training only on text, indicating out-of-distribution robustness. · claim · 0.888
A promising property for interpretability analysis off-distribution.
Sparse Autoencoders (SAE) activation-based paradigm · framework · 0.828
Standard interpretability approach that VPD critiques and to which it proposes an alternative.
SAEs can surface features relevant to meta-cognitive monitoring, not just object-level content representation. · claim · 0.817
Extension of mechanistic interpretability findings to the metacognitive domain.
Feature attribution via gradient dot product with SAE decoder · method · 0.796
Computing attribution as the dot product of the output logit gradient with the SAE decoder weight, multiplied by feature activation.
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. · claim · 0.793
Claim that feature grounding enables interpretability metrics.
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions. · claim · 0.790
Surprising finding that the two evaluation methods diverge in their relationship with persistence.
Sparse Autoencoders (SAE) · method · 0.789
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
Automated interpretability (Claude 3 Opus) and specificity scoring show SAE features are significantly more interpretable and specific than MLP neurons. · finding · 0.782
Quantitative comparison supporting SAE utility.
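
The feature-attribution entry in the list above describes attribution as the dot product of the output logit gradient with the SAE decoder weight, multiplied by the feature activation. A minimal sketch of that arithmetic follows, assuming one decoder row per feature and a gradient taken with respect to the same activation space the decoder writes into. The function name `feature_attributions` and the toy numbers are hypothetical, chosen only for illustration, and are not taken from the source paper.

```python
from typing import List, Sequence


def feature_attributions(
    logit_grad: Sequence[float],
    decoder_rows: Sequence[Sequence[float]],
    activations: Sequence[float],
) -> List[float]:
    """Return activation_i * (logit_grad . decoder_rows[i]) for each feature i.

    Assumption: decoder_rows[i] is the decoder direction of feature i, and
    logit_grad is the gradient of one output logit in the same space.
    """
    if len(decoder_rows) != len(activations):
        raise ValueError("need one activation per decoder row")
    attributions = []
    for row, act in zip(decoder_rows, activations):
        if len(row) != len(logit_grad):
            raise ValueError("decoder row and gradient must have equal length")
        # Dot product of the logit gradient with this feature's decoder direction.
        dot = sum(g * w for g, w in zip(logit_grad, row))
        # Scale by how strongly the feature fired on this input.
        attributions.append(act * dot)
    return attributions


# Toy example: 3 features in a 3-dimensional activation space.
grad = [1.0, -2.0, 0.5]
decoder = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 2.0],
]
acts = [2.0, 1.0, 3.0]
attr = feature_attributions(grad, decoder, acts)
```

In this toy case the third feature is strongly active but its decoder direction is orthogonal to the gradient, so its attribution is zero: a large activation alone does not imply a large attribution.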