Feature attribution via gradient dot product with SAE decoder

Computes a feature's attribution as the dot product of the output-logit gradient with that feature's SAE decoder weight vector, multiplied by the feature's activation.
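
The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the source's implementation. It assumes the gradient is taken with respect to the activation vector the SAE reconstructs, and that the decoder stores one feature direction per row. The function name `feature_attribution` and all array values are hypothetical.

```python
import numpy as np

def feature_attribution(logit_grad, decoder, activations):
    """Per-feature attribution: activation_i * (decoder_i . logit_grad).

    Assumed shapes (not specified by the source):
      logit_grad:  (d_model,)            gradient of the output logit w.r.t. the
                                         activation vector the SAE reconstructs
      decoder:     (n_features, d_model) one decoder direction per row
      activations: (n_features,)         SAE feature activations on this input
    """
    logit_grad = np.asarray(logit_grad, dtype=float)
    decoder = np.asarray(decoder, dtype=float)
    activations = np.asarray(activations, dtype=float)
    # Project the gradient onto every decoder direction, then scale each
    # projection by how strongly that feature fired.
    return activations * (decoder @ logit_grad)

# Toy values, chosen only for illustration.
logit_grad = np.array([1.0, -2.0, 0.5])
decoder = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 4.0],
])
activations = np.array([2.0, 0.0, 0.5])

attributions = feature_attribution(logit_grad, decoder, activations)
```

Under these assumptions each score is a first-order estimate of how much the logit depends on the feature's contribution to the reconstruction, so a feature with zero activation always receives zero attribution.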

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Gradient-based data attribution · method · 0.800
Baseline method against which probe-based ranking is compared; more computationally expensive.
SAE features generalize to images despite training only on text, indicating out-of-distribution robustness. · claim · 0.798
A promising property for interpretability analysis off-distribution.
SAE features trained on text activations generalize to image inputs, activating on relevant visual depictions. · finding · 0.796
Out-of-distribution generalization of SAE features.
Sparse Autoencoders (SAE) · method · 0.774
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
SAEs can surface features relevant to meta-cognitive monitoring, not just object-level content representation · claim · 0.762
Extension of mechanistic interpretability findings to the metacognitive domain.
TopK Sparse Autoencoders (SAEs) · method · 0.757
Sparse dictionary learning method used to extract interpretable features from EEG transformer embeddings.
SAE training loss (MSE + L1 penalty with decoder norm scaling) · method · 0.755
The objective function combining L2 reconstruction error and L1 penalty scaled by decoder norm, used to train the SAE.
Feature attribution (gradient-based) correlates 0.8 with ablation effects on the 'John' and 'Kobe' examples. · finding · 0.753
Validation of attribution as a fast proxy for causal importance.