method
active
method:feature-attribution-via-gradient-dot-product-with-sae-decoder
Feature attribution via gradient dot product with SAE decoder
Computes each feature's attribution as the dot product of the output-logit gradient with that feature's SAE decoder weight vector, multiplied by the feature's activation.
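A minimal sketch of the computation described above, assuming a toy setup that is not taken from the source: the decoder is stored as a matrix `W_dec` with one row per feature, and the output logit is a linear readout `w_out` of the reconstructed activation, so its gradient is simply `w_out`. All names here (`feature_attribution`, `W_dec`, `w_out`, `feature_acts`) are hypothetical; in a real model the gradient would come from automatic differentiation.

```python
import numpy as np

def feature_attribution(grad_logit, W_dec, feature_acts):
    """Per-feature attribution: activation_i * (grad_logit . decoder_row_i).

    grad_logit:   (d_model,) gradient of the output logit at the SAE site
    W_dec:        (n_features, d_model) decoder matrix, one row per feature
    feature_acts: (n_features,) SAE feature activations
    """
    return feature_acts * (W_dec @ grad_logit)

# Toy data (assumption: random values, purely illustrative).
rng = np.random.default_rng(0)
d_model, n_features = 4, 6
W_dec = rng.normal(size=(n_features, d_model))
feature_acts = np.maximum(rng.normal(size=n_features), 0.0)  # non-negative activations

# Assumption: the logit is a linear readout of the reconstruction,
# so the gradient with respect to the reconstruction is w_out itself.
w_out = rng.normal(size=d_model)
grad_logit = w_out

attr = feature_attribution(grad_logit, W_dec, feature_acts)

# Sanity check against ablation: zero one feature and measure the logit drop.
def logit(acts):
    return w_out @ (acts @ W_dec)

ablation_effects = np.empty(n_features)
for i in range(n_features):
    ablated = feature_acts.copy()
    ablated[i] = 0.0
    ablation_effects[i] = logit(feature_acts) - logit(ablated)

print(attr)
print(ablation_effects)
```

In this linear toy case the attribution matches the ablation effect exactly. With a nonlinear model downstream of the SAE, the attribution is only a first-order approximation of the ablation effect, which is why it serves as a cheaper proxy rather than a replacement.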
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Baseline method against which probe-based ranking is compared; more computationally expensive.
- A promising property for interpretability analysis off-distribution.
- Out-of-distribution generalization of SAE features.
- Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
- Extension of mechanistic interpretability findings to the metacognitive domain.
- Sparse dictionary learning method used to extract interpretable features from EEG transformer embeddings.
- The objective function combining L2 reconstruction error and L1 penalty scaled by decoder norm, used to train the SAE.
- Feature attribution (gradient-based) correlates 0.8 with ablation effects on the 'John' and 'Kobe' examples (finding · 0.753): validation of attribution as a fast proxy for causal importance.