method
active
method:feature-attribution-via-gradient-dot-product-with-sae-decoder

Feature attribution via gradient dot product with SAE decoder

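Computes each SAE feature's attribution to an output logit as the dot product of the logit's gradient (taken with respect to the activations the SAE reconstructs) with that feature's decoder weight vector, multiplied by the feature's activation.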
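
A minimal sketch of this computation, assuming the gradient of the logit has already been obtained (for example from an autograd framework) and that the SAE decoder is stored with one row per feature. The names `feature_attributions`, `logit_grad`, `decoder_rows`, and `activations` are hypothetical, and the numbers are toy values chosen for illustration, not results from any model.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def feature_attributions(
    logit_grad: Sequence[float],
    decoder_rows: Sequence[Sequence[float]],
    activations: Sequence[float],
) -> List[float]:
    """attribution_i = activation_i * (logit_grad . decoder_rows[i]).

    logit_grad:   gradient of one output logit w.r.t. the activations
                  the SAE reconstructs (assumed precomputed).
    decoder_rows: one decoder direction per SAE feature (assumed layout).
    activations:  SAE feature activations at the same position.
    """
    return [a * dot(logit_grad, d) for a, d in zip(activations, decoder_rows)]


# Toy example: 3-dimensional activations, 3 SAE features.
logit_grad = [1.0, 0.0, -1.0]
decoder_rows = [
    [1.0, 0.0, 0.0],  # feature 0 writes along dimension 0
    [0.0, 1.0, 0.0],  # feature 1 writes along dimension 1
    [0.0, 0.0, 1.0],  # feature 2 writes along dimension 2
]
activations = [2.0, 3.0, 0.5]

attributions = feature_attributions(logit_grad, decoder_rows, activations)
print(attributions)
```

In this toy setup, feature 1 is strongly active but receives zero attribution because its decoder direction is orthogonal to the gradient. The score is a first-order (linear) estimate, so for a nonlinear model it only approximates the effect of removing a feature.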

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.