Feature ablation (zeroing feature activations)

Clamping a feature's value to zero to measure its causal effect on model output.
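
As an illustration of the definition above, here is a minimal sketch of zero ablation on a toy sparse autoencoder. Everything in it is an assumption made for the example: the weights `W_enc`, `W_dec`, and `W_out` are random stand-ins, the downstream map is purely linear, and the helper names (`encode`, `logits_from_features`, `ablation_effect`) are hypothetical rather than taken from any particular library or paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, vocab = 8, 16, 5

# Hypothetical toy weights; a real setup would load a trained SAE and model.
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))
W_out = rng.normal(size=(d_model, vocab))


def encode(x):
    """ReLU encoder: residual vector -> non-negative feature activations."""
    return np.maximum(x @ W_enc, 0.0)


def logits_from_features(f):
    """Decode features back to the residual space, then map to logits."""
    return (f @ W_dec) @ W_out


def ablation_effect(x, feature_idx, target):
    """Drop in the target logit when one feature is clamped to zero."""
    f = encode(x)
    clean = logits_from_features(f)[target]
    f_ablated = f.copy()
    f_ablated[feature_idx] = 0.0  # the ablation itself
    ablated = logits_from_features(f_ablated)[target]
    return clean - ablated


x = rng.normal(size=d_model)
effects = np.array([ablation_effect(x, i, target=0) for i in range(n_features)])
```

Features that are already inactive on `x` have an effect of exactly zero, so in practice only the active features need to be ablated. In a real model the downstream computation is nonlinear, and each ablation costs a separate forward pass.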

Neighborhood — ranked by edge-count

claim

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Zero Ablation · method · 0.822
Intervention type that sets activations to zero, used for interpretability analysis
Feature attribution correlates well with ablation effects, making it an efficient proxy for causal effect. · claim · 0.770
Gradient-based attribution approximates ablation impact, enabling fast search for causally important features.
Mechanism by which activation of an emotion feature sometimes leads to later suppression of that same feature · question · 0.748
Identified research gap: the paper observes anti-persistence but has no explanation for it
L0 Norm of Feature Activations · concept · 0.745
Average number of nonzero feature entries per input; primary measure of activation sparsity in the autoencoder
Feature steering (clamping feature activations) · method · 0.743
Modifying model behavior by clamping SAE feature activations to specific values during the forward pass.
Noising/Denoising Activation Patching · method · 0.741
Methods that intentionally introduce divergent representations to test sufficiency and completeness of circuits
Why does activation of an emotion feature sometimes lead to its later suppression? · question · 0.736
Open mechanistic question arising from the causal steering experiment
Feature attribution (gradient-based) correlates 0.8 with ablation effects on the 'John' and 'Kobe' examples. · finding · 0.733
Validation of attribution as a fast proxy for causal importance.
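
Several neighbors above describe gradient-based attribution as a fast proxy for ablation. The sketch below shows one common way to set up that comparison, assuming attribution is computed as activation times gradient (a first-order estimate of the effect of zeroing a feature). The toy downstream function, the weights `W_dec` and `w_out`, and all helper names are hypothetical. The correlation it produces says nothing about the 0.8 figure reported for the 'John' and 'Kobe' examples.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_hidden = 32, 16

# Hypothetical toy downstream computation: scalar = tanh(f @ W_dec) @ w_out
W_dec = rng.normal(scale=0.3, size=(n_features, d_hidden))
w_out = rng.normal(size=d_hidden)


def output(f):
    """Scalar model output as a nonlinear function of feature activations."""
    return np.tanh(f @ W_dec) @ w_out


def attribution(f):
    """Activation x gradient: first-order estimate of each ablation effect."""
    h = f @ W_dec
    grad = W_dec @ ((1.0 - np.tanh(h) ** 2) * w_out)  # d output / d f
    return f * grad


def ablation_effects(f):
    """Exact effect of zeroing each active feature (one evaluation each)."""
    base = output(f)
    effects = np.zeros_like(f)
    for i in np.flatnonzero(f):
        f_ablated = f.copy()
        f_ablated[i] = 0.0
        effects[i] = base - output(f_ablated)
    return effects


f = np.maximum(rng.normal(size=n_features), 0.0)  # sparse, non-negative
attr = attribution(f)
abl = ablation_effects(f)
r = np.corrcoef(attr, abl)[0, 1]  # Pearson correlation across features
```

Attribution needs a single gradient evaluation for all features, while ablation needs one evaluation per active feature. That cost difference is why a high correlation between the two makes attribution useful for a first-pass search.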