method
active
method:feature-ablation-zeroing-feature-activations
Feature ablation (zeroing feature activations)
Clamping a feature's value to zero to measure its causal effect on model output.
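A minimal sketch of the definition above, in plain Python on a toy linear model. Everything here is a hypothetical illustration rather than any paper's setup: the `DECODER` and `UNEMBED` matrices, the `FEATURES` vector, and the helper names are all assumptions. The feature vector is decoded to a hidden state, the hidden state is mapped to logits, and the effect of a feature is read off as the change in logits when that feature is set to zero.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def logits_from_features(feature_acts, decoder, unembed):
    """Decode feature activations to a hidden state, then map it to logits."""
    hidden = matvec(decoder, feature_acts)
    return matvec(unembed, hidden)


def ablate_feature(feature_acts, index):
    """Return a copy of the activations with one feature clamped to zero."""
    ablated = list(feature_acts)
    ablated[index] = 0.0
    return ablated


def ablation_effect(feature_acts, index, decoder, unembed):
    """Change in each logit (ablated minus clean) when one feature is zeroed."""
    clean = logits_from_features(feature_acts, decoder, unembed)
    ablated = logits_from_features(
        ablate_feature(feature_acts, index), decoder, unembed
    )
    return [a - c for a, c in zip(ablated, clean)]


# Hypothetical toy weights: 3 features -> 2 hidden dims -> 3 output logits.
DECODER = [[1.0, 0.0, 0.5],
           [0.0, 1.0, -0.5]]
UNEMBED = [[2.0, 0.0],
           [0.0, 1.0],
           [1.0, 1.0]]
FEATURES = [1.5, 0.0, 2.0]

effect = ablation_effect(FEATURES, 0, DECODER, UNEMBED)
print(effect)  # [-3.0, 0.0, -1.5]
```

Zeroing feature 0 lowers the first and third logits and leaves the second unchanged, which is the kind of evidence one would compare against the feature's interpretation. In a real model the ablation would be applied to the activations during the forward pass rather than to a standalone vector.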
Neighborhood — ranked by edge-count
Claims (1)
claim
- Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Intervention type that sets activations to zero, used for interpretability analysis
- Gradient-based attribution approximates ablation impact, enabling fast search for causally important features.
- Mechanism by which activation of an emotion feature sometimes leads to later suppression of that same feature · question · 0.748 · Identified research gap: the paper observes anti-persistence but has no explanation for it
- Average number of nonzero feature entries per input; primary measure of activation sparsity in the autoencoder
- Modifying model behavior by clamping SAE feature activations to specific values during forward pass.
- Methods that intentionally introduce divergent representations to test sufficiency and completeness of circuits
- Open mechanistic question arising from the causal steering experiment
- Feature attribution (gradient-based) correlates 0.8 with ablation effects on the 'John' and 'Kobe' examples. · finding · 0.733 · Validation of attribution as a fast proxy for causal importance.
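Two entries in the list above describe gradient-based attribution as a fast proxy for ablation. The sketch below illustrates that relationship on a toy nonlinear metric. It assumes the common activation-times-gradient form of attribution, which is a first-order estimate of how much the metric drops when a feature is zeroed; the source entries do not specify the exact formula, and the `metric`, `WEIGHTS`, and `ACTS` names are hypothetical.

```python
import math


def metric(feature_acts, weights):
    """Toy nonlinear output metric: tanh of a weighted sum of features."""
    return math.tanh(sum(w * a for w, a in zip(weights, feature_acts)))


def metric_gradient(feature_acts, weights):
    """Analytic gradient of the toy metric with respect to each feature."""
    pre = sum(w * a for w, a in zip(weights, feature_acts))
    slope = 1.0 - math.tanh(pre) ** 2
    return [slope * w for w in weights]


def attribution_scores(feature_acts, weights):
    """Activation times gradient: one gradient evaluation covers all features."""
    grads = metric_gradient(feature_acts, weights)
    return [a * g for a, g in zip(feature_acts, grads)]


def ablation_scores(feature_acts, weights):
    """Exact drop in the metric when each feature is zeroed, one pass each."""
    clean = metric(feature_acts, weights)
    scores = []
    for index in range(len(feature_acts)):
        ablated = list(feature_acts)
        ablated[index] = 0.0
        scores.append(clean - metric(ablated, weights))
    return scores


def rank_by_magnitude(scores):
    """Feature indices ordered from largest to smallest absolute score."""
    return sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)


# Hypothetical toy values, chosen only for illustration.
WEIGHTS = [0.5, -0.2, 0.1, 0.0]
ACTS = [1.0, 0.5, 2.0, 3.0]

ATTR = attribution_scores(ACTS, WEIGHTS)
ABL = ablation_scores(ACTS, WEIGHTS)
print(rank_by_magnitude(ATTR), rank_by_magnitude(ABL))
```

On this toy the two scores differ in value, because the metric is nonlinear, but they agree in sign and rank the features in the same order. Ablation needs one evaluation per feature while attribution needs a single gradient, which is what makes it usable as a search heuristic before confirming candidates with actual ablations.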