finding
active
finding:feature-attribution-gradient-based-correlates-0-8-with-ablation-effects-on-the-john-and-kobe-examples
Feature attribution (gradient-based) correlates 0.8 with ablation effects on the 'John' and 'Kobe' examples.
Validation of attribution as a fast proxy for causal importance.
Source paper
extracted_from
Neighborhood (ranked by edge count)
Claims (1)
claim
- Gradient-based attribution approximates ablation impact, enabling fast search for causally important features.
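
A minimal sketch of how such a comparison could be run, under stated assumptions: the toy readout `f`, the random `decoder`, `acts`, and `w`, and the use of Pearson correlation are all hypothetical choices for illustration, not the source paper's setup. The sketch does not reproduce the reported 0.8 figure. It only shows the shape of the check: compute each feature's exact ablation effect, compute its first-order gradient attribution, and correlate the two.

```python
import numpy as np

def ablation_effects(f, acts, decoder):
    """Exact change in f when each feature's contribution is removed in turn."""
    x = acts @ decoder                      # reconstructed activation vector
    base = f(x)
    return np.array([base - f(x - acts[i] * decoder[i])
                     for i in range(len(acts))])

def gradient_attributions(grad_f, acts, decoder):
    """First-order estimate: activation * (decoder direction . gradient of f)."""
    x = acts @ decoder
    return acts * (decoder @ grad_f(x))

# Hypothetical toy setup (not from the source): 8 features, 4-dim activations.
rng = np.random.default_rng(0)
n_features, d_model = 8, 4
decoder = rng.normal(size=(n_features, d_model))
acts = np.abs(rng.normal(size=n_features))
w = rng.normal(size=d_model)

# A nonlinear readout, so the gradient is only an approximation of ablation.
f = lambda x: float(w @ np.tanh(x))
grad_f = lambda x: w * (1.0 - np.tanh(x) ** 2)

abl = ablation_effects(f, acts, decoder)
attr = gradient_attributions(grad_f, acts, decoder)
r = np.corrcoef(attr, abl)[0, 1]            # Pearson correlation (an assumption)
print(f"Pearson r between attribution and ablation: {r:.3f}")
```

The attribution pass needs one gradient evaluation for all features, while ablation needs one forward evaluation per feature, which is why a high correlation makes attribution useful as a fast search heuristic. For a purely linear readout the two quantities coincide exactly; the gap comes from nonlinearity.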
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Baseline method against which probe-based ranking is compared; more computationally expensive.
- The magnitude of the normalized gradients (choice of αk) plays an important role in performance. [claim · 0.760] Insight about gradient normalization scaling.
- Table 2, row 3, showing equivalence when prior preferences match rewards.
- We hypothesize that degraded generalization on benchmarks like MMLU may reflect the computational demands of the tasks. [hypothesis · 0.753] Connecting the paper's task-difficulty findings to prior observations of weak generalization on complex QA benchmarks.
- Computing attribution as the dot product of the output logit gradient with the SAE decoder weight, multiplied by feature activation.
- One component of the minimal subnetwork for predicting 'her', discovered via VPD attribution graph.
- Attribution graph reveals a pathway that detects the verb 'lost' and upweights object pronouns. [finding · 0.747] Second component of the subnetwork for 'her', complementing the femaleness signal.
- Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.
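
One of the related entries above describes attribution as the dot product of the output logit gradient with the SAE decoder weight, multiplied by the feature activation. A minimal sketch of that formula follows; the function name, argument names, array shapes, and the example numbers are assumptions made for illustration and do not come from the source.

```python
import numpy as np

def sae_feature_attribution(logit_grad, decoder_weights, activations):
    """Per-feature attribution: activation_i * (decoder_i . d(logit)/d(x)).

    logit_grad:      (d_model,)            gradient of the output logit
    decoder_weights: (n_features, d_model) one decoder direction per feature
    activations:     (n_features,)         SAE feature activations
    """
    return activations * (decoder_weights @ logit_grad)

# Hypothetical toy values, chosen only so the arithmetic is easy to follow.
logit_grad = np.array([1.0, -2.0])
decoder_weights = np.array([[1.0, 0.0],
                            [0.0, 1.0],
                            [1.0, 1.0]])
activations = np.array([2.0, 0.5, 0.0])

scores = sae_feature_attribution(logit_grad, decoder_weights, activations)
print(scores)
```

Here the first feature gets attribution 2.0 × 1.0 = 2.0, the second 0.5 × (−2.0) = −1.0, and the third is zero because it is inactive, regardless of its decoder direction.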