finding

active

finding:feature-attribution-gradient-based-correlates-0-8-with-ablation-effects-on-the-john-and-kobe-examples

Gradient-based feature attribution has a correlation of 0.8 with ablation effects on the 'John' and 'Kobe' examples.

Validation of attribution as a fast proxy for causal importance.
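As a rough illustration of what such a comparison involves, the sketch below computes a gradient-based attribution for each feature (feature activation times the dot product of the output-logit gradient with that feature's decoder direction) and compares it with the effect of ablating the feature, using the Pearson correlation. Everything here is an assumption made for illustration: the toy two-layer readout, the random decoder and activations, and the names `logit`, `logit_grad`, `attributions`, and `ablation_effects` are hypothetical and are not taken from the source paper. The sketch does not reproduce the reported 0.8 figure.

```python
import numpy as np

def logit(x, W1, w2):
    """Toy nonlinear readout standing in for the model downstream of the features."""
    return float(w2 @ np.tanh(W1 @ x))

def logit_grad(x, W1, w2):
    """Analytic gradient of the toy logit with respect to the activation vector x."""
    h = np.tanh(W1 @ x)
    return W1.T @ (w2 * (1.0 - h ** 2))

def attributions(acts, decoder, grad):
    """Per-feature attribution: activation * (gradient . decoder direction)."""
    return acts * (decoder @ grad)

def ablation_effects(acts, decoder, W1, w2):
    """Change in the logit when each feature's contribution is removed in turn."""
    x = acts @ decoder
    base = logit(x, W1, w2)
    return np.array([base - logit(x - a * d, W1, w2)
                     for a, d in zip(acts, decoder)])

# Hypothetical toy setup: 32 features writing into a 16-dimensional space.
rng = np.random.default_rng(0)
n_features, d_model, d_hidden = 32, 16, 8
decoder = rng.normal(size=(n_features, d_model)) / np.sqrt(d_model)
acts = np.maximum(rng.normal(size=n_features), 0.0)  # non-negative, sparse-ish
W1 = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_model)
w2 = rng.normal(size=d_hidden)

x = acts @ decoder
attr = attributions(acts, decoder, logit_grad(x, W1, w2))
abl = ablation_effects(acts, decoder, W1, w2)
r = np.corrcoef(attr, abl)[0, 1]
print(f"Pearson r between attribution and ablation effect: {r:.3f}")
```

Attribution needs one gradient computation for all features, whereas ablation needs one extra forward pass per feature, which is why a high correlation makes attribution attractive as a fast first-pass filter. Attribution is a first-order approximation, so the two quantities agree closely when ablating a feature is a small perturbation and can diverge when the downstream computation is strongly nonlinear.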

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Neighborhood — ranked by edge-count

Claims (1)

claim

Feature attribution correlates well with ablation effects, making it an efficient proxy for causal effect.
supports
Gradient-based attribution approximates ablation impact, enabling fast search for causally important features.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Gradient-based data attribution · method · 0.779
Baseline method against which probe-based ranking is compared; more computationally expensive.
The magnitude of the normalized gradients (choice of αk) plays an important role in performance. · claim · 0.760
Insight about gradient normalization scaling.
Under reward shaping (G=100, H=-100, F=0), Active Inference scored 99.52, Bayesian RL 99.77, Q-learning 95.56, with nearly identical behavior between belief-based agents. · finding · 0.757
Table 2, row 3, showing equivalence when prior preferences match rewards.
We hypothesize that degraded generalization on benchmarks like MMLU may reflect the computational demands of the tasks. · hypothesis · 0.753
Connecting the paper's task-difficulty findings to prior observations of weak generalization on complex QA benchmarks.
Feature attribution via gradient dot product with SAE decoder · method · 0.753
Computing attribution as the dot product of the output logit gradient with the SAE decoder weight, multiplied by feature activation.
Attribution graph for 'the princess lost her crown' reveals a femaleness signal pathway from 'princess' through attention · finding · 0.750
One component of the minimal subnetwork for predicting 'her', discovered via VPD attribution graph.
Attribution graph reveals a pathway that detects the verb 'lost' and upweights object pronouns · finding · 0.747
Second component of the subnetwork for 'her', complementing the femaleness signal.
Probe-based data attribution effectively reduces harmful behaviors via data interventions · claim · 0.746
Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.