Attribution graph construction

Method to trace how parameter subcomponents interact from input to output for a given next-token prediction, producing a subnetwork graph.

Neighborhood — ranked by edge-count

paper

method

Attribution Graphs
related_to · same_as
Gradient-based technique using SAE features to estimate causal effects on completions; used to corroborate NLA findings.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Data Attribution · concept · 0.780
The task of attributing model behaviors to specific training datapoints.
Gradient-based data attribution · method · 0.772
Baseline method against which probe-based ranking is compared; more computationally expensive.
Attribution patching · method · 0.770
Gradient-based method to estimate the effect of zeroing a feature on a specific logit difference.
Attribution graph tracing information flow across parameter subcomponents for specific model predictions (e.g., 'her' vs 'his' pronoun selection) · finding · 0.759
Shows how VPD-identified subnetworks can be analyzed to reveal interpretable pathways of computation (e.g., gender signal routing, syntactic role detection).
Attribution Similarity · method · 0.755
Correlating attribution vectors (feature activation × logit weight of next token) across model pairs to measure functional universality.
Graph Geometry · concept · 0.745
A more complex geometric structure used to characterize in-context learning task representations.
Attribution graph reveals a pathway that detects the verb 'lost' and upweights object pronouns · finding · 0.724
Second component of the subnetwork for 'her', complementing the femaleness signal.
Wolfram Causal Graph · framework · 0.721
A framework from Wolfram physics viewing computation as a causal graph with foliations/time-slices specifying computation order.
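
The selection rule for the list above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the system's actual implementation: the function names, the toy 2-d embeddings, and the edge list are all hypothetical, and only the 0.65 threshold and the "no typed edge" condition come from this page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs whose cosine to `target` is >= threshold
    and which share no typed edge with it, sorted by descending score."""
    # Entities already linked to the target by any typed edge are excluded.
    linked = {b if a == target else a for a, b in typed_edges if target in (a, b)}
    out = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            out.append((name, round(score, 3)))
    return sorted(out, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: entity embeddings and existing typed edges.
EMB = {"A": [1.0, 0.0], "B": [1.0, 0.5], "C": [0.0, 1.0], "D": [1.0, 1.0]}
EDGES = [("A", "B")]  # B already has a typed edge to A, so it is skipped

print(untyped_neighbors("A", EMB, EDGES))
```

Here B is similar to A but already linked, C falls below the threshold, and only D is reported as a candidate for a new edge or an unrecognized duplicate.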