method
active
method:attribution-graph-construction
Attribution graph construction
Method to trace how parameter subcomponents interact from input to output for a given next-token prediction, producing a subnetwork graph.
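As a rough illustration of the idea, the sketch below builds an attribution graph for a toy deep linear network. Several assumptions are not from the source. Hidden units stand in for parameter subcomponents, and the target is a single next-token logit. Each edge is scored as source activation × connecting weight × gradient of the target logit with respect to the destination node. Edges whose absolute weight is at or below a threshold are pruned, which leaves the subnetwork. The actual method may decompose parameters and score edges differently; all function names here are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # row-major: w[j][i] maps input unit i to output unit j
Vector = List[float]
Edge = Tuple[str, str, float]  # (source node, destination node, attribution)


def mat_vec(w: Matrix, x: Vector) -> Vector:
    """Forward step: w @ x."""
    return [sum(wji * xi for wji, xi in zip(row, x)) for row in w]


def mat_t_vec(w: Matrix, g: Vector) -> Vector:
    """Backward step: w.T @ g, i.e. push a gradient back through one layer."""
    return [sum(w[j][i] * g[j] for j in range(len(w))) for i in range(len(w[0]))]


def attribution_graph(x: Vector, layers: List[Matrix], w_out: Vector,
                      threshold: float = 0.0) -> Tuple[float, List[Edge]]:
    # Forward pass: keep every layer's activations (acts[0] is the input).
    acts = [list(x)]
    for w in layers:
        acts.append(mat_vec(w, acts[-1]))
    logit = sum(wo * h for wo, h in zip(w_out, acts[-1]))  # target next-token logit

    # Backward pass: grads[k] = d(logit) / d(acts[k]).
    grads = [list(w_out)]
    for w in reversed(layers):
        grads.insert(0, mat_t_vec(w, grads[0]))

    edges: List[Edge] = []
    for k, w in enumerate(layers):
        for j, row in enumerate(w):
            for i, wji in enumerate(row):
                # source activation * connecting weight * downstream gradient
                e = acts[k][i] * wji * grads[k + 1][j]
                if abs(e) > threshold:  # prune weak edges to keep a subnetwork
                    edges.append((f"L{k}.{i}", f"L{k + 1}.{j}", e))
    last = len(layers)
    for j, h in enumerate(acts[-1]):
        e = h * w_out[j]  # direct contribution of the last layer to the logit
        if abs(e) > threshold:
            edges.append((f"L{last}.{j}", "logit", e))
    return logit, edges


# Usage on a 2 -> 2 -> 1 toy network with a single output logit.
logit, edges = attribution_graph(
    x=[1.0, 2.0],
    layers=[[[1.0, 0.0], [0.5, -1.0]], [[2.0, 1.0]]],
    w_out=[3.0],
)
for src, dst, weight in edges:
    print(f"{src} -> {dst}: {weight:+.2f}")
```

For a purely linear network without biases, the edge weights leaving each layer sum to the target logit, which is a convenient sanity check. With nonlinearities, this conservation need not hold exactly.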
Neighborhood (ranked by edge count)
Papers (1)
paper
- Interpreting Language Model Parameters (introduces)
Methods (1)
method
- Attribution Graphs (related_to, same_as): Gradient-based technique using SAE features to estimate causal effects on completions; used to corroborate NLA findings.
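The Attribution Graphs entry describes using gradients to estimate causal effects. A common first-order version of this idea, sometimes called attribution patching, approximates the effect of zeroing feature f on a logit difference as (0 − a_f) · ∂Δlogit/∂a_f. The sketch below compares that estimate with exact ablation on a hypothetical toy readout. The decoder vectors, the tanh nonlinearity, and the `u_diff` direction are illustrative assumptions, not the actual model used by the method.

```python
import math
from typing import List, Tuple

Vector = List[float]


def residual(acts: Vector, decoder: List[Vector]) -> Vector:
    """Reconstruct the residual vector as sum_f a_f * d_f (SAE-style decoder)."""
    dim = len(decoder[0])
    return [sum(a * d[i] for a, d in zip(acts, decoder)) for i in range(dim)]


def logit_diff(acts: Vector, decoder: List[Vector], u_diff: Vector) -> float:
    """Hypothetical readout: elementwise tanh, then project onto u_A - u_B."""
    r = residual(acts, decoder)
    return sum(u * math.tanh(ri) for u, ri in zip(u_diff, r))


def grad_wrt_acts(acts: Vector, decoder: List[Vector], u_diff: Vector) -> Vector:
    """Analytic gradient d(logit_diff)/d(a_f) via the chain rule."""
    r = residual(acts, decoder)
    # d(logit_diff)/d(r_i) = u_i * (1 - tanh(r_i)^2)
    g_r = [u * (1.0 - math.tanh(ri) ** 2) for u, ri in zip(u_diff, r)]
    # d(r_i)/d(a_f) = d_f[i], so contract over residual dimensions.
    return [sum(gi * di for gi, di in zip(g_r, d)) for d in decoder]


def ablation_effects(acts: Vector, decoder: List[Vector],
                     u_diff: Vector) -> List[Tuple[float, float]]:
    """For each feature, return (first-order estimate, exact effect) of zeroing it."""
    base = logit_diff(acts, decoder, u_diff)
    grads = grad_wrt_acts(acts, decoder, u_diff)
    results = []
    for f, (a, g) in enumerate(zip(acts, grads)):
        estimate = (0.0 - a) * g  # linear (Taylor) approximation of the ablation
        zeroed = list(acts)
        zeroed[f] = 0.0
        exact = logit_diff(zeroed, decoder, u_diff) - base  # rerun with feature zeroed
        results.append((estimate, exact))
    return results


# Usage: compare the cheap estimate with exact ablation for two features.
pairs = ablation_effects([0.8, -0.3], [[1.0, 0.5], [-0.3, 2.0]], [0.8, -1.5])
for f, (est, exact) in enumerate(pairs):
    print(f"feature {f}: first-order {est:+.4f}, exact {exact:+.4f}")
```

The estimate is exact only when the readout is linear in a_f. The more curvature the readout has over the range being ablated, the more the two numbers diverge. The estimate needs one backward pass for all features, whereas exact ablation needs one forward pass per feature.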
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The task of attributing model behaviors to specific training datapoints.
- Baseline method against which probe-based ranking is compared; more computationally expensive.
- Gradient-based method to estimate the effect of zeroing a feature on a specific logit difference.
- Shows how VPD-identified subnetworks can be analyzed to reveal interpretable pathways of computation (e.g., gender signal routing, syntactic role detection).
- Correlating attribution vectors (feature activation × logit weight of next token) across model pairs to measure functional universality
- A more complex geometric structure used to characterize in-context learning task representations
- Attribution graph reveals a pathway that detects the verb 'lost' and upweights object pronouns (finding · 0.724): Second component of the subnetwork for 'her', complementing the femaleness signal.
- A framework from Wolfram physics viewing computation as a causal graph with foliations/time-slices specifying computation order.
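The universality entry in the list above measures agreement by correlating attribution vectors (feature activation × logit weight of the next token) across model pairs. The following is a minimal sketch under two assumptions, neither of which is specified here. First, the features have already been matched across the two models. Second, each feature's logit weight for the observed next token has been precomputed at every position of a shared dataset. Under those assumptions, the sketch forms the elementwise product and then takes the Pearson correlation. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def attribution_vector(acts: Sequence[float], logit_weights: Sequence[float]) -> List[float]:
    """Per-position attribution: activation times the logit weight for the next token."""
    if len(acts) != len(logit_weights):
        raise ValueError("activations and logit weights must align position by position")
    return [a * w for a, w in zip(acts, logit_weights)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of the same length >= 2")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(sxx * syy)


def functional_universality(acts_a: Sequence[float], w_a: Sequence[float],
                            acts_b: Sequence[float], w_b: Sequence[float]) -> float:
    """Correlate the attribution vectors of one matched feature pair from models A and B."""
    return pearson(attribution_vector(acts_a, w_a), attribution_vector(acts_b, w_b))


# Usage with made-up activations and logit weights over four shared positions.
score = functional_universality([0.0, 1.2, 0.4, 2.0], [0.5, 0.9, -0.2, 1.1],
                                [0.1, 1.0, 0.5, 1.8], [0.4, 1.0, -0.1, 1.2])
print(f"attribution correlation: {score:.3f}")
```

Pearson correlation ignores overall scale, so two models whose attributions differ only by a constant factor still score 1.0. This is arguably the desired behavior when comparing models of different sizes.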