Neural network mechanistic interpretability via attribution decomposition

Tracing information flow through parameter subcomponents to isolate computational mechanisms for specific model predictions, using tools like attribution graphs and VPD.

5 members.

Drawn from 1 source

The papers/notes whose extracted claims & findings make up this cluster.

Paper Summary: Interpreting Language Model Parameters (5 members)

Bridges (4)

Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.

Mechanistic interpretability & model evaluation (5 shared)
Mechanistic interpretability via parameter decomposition (5 shared)
Attribution graphs for transformer circuits (3 shared)
Virtually Planned Decomposition interpretability (1 shared)

Findings (4)

Attribution graph for 'the princess lost her crown' reveals a femaleness signal pathway from 'princess' through attention
One component of the minimal subnetwork for predicting 'her', discovered via VPD attribution graph.

Attribution graph reveals a pathway that detects the verb 'lost' and upweights object pronouns
Second component of the subnetwork for 'her', complementing the femaleness signal.

Attribution graph tracing information flow across parameter subcomponents for specific model predictions (e.g., 'her' vs 'his' pronoun selection)
Shows how VPD-identified subnetworks can be analyzed to reveal interpretable pathways of computation (e.g., gender signal routing, syntactic role detection).

Decomposition of all 24 weight matrices in a 67M-parameter LM yields ~10,000 parameter subcomponents
Quantitative result of VPD application; the network's 24 matrices decompose into approximately 10,000 rank-one subcomponents.
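
As a rough illustration of what "rank-one subcomponents" means in the last finding above, the sketch below splits one small weight matrix into rank-one pieces and measures each piece's exact contribution to a scalar readout. It uses a plain SVD as a stand-in, which is not the VPD procedure, and it covers a single linear map rather than a full network. The function names, the matrix sizes, and the `readout` vector are all hypothetical.

```python
import numpy as np


def rank_one_subcomponents(W):
    """Split W into rank-one matrices via SVD (an illustrative stand-in, NOT VPD)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Each piece is s_i * outer(u_i, v_i); the pieces sum back to W.
    return [S[i] * np.outer(U[:, i], Vt[i]) for i in range(len(S))]


def subcomponent_attributions(components, x, readout):
    """Contribution of each rank-one piece to the scalar readout @ (W @ x)."""
    # For a linear map the output is a sum over pieces, so these
    # per-piece contributions add up exactly to the full readout value.
    return np.array([readout @ (C @ x) for C in components])


# Hypothetical toy sizes; a real model matrix would be far larger.
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))      # one weight matrix
x = rng.normal(size=4)           # an input activation
readout = rng.normal(size=6)     # direction whose score we want to explain

components = rank_one_subcomponents(W)
attributions = subcomponent_attributions(components, x, readout)
```

Because the map is linear, the attributions sum to the full readout `readout @ W @ x`. Attribution across nonlinear layers, as in the graphs described above, needs more machinery than this sketch provides.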

Claims (1)

VPD identifies real, computational structure in neural network parameters
Central claim that VPD successfully uncovers genuine mechanisms.