community
active
leiden_hybrid_concepts
label: haiku
community:leiden_hybrid_concepts-run4-c0-c1-c0
Neural network mechanistic interpretability via attribution decomposition
Tracing information flow through parameter subcomponents to isolate computational mechanisms for specific model predictions, using tools like attribution graphs and VPD.
5 members.
Drawn from 1 source
The papers/notes whose extracted claims & findings make up this cluster.
Bridges (4)
Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.
Findings (4)
- Attribution graph for 'the princess lost her crown' reveals a femaleness signal pathway from 'princess' through attention
  One component of the minimal subnetwork for predicting 'her', discovered via VPD attribution graph.
- Attribution graph reveals a pathway that detects the verb 'lost' and upweights object pronouns
  Second component of the subnetwork for 'her', complementing the femaleness signal.
- Attribution graph tracing information flow across parameter subcomponents for specific model predictions (e.g., 'her' vs 'his' pronoun selection)
  Shows how VPD-identified subnetworks can be analyzed to reveal interpretable pathways of computation (e.g., gender signal routing, syntactic role detection).
- Decomposition of all 24 weight matrices in a 67M-parameter LM yields ~10,000 parameter subcomponents
  Quantitative result of VPD application; the network's 24 matrices decompose into approximately 10,000 rank-one subcomponents.
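
To make "rank-one subcomponent" concrete, here is a minimal sketch in NumPy. It is not the VPD procedure, which this page does not describe. It assumes a plain SVD as a stand-in decomposition, purely to illustrate that a weight matrix can be written as a sum of rank-one matrices and that each one contributes additively to the layer's output. The toy matrix `W`, the input `x`, and both function names are hypothetical.

```python
import numpy as np

def rank_one_components(W):
    """Split W into rank-one matrices via SVD (a stand-in, not VPD itself)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Component i is the outer product s_i * u_i v_i^T, which has rank one.
    return [S[i] * np.outer(U[:, i], Vt[i]) for i in range(len(S))]

def component_contributions(components, x):
    """Return each component's additive contribution to the output W @ x."""
    return np.stack([C @ x for C in components])

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))  # hypothetical toy weight matrix
x = rng.normal(size=6)       # hypothetical input activation

components = rank_one_components(W)
contribs = component_contributions(components, x)

# The components sum back to W, so their contributions sum to W @ x.
print(np.allclose(sum(components), W))
print(np.allclose(contribs.sum(axis=0), W @ x))
```

Because the contributions add up exactly to the layer output, one can ask which components matter for a given input. That is the kind of per-component question an attribution graph organizes, although the attribution method used in the source may differ from this simple sum.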
Claims (1)
- VPD identifies real, computational structure in neural network parameters
  Central claim that VPD successfully uncovers genuine mechanisms.