VPD identifies real computational structure in neural network parameters

Central claim that VPD successfully uncovers genuine mechanisms.

Source paper

extracted_from

finding

Attribution graph for 'the princess lost her crown' reveals a femaleness signal pathway from 'princess' through attention
supports
One component of the minimal subnetwork for predicting 'her', discovered via the VPD attribution graph.
Attribution graph reveals a pathway that detects the verb 'lost' and upweights object pronouns
supports
Second component of the subnetwork for 'her', complementing the femaleness signal.
Decomposition of all 24 weight matrices in a 67M-parameter LM yields ~10,000 parameter subcomponents
supports
Quantitative result of VPD application; the network's 24 matrices decompose into approximately 10,000 rank-one subcomponents.
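
To make "rank-one subcomponents" concrete, the sketch below builds a small weight matrix as a sum of outer products u_i v_i^T and checks that the layer output equals the sum of the per-subcomponent contributions u_i (v_i . x). The vectors are hand-picked for illustration and are not subcomponents learned by VPD; the helper names are hypothetical. The last line only does the arithmetic on the figures quoted above, and nothing in this entry says the subcomponents are spread evenly across matrices.

```python
# Illustrative only: hand-picked vectors, not subcomponents learned by VPD.

def outer(u, v):
    """Rank-one matrix u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def matvec(m, x):
    """Matrix-vector product m @ x."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical subcomponents (u_i, v_i) of one 2x3 weight matrix.
components = [
    ([1.0, 0.0], [1.0, 2.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0, 1.0]),
    ([1.0, 1.0], [0.5, 0.0, 0.5]),
]

# W is the sum of the rank-one terms.
W = [[0.0] * 3 for _ in range(2)]
for u, v in components:
    W = add(W, outer(u, v))

x = [1.0, 2.0, 3.0]

# Each subcomponent reads the input along v_i and writes along u_i.
contributions = [[dot(v, x) * ui for ui in u] for u, v in components]
total = [sum(c[k] for c in contributions) for k in range(2)]

# Average subcomponents per matrix implied by ~10,000 over 24 matrices.
avg_per_matrix = 10_000 / 24
```

Here `matvec(W, x)` and `total` agree, which is the property that lets a single prediction be traced back to individual subcomponents.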

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Mechanistic interpretability via parameter decomposition
members_of
Tracing information flow through weight matrices and attention heads using attribution graphs to identify causally important subcomponents in language models.
Neural network mechanistic interpretability via attribution decomposition
members_of
Tracing information flow through parameter subcomponents to isolate computational mechanisms for specific model predictions, using tools like attribution graphs and VPD.
adVersarial Parameter Decomposition (VPD) interpretability
members_of
VPD as a bottom-up method for identifying real computational structure in neural networks

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
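
The filter described above can be sketched as follows, assuming each entity has an embedding vector and that typed edges are stored as a mapping from an entity to the set of entities it is linked to. The function names, the data layout, and the toy 2-d vectors are all assumptions made for illustration; only the 0.65 threshold comes from this page.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities with cosine >= threshold to `target` and no typed edge to it."""
    linked = typed_edges.get(target, set())
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    # Highest similarity first, matching the ordering of the list on this page.
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Toy 2-d embeddings; real entity embeddings would be higher-dimensional.
embeddings = {
    "target": [1.0, 0.0],
    "near_untyped": [4.0, 3.0],  # cosine 0.8, no typed edge: kept
    "near_typed": [1.0, 0.1],    # similar, but already linked: skipped
    "far": [0.0, 1.0],           # cosine 0.0: below threshold
}
typed_edges = {"target": {"near_typed"}}

candidates = untyped_neighbors("target", embeddings, typed_edges)
```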

VPD can be applied to arbitrary neural network architectures · claim · 0.805
Claim of generality, highlighted as a key strength.
The ability to make precise edits demonstrates that VPD identifies real computational machinery · claim · 0.782
Claim that editing success validates VPD's decomposition.
VPD (adVersarial Parameter Decomposition) · concept · 0.755
Core methodological framework introduced in this paper; decomposes weight matrices into rank-one interpretable subcomponents using adversarial ablations.
VPD enables manual model editing through direct parameter manipulation · claim · 0.746
Applied capability claim: VPD enables surgical changes to model behaviour at the parameter level.
VPD scales to a 4-layer 67M-parameter model trained on The Pile · finding · 0.746
Empirical demonstration of VPD on a mid-scale transformer, establishing feasibility.
VPD subcomponents are sparse, interpretable, and avoid feature splitting · claim · 0.734
Assertion about the qualitative advantages of VPD's rank-one decomposition.
Geometric structure in neural network representations drives model behavior · claim · 0.730
Interpretive assertion that representation geometry is not epiphenomenal but causally shapes what models do externally.
Adversarial Parameter Decomposition (VPD) · method · 0.728
Core technique introduced in this paper for decomposing neural network weight matrices into mechanistically simple, interpretable rank-one subcomponents.
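
The model-editing claims in the list above rest on one idea: if a layer is a sum of rank-one subcomponents, a single subcomponent can be scaled or removed without touching the others. The sketch below shows that idea on a toy layer. The component vectors, the `scale` argument, and the "pathway" labels are hypothetical, and this entry does not describe the paper's actual editing procedure.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def forward(components, x, scale=None):
    """Layer output written as sum_i scale_i * u_i * (v_i . x)."""
    scale = scale or [1.0] * len(components)
    out = [0.0] * len(components[0][0])
    for s, (u, v) in zip(scale, components):
        act = s * dot(v, x)  # how strongly this subcomponent fires on x
        out = [o + act * ui for o, ui in zip(out, u)]
    return out

# Hypothetical subcomponents (u_i, v_i); labels are for illustration only.
components = [
    ([1.0, 0.0], [1.0, 0.0, 0.0]),  # "pathway A": writes to output dim 0
    ([0.0, 1.0], [0.0, 1.0, 1.0]),  # "pathway B": writes to output dim 1
]
x = [2.0, 1.0, 3.0]

original = forward(components, x)
# Edit: ablate subcomponent 1 by setting its scale to zero.
edited = forward(components, x, scale=[1.0, 0.0])
```

In this toy case the edit removes pathway B's contribution and leaves pathway A's output unchanged, which is the kind of targeted effect the editing claims describe.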