Paper Summary: Interpreting Language Model Parameters

TL;DR

Adversarial Parameter Decomposition (VPD) decomposes a language model's weight matrices directly into rank-one subcomponents, recovering ~10,000 interpretable parameter pieces across all 24 matrices of a 67M-parameter language model without supervision. Unlike sparse autoencoders, which operate on activations rather than weights, VPD uses gradient descent jointly over the subcomponents and an auxiliary causal importance network that predicts, per prompt, which subcomponents are causally necessary to reproduce the model's behavior. Applied to attention layers — historically the hardest target for mechanistic interpretability — VPD recovers algorithms distributed across multiple attention heads, including previous-token behavior and syntax-boundary routing, identifying them as pairs of query and key subcomponents acting in concert. The model-editing demonstration is directly diagnostic: locating a single subcomponent responsible for emoticon-eye recognition and replacing its output with the unembedding vector for 'o' produces a model that predicts all emoticons as shocked faces, with off-target effects comparable to fine-tuning. Attribution graphs over the recovered subnetwork for a gendered-pronoun prediction task reveal two interpretable pathways — a femaleness signal propagated from 'princess' and a verb-triggered object-pronoun upweighting from 'lost'. The paper argues this constitutes evidence that language model parameters are not irreducibly complex, and that bottom-up interpretability grounded in the model's own computational structure, rather than human-imposed abstractions, is achievable and scalable.
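
As a worked reading of "pairs of query and key subcomponents acting in concert", the derivation below sketches why a rank-one query piece and a rank-one key piece combine into a single scalar pathway. All notation here is assumed for illustration and is not taken from the paper: u_a, v_a, u_b, v_b are the output and input directions of the two pieces, x_i and x_j are residual-stream inputs at positions i and j, queries are computed as W_Q x_i, and the score is for a single head with scaling and positional terms ignored.

```latex
\begin{align*}
W_Q^{(a)} &= u_a v_a^{\top}, \qquad W_K^{(b)} = u_b v_b^{\top} \\
\bigl(W_Q^{(a)} x_i\bigr)^{\top} \bigl(W_K^{(b)} x_j\bigr)
  &= (v_a^{\top} x_i)\,(u_a^{\top} u_b)\,(v_b^{\top} x_j)
\end{align*}
```

Under these assumptions, the pair contributes to the attention score between positions i and j only when the query piece reads something at i, the key piece reads something at j, and their output directions overlap within the head.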

What to take away

  1. VPD (Adversarial Parameter Decomposition) decomposes all 24 weight matrices of a 67M-parameter language model into approximately 10,000 rank-one subcomponents via unsupervised gradient descent.
  2. The rank-one subcomponents of each weight matrix are constrained to sum exactly to that matrix, so the decomposition is a faithful partition of the model's parameters rather than an approximation.
  3. An auxiliary causal importance network is trained jointly with the subcomponents to predict, per prompt, the minimum subset of subcomponents causally necessary to reproduce the target model's output.
  4. Adversarial stress-testing actively searches for prompt combinations that break the causal importance network's predictions of which subcomponents are irrelevant, hardening the reliability of the causal labels.
  5. VPD recovers attention algorithms distributed across multiple attention heads, including previous-token behavior and syntax-boundary routing, as pairs of query and key subcomponents, a capability that prior mechanistic interpretability methods could not achieve automatically.
  6. A single identified subcomponent handling emoticon-eye recognition (e.g., ';', ':', '=') was edited by replacing its output with the model's unembedding vector for 'o', producing a model that predicts all emoticons as shocked faces, with off-target effects comparable to fine-tuning.
  7. The subcomponent at index L2.MLP.down:3382 activates at 0.00% density but reliably fires on punctuation characters such as ':', ';', and '=' that precede text-based emoticons, illustrating extreme specialization within the decomposition.
  8. An attribution graph over the subnetwork for predicting 'her' in 'the princess lost her crown' reveals two interpretable pathways: one routing a femaleness signal from 'princess' via attention and one upweighting object pronouns from the verb 'lost'.
  9. The paper leaves open whether VPD's rank-one constraint and causal importance framework will remain sufficient to recover interpretable structure when scaled to models with billions rather than millions of parameters.
  10. To replicate the decomposition pipeline, a researcher would jointly train, by gradient descent, rank-one subcomponents that sum to the original weights and an auxiliary causal importance network, then adversarially perturb the subcomponents labeled causally unimportant to stress-test the selection.
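
The masking and stress-testing ideas in the list above can be illustrated on a toy matrix. This is a minimal sketch, not the paper's training procedure: the three components are hand-written rather than learned, the "important" sets stand in for the output of a causal importance network, and the names `masked_weight` and `worst_case_error` are hypothetical. The stress test here is a brute-force search over a few mask levels, swapped in for whatever adversarial optimizer the authors use.

```python
from itertools import product

def outer(u, v):
    """Rank-one matrix u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def masked_weight(components, mask):
    """Sum of rank-one pieces, each scaled by its mask value in [0, 1]."""
    rows, cols = len(components[0][0]), len(components[0][1])
    total = [[0.0] * cols for _ in range(rows)]
    for (u, v), g in zip(components, mask):
        piece = outer(u, v)
        for i in range(rows):
            for j in range(cols):
                total[i][j] += g * piece[i][j]
    return total

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def worst_case_error(components, important, x, levels=(0.0, 0.5, 1.0)):
    """Largest output deviation over mask settings of the components
    labeled unimportant (a brute-force stand-in for an adversary)."""
    n = len(components)
    full = matvec(masked_weight(components, [1.0] * n), x)
    free = [c for c in range(n) if c not in important]
    worst = 0.0
    for values in product(levels, repeat=len(free)):
        mask = [1.0] * n
        for c, g in zip(free, values):
            mask[c] = g
        out = matvec(masked_weight(components, mask), x)
        worst = max(worst, max(abs(a - b) for a, b in zip(out, full)))
    return worst

# Hand-written toy components (assumption: real ones are learned).
components = [
    ([1.0, 0.0], [1.0, 0.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0, 0.0]),
    ([1.0, 1.0], [0.0, 0.0, 1.0]),
]
# With an all-ones mask the pieces sum exactly to the full matrix.
W = masked_weight(components, [1.0, 1.0, 1.0])

x = [2.0, 3.0, 0.0]  # component 2 reads x[2] == 0, so it is inert here
good_labels = worst_case_error(components, important={0, 1}, x=x)
bad_labels = worst_case_error(components, important={1}, x=x)
```

In this toy, `good_labels` is zero because the only component left free is inert on `x`, while `bad_labels` is nonzero because component 0 was wrongly labeled unimportant and the search finds a mask that changes the output. That is the kind of false negative the adversarial step is meant to expose.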

Peer brief — for seminar discussion

Goodfire's VPD paper attacks a problem that activation-centric interpretability sidesteps entirely: what computational structure actually lives inside a language model's weight matrices, as opposed to in its runtime activations. The method it introduces, Adversarial Parameter Decomposition (VPD), decomposes each weight matrix into rank-one subcomponents that must sum exactly to the original matrix, are trained via gradient descent to be causally important as rarely as possible, and are jointly optimized with an auxiliary causal importance network that predicts, per prompt, which subcomponents are necessary. An adversarial loop then stress-tests those causal labels by searching for prompt combinations that expose false negatives. The entire pipeline is applied to a 67M-parameter language model, decomposing all 24 weight matrices into roughly 10,000 subcomponents.

The load-bearing finding is that these subcomponents are interpretable and functionally specific. The emoticon-eye subcomponent (L2.MLP.down:3382, activating at 0.00% density) fires exclusively on punctuation like ':', ';', and '=' that precedes emoticons. Editing this single subcomponent to output the unembedding vector for 'o' shifts the model to predict shocked-face emoticons universally, with off-target collateral damage comparable to fine-tuning. This is a direct demonstration that VPD isolates real computational machinery rather than epiphenomenal structure. On attention layers, VPD recovers previous-token and syntax-boundary-routing algorithms distributed across multiple heads, something no prior automatic method could do. Attribution graphs for the pronoun prediction in 'the princess lost her crown' expose two interpretable pathways in the recovered subnetwork.

The implication is that parameter space contains latent, decomposable algorithmic structure, and that bottom-up interpretability, meaning explanation of computation in the network's own terms, is tractable. The authors predict future versions of VPD will scale to larger models while remaining architecturally identical in spirit, and frame the long-run goal as understanding networks well enough to design them intentionally.

A critical reader would push back on the rank-one constraint as a strong architectural prior. Requiring subcomponents to be rank-one guarantees mechanistic simplicity by construction, but it is not obvious that the model's true functional units respect this constraint; the decomposition may be recovering interpretable projections of higher-rank structures rather than the structures themselves. An alternative approach, using sparse dictionary learning directly on weight matrices rather than on activations and without the rank-one restriction, could test whether the interpretability results survive relaxing that assumption. The paper also demonstrates editing and attribution only on one 67M-parameter model trained on IRC/forum text, leaving open whether the ~10,000-component count or the semantic specificity of components generalizes to instruction-tuned or RLHF-trained models at the scale where safety-relevant behaviors actually reside.
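
The single-subcomponent edit described in the brief can be pictured with a toy example. This is a rough sketch under stated assumptions, not the authors' code: the three-token vocabulary, the unembedding vectors, the read direction `v`, the write direction `u`, and the helper names are all hypothetical, and a real edit would act on one rank-one piece of an MLP down-projection inside the full model.

```python
def component_output(u, v, h):
    """Output of one rank-one piece u v^T on input h: u scaled by (v . h)."""
    activation = sum(a * b for a, b in zip(v, h))
    return [activation * ui for ui in u]

def top_token(unembed, residual):
    """Token whose unembedding vector has the largest dot product with residual."""
    scores = {tok: sum(a * b for a, b in zip(vec, residual))
              for tok, vec in unembed.items()}
    return max(scores, key=scores.get)

# Hypothetical toy unembedding: one direction per candidate next token.
unembed = {")": [1.0, 0.0, 0.0], "(": [0.0, 1.0, 0.0], "o": [0.0, 0.0, 1.0]}

v = [1.0, 0.0]       # read direction: fires on an "emoticon eye seen" feature
u = [1.0, 0.5, 0.0]  # original write direction: favors ')' over '('

h_eye = [2.0, 0.0]    # hidden state right after an emoticon eye such as ':'
h_other = [0.0, 1.0]  # hidden state on unrelated text

before = top_token(unembed, component_output(u, v, h_eye))

# The edit: keep the read direction, overwrite the write direction with
# the unembedding vector for 'o'.
u_edited = unembed["o"]
after = top_token(unembed, component_output(u_edited, v, h_eye))

# Where the component does not fire, the edit leaves its output unchanged.
unchanged = component_output(u, v, h_other) == component_output(u_edited, v, h_other)
```

In this toy the edited component is silent on inputs it does not read, so nothing else changes. The paper reports nonzero off-target effects in the real model, comparable to fine-tuning, so the clean separation here should be read as a property of the toy rather than of the method.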

Original abstract

This paper introduces Adversarial Parameter Decomposition (VPD), a technique for interpreting language model parameters by decomposing weight matrices into simple, understandable rank-one components. By splitting the model's parameters into interpretable pieces, the authors can identify algorithms implemented in attention layers, edit model behavior without retraining, and recover small subnetworks responsible for specific behaviors. The approach represents a bottom-up form of interpretability that explains computation in the model's own terms rather than imposing external abstractions.
