paper
active
2026
804
paper:bushnaq-goodfire-vpd-parameters-2026

Interpreting Language Model Parameters

TL;DR

VPD (adVersarial Parameter Decomposition) decomposes weight matrices directly into rank-one, interpretable subcomponents instead of decomposing activations as sparse autoencoders (SAEs) do, changing the substrate that mechanistic interpretability usually works on. Applied to a 67M-parameter, 4-layer transformer trained on The Pile, VPD produces sparse, interpretable subcomponents, avoids the feature-splitting artifacts characteristic of SAE-based approaches, and achieves a better sparsity-reconstruction tradeoff than transcoders, the baseline it is compared against. The core technical innovation is the adversarial ablation procedure that enforces mechanistic faithfulness: subcomponents must survive targeted ablation pressure, so that they correspond to genuine computational roles rather than statistical artifacts. Attention computations are shown to be distributed across heads through parameter subcomponents with distinct, legible functional roles. Because VPD operates on weights rather than activations, it also enables manual model editing through direct parameter manipulation, which activation-patching and SAE-steering pipelines do not offer. The paper argues that the SAE program's foundational choice, treating activations as the unit of analysis, is an unnecessary constraint rather than one option among many, and that parameter-level decomposition may be the more faithful path to understanding what a model has learned. Whether VPD generalizes to frontier scale is left as the central open question.
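To make "rank-one subcomponents" concrete, here is a minimal NumPy sketch of writing a weight matrix as a sum of outer products. It uses the SVD purely as a stand-in factorization; this summary does not say how VPD obtains its components, and a learned decomposition would generally differ from the SVD. The matrix `W`, its shape, and the names `write_dirs` and `read_dirs` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))          # hypothetical weight matrix (d_out x d_in)

# Stand-in factorization (assumption): the SVD gives W = sum_c s_c * u_c v_c^T.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
write_dirs = U * S                   # column c is s_c * u_c (what component c writes)
read_dirs = Vt                       # row c is v_c (what component c reads)

# Each subcomponent is a rank-one matrix; together they sum back to W.
components = [np.outer(write_dirs[:, c], read_dirs[c]) for c in range(len(S))]
W_rebuilt = sum(components)

# Applying W to an input equals reading one scalar per component, then writing it out.
x = rng.normal(size=3)
activations = read_dirs @ x          # one scalar per component
y_from_components = write_dirs @ activations
```

Each component reads the input along one direction and writes to the output along another, which is what makes a single subcomponent a candidate unit of interpretation.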

What to take away

  1. VPD (adVersarial Parameter Decomposition) decomposes weight matrices into rank-one subcomponents rather than decomposing activations, inverting the substrate assumption of SAE-based mechanistic interpretability.
  2. The method was validated on a 67M-parameter, 4-layer transformer trained on The Pile, which is the largest model reported in this work.
  3. VPD achieves a better sparsity-reconstruction tradeoff than transcoders, the baseline it is compared against.
  4. Adversarial ablation (iteratively removing subcomponents and penalizing those whose absence does not degrade targeted computations) is the core training signal that enforces mechanistic faithfulness in VPD.
  5. Attention computations in the 4-layer model are distributed across heads via VPD subcomponents that carry distinct, human-interpretable functional roles, a finding not straightforwardly accessible through activation-level analysis.
  6. VPD avoids feature splitting, a known failure mode of SAEs in which a single functional feature is fragmented across multiple dictionary elements.
  7. Because subcomponents are defined in weight space, VPD enables manual model editing through direct parameter manipulation rather than the activation-space interventions used in feature steering.
  8. To replicate the adversarial ablation protocol, a researcher would train the decomposition on a frozen model, iteratively identify subcomponents whose ablation minimally affects targeted outputs, and use that signal as a faithfulness regularizer during decomposition optimization.
  9. An open question the paper raises is whether VPD's sparsity and interpretability properties survive at frontier scale, given that all reported experiments are at 67M parameters and 4 layers.
  10. The paper's critique of SAEs operates at the substrate level, arguing that weights rather than activations are the more principled unit of analysis. This complements geometry-level critiques that SAEs shatter activation manifolds.
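The adversarial ablation idea in the list above can be sketched on a toy problem. This is one possible reading, not the paper's procedure: it assumes a single linear layer, a squared-error loss, masks in [0, 1], and an adversary that runs projected gradient ascent over the components claimed to be unnecessary for a given input. The function name, its arguments, and the toy data are all hypothetical.

```python
import numpy as np

def adversarial_ablation_loss(components, x, inactive, steps=50, lr=0.1, seed=0):
    """Worst-case output error when an adversary rescales the 'inactive' components.

    components: array (C, d_out, d_in) of rank-one matrices that sum to W
    x:          input vector (d_in,)
    inactive:   boolean array (C,), True where a component is claimed unnecessary for x
    """
    rng = np.random.default_rng(seed)
    contrib = components @ x                  # (C, d_out): each component's output on x
    target = contrib.sum(axis=0)              # output of the intact layer, W @ x
    mask = np.ones(len(components))
    mask[inactive] = rng.uniform(size=inactive.sum())  # adversary's starting point
    for _ in range(steps):
        residual = mask @ contrib - target    # (d_out,)
        grad = 2.0 * contrib @ residual       # gradient of the error w.r.t. mask, (C,)
        # Gradient ascent on the error, only over inactive components, kept in [0, 1].
        mask[inactive] = np.clip(mask[inactive] + lr * grad[inactive], 0.0, 1.0)
    residual = mask @ contrib - target
    return float(residual @ residual), mask

# Toy layer W = I (3x3), split into three rank-one components e_i e_i^T.
components = np.stack([np.outer(e, e) for e in np.eye(3)])
x = np.array([1.0, 2.0, 0.0])

# Honest claim: component 2 does nothing on this input, so ablating it is free.
honest_loss, _ = adversarial_ablation_loss(components, x, np.array([False, False, True]))
# False claim: component 1 matters here, so the adversary finds a damaging ablation.
false_loss, _ = adversarial_ablation_loss(components, x, np.array([False, True, False]))
```

Used as a regularizer, a loss of this kind would push the decomposition toward claims of inactivity that hold even under worst-case ablation, which is the sense of faithfulness the summary describes.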

Peer brief — for seminar discussion

Working at Goodfire in collaboration with MATS and independent researchers, Bushnaq et al. introduce VPD (adVersarial Parameter Decomposition), a method that decomposes a model's weight matrices into rank-one interpretable subcomponents rather than decomposing its activations as sparse autoencoders do. All experiments are conducted on a 67M-parameter, 4-layer transformer trained on The Pile. The central technical mechanism is an adversarial ablation procedure: during decomposition, subcomponents are subjected to targeted ablation pressure, and those whose removal does not degrade specific computations are penalized, forcing the retained components to correspond to genuine functional roles rather than statistical decomposition artifacts. This distinguishes VPD from transcoders, the closest baseline, against which VPD reports a better sparsity-reconstruction tradeoff. The load-bearing finding is that weight-space decomposition yields sparse, interpretable subcomponents without feature splitting, and that attention computations in the tested model are distributed across heads through subcomponents with legible roles, a structure that activation-patching pipelines would not directly expose. Because subcomponents live in weight space, VPD also enables manual model editing through direct parameter manipulation, which is qualitatively different from activation steering or SAE feature clamping. The paper argues that the SAE program's choice to treat activations as the unit of analysis is an unnecessary constraint, not a principled default, and predicts that parameter-level decomposition is a more faithful path to understanding learned computation. The central hypothesis left open is whether these properties survive at frontier scale; Goodfire's concurrent infrastructure work suggests this is their working direction.
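The contrast between weight-space editing and activation-space steering can be illustrated with a minimal sketch. It assumes a subcomponent is available as a pair of vectors `u` and `v` with the layer's weight containing `u v^T`; the helper `remove_component` and the toy layer are hypothetical and are not taken from the paper.

```python
import numpy as np

def remove_component(W, u, v):
    """Return a copy of W with the rank-one subcomponent u v^T subtracted."""
    return W - np.outer(u, v)

# Hypothetical 2x3 layer built from two rank-one subcomponents.
u1, v1 = np.array([1.0, 0.0]), np.array([1.0, 0.0, 0.0])
u2, v2 = np.array([0.0, 1.0]), np.array([0.0, 2.0, 0.0])
W = np.outer(u1, v1) + np.outer(u2, v2)

# The edit changes the stored parameters once; no per-forward-pass hook is needed.
W_edited = remove_component(W, u2, v2)

x = np.array([3.0, 5.0, 7.0])
before = W @ x          # both subcomponents contribute
after = W_edited @ x    # only the first subcomponent remains
```

Because the change lives in the parameters, it applies to every subsequent input without an intervention at inference time, which is the qualitative difference from clamping an SAE feature or steering an activation.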
A critical reader would push back most forcefully on the scale limitation: a 67M-parameter, 4-layer model is far smaller and shallower than the multi-billion-parameter, many-layer transformers where interpretability matters most, and rank-one decomposition of weight matrices may face scaling challenges (in expressivity, in the number of required components, or in the computational cost of the adversarial ablation procedure) that are simply not visible at 4 layers. The paper does not provide ablations across model sizes, so the sparsity-reconstruction advantage over transcoders cannot yet be attributed to anything other than the specific small-model regime tested. An alternative the work could have benchmarked against more directly is activation patching, including path patching, which also aims at mechanistic faithfulness but operates on residual-stream activations rather than weights, and whose tradeoffs against VPD at matched model size remain uncharacterized.

Datasets (1)

  • The Pile
    Training corpus used for the 67M-parameter model tested with VPD.

Cross-corpus bridges (1)

same_concept_as · Nomic cosine

External markdown files that talk about the same concept as this entity.

  • aboutblank_kb
    Autoencoder Architecture · frameworks/variational-autoencoder-architecture.md · 0.789