paper:bushnaq-goodfire-vpd-parameters-2026
Interpreting Language Model Parameters
TL;DR
VPD (adVersarial Parameter Decomposition) decomposes weight matrices directly into rank-one, interpretable subcomponents rather than decomposing activations as sparse autoencoders (SAEs) do, inverting the usual choice of substrate in mechanistic interpretability. Applied to a 67M-parameter, 4-layer transformer trained on The Pile, VPD produces subcomponents that are sparse and interpretable while avoiding the feature-splitting artifacts characteristic of SAE-based approaches, and it achieves a better sparsity-reconstruction tradeoff than transcoders, the baseline it is compared against. The adversarial ablation procedure that enforces mechanistic faithfulness is the method's core technical innovation: subcomponents must survive targeted ablation pressure, ensuring they correspond to genuine computational roles rather than statistical artifacts. Attention computations are shown to distribute across heads through parameter subcomponents with distinct, legible functional roles. Because VPD operates on weights rather than activations, it also enables manual model editing through direct parameter manipulation, a capability not available to activation-patching or SAE-steering pipelines. The paper argues that the SAE program's foundational choice, treating activations as the unit of analysis, is not merely one option among many but an unnecessary constraint, and that parameter-level decomposition may be the more faithful path to understanding what a model has learned. Whether VPD generalizes to frontier scale is left as the central open question.
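To make "rank-one subcomponents of a weight matrix" concrete, here is a minimal NumPy sketch. It is not the paper's procedure: SVD is used only as a convenient source of rank-one terms, whereas VPD learns its subcomponents by optimisation. The toy matrix `W`, its shape, and the helper `forward` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))  # toy stand-in for a single weight matrix

# Assumption: SVD stands in for the learned decomposition. It only
# illustrates the form W = sum_i C_i with every C_i of rank one.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
subcomponents = [S[i] * np.outer(U[:, i], Vt[i]) for i in range(len(S))]

# Each term is rank one, and the terms sum back to the original matrix.
ranks = [np.linalg.matrix_rank(C, tol=1e-10) for C in subcomponents]
reconstruction = sum(subcomponents)

def forward(x, mask):
    """Apply the layer with subcomponent i scaled by mask[i] (0 = ablated)."""
    W_masked = sum(m * C for m, C in zip(mask, subcomponents))
    return W_masked @ x
```

Setting entries of `mask` to zero ablates individual subcomponents, which is the kind of intervention the ablation-based faithfulness checks in this summary rely on.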
What to take away
- 1. VPD (adVersarial Parameter Decomposition) decomposes weight matrices into rank-one subcomponents rather than decomposing activations, inverting the substrate assumption of SAE-based mechanistic interpretability.
- 2. The method was validated on a 67M-parameter, 4-layer transformer trained on The Pile, which is the largest model reported in this work.
- 3. VPD achieves a better sparsity-reconstruction tradeoff than transcoders, the baseline it is compared against; transcoders learn a sparse replacement for a layer from its input and output activations.
- 4. Adversarial ablation — iteratively removing subcomponents and penalizing those whose absence does not degrade targeted computations — is the core training signal that enforces mechanistic faithfulness in VPD.
- 5. Attention computations in the 4-layer model distribute across heads via VPD subcomponents that carry distinct, human-interpretable functional roles, a finding not straightforwardly accessible through activation-level analysis.
- 6. VPD avoids feature splitting, a known failure mode of SAEs in which a single functional feature is fragmented across multiple dictionary elements.
- 7. Because subcomponents are defined in weight space, VPD enables manual model editing through direct parameter manipulation rather than the activation-space interventions used in feature steering.
- 8. To replicate the adversarial ablation protocol, a researcher would train decomposition on a frozen model, iteratively identify subcomponents whose ablation minimally affects targeted outputs, and use that signal as a faithfulness regularizer during decomposition optimization.
- 9. An open question the paper raises is whether VPD's sparsity and interpretability properties survive at frontier scale, given that all reported experiments are at 67M parameters and 4 layers.
- 10. The paper's critique of SAEs operates at the substrate level — arguing that decomposing weights rather than activations is a more principled unit of analysis — complementing geometry-level critiques that SAEs shatter activation manifolds.
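The adversarial ablation idea in items 4 and 8 can be sketched on a toy linear layer. The Methods entry describes the search as looking for combinations that break the prediction of which subcomponents are unimportant; the code below follows that reading. Everything here is an assumption for illustration: the random subcomponents, the hypothetical `predicted_important` flags, the squared-error loss, and the exhaustive search over binary masks, which stands in for however the paper's adversary is actually optimised.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_comp, d_out, d_in = 5, 4, 3
us = rng.normal(size=(n_comp, d_out))
vs = rng.normal(size=(n_comp, d_in))
subcomponents = [np.outer(u, v) for u, v in zip(us, vs)]
W = sum(subcomponents)
x = rng.normal(size=d_in)
target = W @ x  # output of the unablated layer

# Hypothetical per-input prediction: True = subcomponent is needed.
predicted_important = np.array([True, True, False, False, False])

def ablation_loss(mask):
    """Squared output error when subcomponent i is scaled by mask[i]."""
    W_masked = sum(m * C for m, C in zip(mask, subcomponents))
    return float(np.sum((W_masked @ x - target) ** 2))

def adversarial_mask(predicted_important):
    """Worst-case binary ablation of the supposedly unimportant subcomponents."""
    free = np.flatnonzero(~predicted_important)
    best_mask, best_loss = None, -1.0
    for bits in itertools.product([0.0, 1.0], repeat=len(free)):
        mask = np.ones(len(predicted_important))
        mask[free] = bits  # important subcomponents always stay on
        loss = ablation_loss(mask)
        if loss > best_loss:
            best_mask, best_loss = mask, loss
    return best_mask, best_loss

worst_mask, worst_loss = adversarial_mask(predicted_important)

# A randomly sampled ablation of the same subcomponents, for comparison.
random_mask = np.ones(n_comp)
random_mask[~predicted_important] = rng.integers(0, 2, size=3)
random_loss = ablation_loss(random_mask)
```

In this sketch the worst-case loss is the quantity that would serve as the faithfulness signal: if it is large, the claim that those subcomponents are unimportant for this input does not hold. By construction it is never smaller than the loss under a randomly sampled ablation of the same subcomponents.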
Peer brief — for seminar discussion
Working at Goodfire in collaboration with MATS and independent researchers, Bushnaq et al. introduce VPD (adVersarial Parameter Decomposition), a method that decomposes a model's weight matrices into rank-one interpretable subcomponents rather than decomposing its activations as sparse autoencoders do. All experiments are conducted on a 67M-parameter, 4-layer transformer trained on The Pile. The central technical mechanism is an adversarial ablation procedure: during decomposition, subcomponents are subjected to targeted ablation pressure, and those whose removal does not degrade specific computations are penalized, forcing the retained components to correspond to genuine functional roles rather than statistical decomposition artifacts. VPD reports a better sparsity-reconstruction tradeoff than transcoders, the baseline it is compared against. The load-bearing finding is that weight-space decomposition yields sparse, interpretable subcomponents without feature splitting, and that attention computations in the tested model distribute across heads through subcomponents with legible roles, a structure that activation-patching pipelines would not directly expose. Because subcomponents live in weight space, VPD also enables manual model editing through direct parameter manipulation, which is qualitatively different from activation steering or SAE feature clamping. The paper argues that the SAE program's choice to treat activations as the unit of analysis is an unnecessary constraint, not a principled default, and predicts that parameter-level decomposition is a more faithful path to understanding learned computation. The central hypothesis left open is whether these properties survive at frontier scale; Goodfire's concurrent infrastructure work suggests this is their working direction.
A critical reader would push back on the scale limitation most forcefully: a 67M-parameter, 4-layer model is far smaller and shallower than the multi-billion-parameter, many-layer transformers where interpretability matters most. Rank-one decomposition of weight matrices may face scaling challenges that are simply not visible at 4 layers, whether in expressivity, in the number of required components, or in the computational cost of the adversarial ablation procedure. The paper does not provide ablations across model sizes, so the sparsity-reconstruction advantage over transcoders cannot yet be assumed to hold outside the specific small-model regime tested. An alternative the work could have benchmarked against more directly is path patching, a finer-grained form of activation patching, which also aims at mechanistic faithfulness but operates on residual-stream activations rather than weights, and whose tradeoffs against VPD at matched model size remain uncharacterized.
Methods (6)
- adVersarial Parameter Decomposition (VPD)
Core technique introduced in this paper for decomposing neural network weight matrices into mechanistically simple, interpretable rank-one subcomponents.
- Adversarial search for causally unimportant subcomponents
Procedure in VPD that actively searches for combinations that break the prediction of which subcomponents are unimportant, stress-testing the decomposition.
- Attribution graph construction
Method to trace how parameter subcomponents interact from input to output for a given next-token prediction, producing a subnetwork graph.
- Model editing via direct subcomponent overwrite
Technique to alter model behavior by directly editing a parameter subcomponent without training, demonstrated by changing an emoticon eye subcomponent.
- Rank-one matrix decomposition
Constraint in VPD where each parameter subcomponent is constrained to be a rank-one matrix for simplicity.
- Sparse Autoencoders (SAE)
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
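The "direct subcomponent overwrite" entry can be illustrated with a small NumPy sketch. It assumes each subcomponent has the form `outer(u, v)`, where `v` is the input direction it reads and `u` the output direction it writes; the helper name `overwrite_output_direction`, the toy shapes, and the random vectors are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, n_comp = 4, 5, 3
us = rng.normal(size=(n_comp, d_out))  # output (write) directions
vs = rng.normal(size=(n_comp, d_in))   # input (read) directions
W = sum(np.outer(u, v) for u, v in zip(us, vs))

def overwrite_output_direction(W, u_old, v, u_new):
    """Swap the rank-one term outer(u_old, v) for outer(u_new, v), no training."""
    return W - np.outer(u_old, v) + np.outer(u_new, v)

k = 1  # index of the subcomponent being edited
u_new = rng.normal(size=d_out)
W_edited = overwrite_output_direction(W, us[k], vs[k], u_new)

# An input along the edited read direction, and one orthogonal to it.
x_read = vs[k]
x_orth = rng.normal(size=d_in)
x_orth -= (x_orth @ vs[k]) / (vs[k] @ vs[k]) * vs[k]

delta_read = (W_edited - W) @ x_read   # changes by (u_new - u_old) * |v|^2
delta_orth = (W_edited - W) @ x_orth   # numerically zero
```

Because the edit is itself rank one, it changes the layer's output only for inputs with a component along the edited read direction, which is one way to see why a weight-space edit of this kind can be targeted.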
Datasets (1)
- The Pile
Training corpus used for the 67M-parameter model tested with VPD.
Findings (3)
- Attention computations distribute across heads via parameter subcomponents with interpretable roles
Mechanistic discovery about how attention mechanisms decompose into interpretable parameter components.
- VPD achieves better sparsity-reconstruction tradeoff than transcoders on 67M model
Empirical result demonstrating VPD's efficiency advantage in parameter decomposition.
- VPD scales to a 4-layer 67M-parameter model trained on The Pile.
Empirical demonstration of VPD on a small transformer, establishing feasibility at this scale.
Claims (6)
- VPD decomposes parameters, not activations, flipping the standard SAE / activation-patching paradigm.
Core proposition of the paper: a substrate-level critique of existing interpretability methods.
- VPD subcomponents avoid feature splitting, improving interpretability over SAE approach
Core interpretative claim that VPD's parameter-based decomposition prevents the feature fragmentation seen in activation-based methods.
- VPD subcomponents are sparse, interpretable, and avoid feature splitting.
Assertion about the qualitative advantages of VPD's rank-one decomposition.
- VPD achieves a better sparsity-reconstruction tradeoff than transcoders.
Quantitative advantage claimed for VPD over a prior activation-decomposition method.
- VPD enables manual model editing through direct parameter manipulation.
Applied capability claim: VPD enables surgical changes to model behaviour at the parameter level.
- Adversarial ablations enforce mechanistic faithfulness.
Methodological claim that the adversarial ablation approach ensures decomposed components causally correspond to computation.
Questions (1)
- Do VPD's mechanistic faithfulness and interpretability survive at frontier model scale?
Open research question about whether VPD generalizes beyond the tested 67M-parameter regime.
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.
- A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models · Dong Shu, Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, Mengnan Du · 2025 · ≈ 85%
- Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition · Dan Braun, Lucius Bushnaq, Stefan Heimersheim, Jake Mendel, Lee Sharkey · 2025 · ≈ 84%
- SAE-V: Interpreting Multimodal Models for Enhanced Alignment · Hantao Lou, Changye Li, Jiaming Ji, Yaodong Yang · 2025 · ≈ 84%
- Resurrecting the Salmon: Rethinking Mechanistic Interpretability with Domain-Specific Sparse Autoencoders · Charles O'Neill, Mudith Jayasekara, Max Kirkby · 2025 · ≈ 84%
- Mechanistic Interpretability of Antibody Language Models Using SAEs · Rebonto Haque, Oliver M. Turnbull, Anisha Parsan, Nithin Parsan, John J. Yang, Anna L. Beukenhorst, Charlotte M. Deane · 2026 · ≈ 84%
- Supervised sparse auto-encoders for interpretable and compositional representations · Ouns El Harzli, Hugo Wallner, Yoonsoo Nam, Haixuan Xavier Tao · 2026 · ≈ 84%
- Group Equivariance Meets Mechanistic Interpretability: Equivariant Sparse Autoencoders · Ege Erdogan, Ana Lucic · 2025 · ≈ 84%
- Insights into a radiology-specialised multimodal large language model with sparse autoencoders · Kenza Bouzid, Shruthi Bannur, Felix Meissen, Daniel Coelho de Castro, Anton Schwaighofer, Javier Alvarez-Valle, Stephanie L. Hyland · 2025 · ≈ 83%
- Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality · Sewoong Lee, Adam Davies, Marc E. Canby, Julia Hockenmaier · 2025 · ≈ 83%
- A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering · Jiaqing Xie · 2025 · ≈ 83%
- Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders · Dong Shu, Xuansheng Wu, Haiyan Zhao, Mengnan Du, Ninghao Liu · 2025 · ≈ 83%
- Interpreting Attention Layer Outputs with Sparse Autoencoders · Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda · 2024 · ≈ 83%
- Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs · Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, Kun Zhang · 2025 · ≈ 83%
- Sparse Autoencoders Do Not Find Canonical Units of Analysis · Patrick Leask, Bart Bussmann, Michael Pearce, Joseph Bloom, Curt Tigges, Noura Al Moubayed, Lee Sharkey, Neel Nanda · 2025 · ≈ 83%
- Interpretable Steering of Large Language Models with Feature Guided Activation Additions · Samuel Soo, Chen Guang, Wesley Teng, Chandrasekaran Balaganesh, Tan Guoxian, Yan Ming · 2025 · ≈ 83%
- Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders · in corpus · 2026 · ≈ 83%
- Anima Labs Phenomenology Pt1 · in corpus · ≈ 78%
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? · in corpus · 2025 · ≈ 78%
- Efficient Transformers: A Survey · cited · 2022 · ≈ 77%
Cross-corpus bridges (1)
same_concept_as · Nomic cosine
External markdown files that talk about the same concept as this entity.
- aboutblank_kb · Autoencoder Architecture · frameworks/variational-autoencoder-architecture.md · 0.789