Interpreting Language Model Parameters

By Lucius Bushnaq · Dan Braun · Oliver Clive-Griffin · Bart Bussmann · Nathan Hu · Michael Ivanitskiy · +2 more · Goodfire

LLM Interpretability & Behavioral Analysis · LLM interpretability & self-awareness · LLM Introspection · Neural Geometry · Activation decomposition · Adversarial Parameter Decomposition (VPD) · The Pile · Attention heads · Adversarial search for causally unimportant subcomponents · causal importance network · Attribution graph construction · Feature splitting · Model editing via direct subcomponent overwrite · Manual model editing · Rank-one matrix decomposition · Mechanistic faithfulness · Sparse Autoencoders (SAE) · Mechanistic Interpretability · +8 more

TL;DR

VPD (adVersarial Parameter Decomposition) decomposes weight matrices directly into rank-one, interpretable subcomponents rather than decomposing activations as sparse autoencoders (SAEs) do, flipping the standard substrate of mechanistic interpretability. Applied to a 67M-parameter, 4-layer transformer trained on The Pile, VPD produces subcomponents that are sparse and interpretable while avoiding the feature-splitting artifacts characteristic of SAE-based approaches, and it achieves a better sparsity-reconstruction tradeoff than transcoders, the activation-based baseline it is compared against. The adversarial ablation procedure that enforces mechanistic faithfulness is the method's core technical innovation: subcomponents must survive targeted ablation pressure, ensuring they correspond to genuine computational roles rather than statistical artifacts. Attention computations are shown to distribute across heads through parameter subcomponents with distinct, legible functional roles. Because VPD operates on weights rather than activations, it also enables manual model editing through direct parameter manipulation, a capability not available to activation-patching or SAE-steering pipelines. The paper argues that the SAE program's foundational choice, treating activations as the unit of analysis, is not merely one option among many but an unnecessary constraint, and that parameter-level decomposition may be the more faithful path to understanding what a model has learned. Whether VPD generalizes to frontier scale is left as the central open question.

What to take away

1. VPD (adVersarial Parameter Decomposition) decomposes weight matrices into rank-one subcomponents rather than decomposing activations, inverting the substrate assumption of SAE-based mechanistic interpretability.
2. The method was validated on a 67M-parameter, 4-layer transformer trained on The Pile, which is the largest model reported in this work.
3. VPD achieves a better sparsity-reconstruction tradeoff than transcoders, the activation-based decomposition method used as the comparison baseline.
4. Adversarial ablation — iteratively removing subcomponents and penalizing those whose absence does not degrade targeted computations — is the core training signal that enforces mechanistic faithfulness in VPD.
5. Attention computations in the 4-layer model distribute across heads via VPD subcomponents that carry distinct, human-interpretable functional roles, a finding not straightforwardly accessible through activation-level analysis.
6. VPD avoids feature splitting, a known failure mode of SAEs in which a single functional feature is fragmented across multiple dictionary elements.
7. Because subcomponents are defined in weight space, VPD enables manual model editing through direct parameter manipulation rather than the activation-space interventions used in feature steering.
8. To replicate the adversarial ablation protocol, a researcher would train the decomposition on a frozen model, iteratively identify subcomponents whose ablation minimally affects targeted outputs, and use that signal as a faithfulness regularizer while optimizing the decomposition.
9. An open question the paper raises is whether VPD's sparsity and interpretability properties survive at frontier scale, given that all reported experiments are at 67M parameters and 4 layers.
10. The paper's critique of SAEs operates at the substrate level — arguing that decomposing weights rather than activations is a more principled unit of analysis — complementing geometry-level critiques that SAEs shatter activation manifolds.
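
To make the rank-one and ablation ideas in the takeaways concrete, here is a minimal sketch in plain Python. It is a toy illustration under stated assumptions, not the paper's implementation: the helper names (`outer`, `masked_sum`), the 2x3 weight matrix, and the three subcomponents are all hypothetical. It shows a weight matrix written as a sum of rank-one terms, and ablation as setting one term's mask to zero.

```python
# Toy illustration only: names and values are assumptions, not from the paper.

def outer(u, v):
    """Rank-one matrix u v^T, returned as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def masked_sum(components, masks):
    """Sum of mask_c * (u_c v_c^T) over all subcomponents c."""
    rows, cols = len(components[0][0]), len(components[0][1])
    total = [[0.0] * cols for _ in range(rows)]
    for (u, v), m in zip(components, masks):
        for i, row in enumerate(outer(u, v)):
            for j, value in enumerate(row):
                total[i][j] += m * value
    return total

# Three hypothetical rank-one subcomponents (u_c, v_c) of a 2x3 weight matrix.
components = [
    ([1.0, 0.0], [1.0, 2.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0, 1.0]),
    ([1.0, 1.0], [0.5, 0.0, 0.5]),
]

W_full = masked_sum(components, [1.0, 1.0, 1.0])     # every subcomponent active
W_ablated = masked_sum(components, [1.0, 1.0, 0.0])  # third subcomponent ablated
```

With all masks set to one the sum recovers the full matrix; zeroing a mask removes exactly that subcomponent's rank-one contribution and leaves the others untouched.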

Peer brief — for seminar discussion

Working at Goodfire in collaboration with MATS and independent researchers, Bushnaq et al. introduce VPD (adVersarial Parameter Decomposition), a method that decomposes a model's weight matrices into rank-one interpretable subcomponents rather than decomposing its activations as sparse autoencoders do. All experiments are conducted on a 67M-parameter, 4-layer transformer trained on The Pile. The central technical mechanism is an adversarial ablation procedure: during decomposition, subcomponents are subjected to targeted ablation pressure, and those whose removal does not degrade specific computations are penalized, forcing the retained components to correspond to genuine functional roles rather than statistical decomposition artifacts. This distinguishes VPD from transcoders, the activation-based baseline against which VPD reports a better sparsity-reconstruction tradeoff. The load-bearing finding is that weight-space decomposition yields sparse, interpretable subcomponents without feature splitting, and that attention computations in the tested model distribute across heads through subcomponents with legible roles, a structure that activation-patching pipelines would not directly expose. Because subcomponents live in weight space, VPD also enables manual model editing through direct parameter manipulation, which is qualitatively different from activation steering or SAE feature clamping. The paper argues that the SAE program's choice to treat activations as the unit of analysis is an unnecessary constraint, not a principled default, and predicts that parameter-level decomposition is a more faithful path to understanding learned computation. The central question left open is whether these properties survive at frontier scale; Goodfire's concurrent infrastructure work suggests this is their working direction.
A critical reader would push back most forcefully on the scale limitation: a 67M-parameter, 4-layer model is far smaller and shallower than the multi-billion-parameter, many-layer transformers where interpretability matters most, and rank-one decomposition of weight matrices may face scaling challenges (in expressivity, in the number of required components, or in the computational cost of the adversarial ablation procedure) that are simply not visible at 4 layers. The paper does not provide ablations across model sizes, so the sparsity-reconstruction advantage over transcoders cannot yet be attributed to anything other than the specific small-model regime tested. An alternative the work could have benchmarked against more directly is activation patching, including its finer-grained variant path patching, which also aims at mechanistic faithfulness but operates on activations rather than weights, and whose tradeoffs against VPD at matched model size remain uncharacterized.
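
The adversarial search described in this brief can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the function names (`forward`, `worst_case_ablation`), the hand-written `predicted_important` flags standing in for a learned importance predictor, and the brute-force enumeration standing in for whatever optimizer the paper actually uses. The sketch looks for the ablation of supposedly unimportant subcomponents that changes the output the most; a nonzero worst-case error means the importance prediction was wrong.

```python
from itertools import product

# Toy illustration only: every name and number here is an assumption.

def forward(components, masks, x):
    """Compute y = (sum_c m_c * u_c v_c^T) x without forming the matrix."""
    y = [0.0] * len(components[0][0])
    for (u, v), m in zip(components, masks):
        proj = sum(vj * xj for vj, xj in zip(v, x))  # scalar v_c . x
        for i, ui in enumerate(u):
            y[i] += m * ui * proj
    return y

def worst_case_ablation(components, predicted_important, x):
    """Brute-force the on/off pattern over subcomponents predicted unimportant
    that maximizes squared output error; 'important' ones stay switched on."""
    full = forward(components, [1.0] * len(components), x)
    free = [c for c, keep in enumerate(predicted_important) if not keep]
    worst_err, worst_masks = 0.0, [1.0] * len(components)
    for choice in product([0.0, 1.0], repeat=len(free)):
        masks = [1.0] * len(components)
        for c, m in zip(free, choice):
            masks[c] = m
        y = forward(components, masks, x)
        err = sum((a - b) ** 2 for a, b in zip(full, y))
        if err > worst_err:
            worst_err, worst_masks = err, masks
    return worst_err, worst_masks

components = [
    ([1.0, 0.0], [1.0, 2.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0, 1.0]),
    ([1.0, 1.0], [0.5, 0.0, 0.5]),
]
x = [1.0, 0.0, 0.0]

# Wrong prediction: subcomponent 2 is flagged unimportant but does affect the output.
bad_err, bad_masks = worst_case_ablation(components, [True, False, False], x)
# Better prediction: only subcomponent 1, which is inactive on this input, is flagged.
good_err, good_masks = worst_case_ablation(components, [True, False, True], x)
```

In a training loop, the worst-case error would presumably serve as a penalty on the decomposition and its importance predictions; the enumeration here is exponential in the number of free subcomponents and is only viable for toy sizes.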

Methods (6)

adVersarial Parameter Decomposition (VPD)
Core technique introduced in this paper for decomposing neural network weight matrices into mechanistically simple, interpretable rank-one subcomponents.
Adversarial search for causally unimportant subcomponents
Procedure in VPD that actively searches for combinations that break the prediction of which subcomponents are unimportant, stress-testing the decomposition.
Attribution graph construction
Method to trace how parameter subcomponents interact from input to output for a given next-token prediction, producing a subnetwork graph.
Model editing via direct subcomponent overwrite
Technique to alter model behavior by directly editing a parameter subcomponent without training, demonstrated by changing an emoticon eye subcomponent.
Rank-one matrix decomposition
Constraint in VPD where each parameter subcomponent is constrained to be a rank-one matrix for simplicity.
Sparse Autoencoders (SAE)
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
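
As a rough sketch of what editing a model by overwriting one subcomponent could look like, the example below replaces the output direction of a single rank-one term and reassembles the weight matrix. It is not the paper's emoticon experiment: the helper names (`assemble`, `matvec`), the matrix, and the edit itself are hypothetical, chosen only to show that such an edit changes the weights by a rank-one update and leaves inputs orthogonal to the edited read direction unaffected.

```python
# Toy illustration only: names, shapes, and the edit are assumptions.

def assemble(components):
    """Weight matrix as the sum of rank-one terms u_c v_c^T."""
    rows, cols = len(components[0][0]), len(components[0][1])
    W = [[0.0] * cols for _ in range(rows)]
    for u, v in components:
        for i, ui in enumerate(u):
            for j, vj in enumerate(v):
                W[i][j] += ui * vj
    return W

def matvec(W, x):
    """Plain matrix-vector product."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

components = [
    ([1.0, 0.0], [1.0, 2.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0, 1.0]),
    ([1.0, 1.0], [0.5, 0.0, 0.5]),
]
W_before = assemble(components)

# Hypothetical edit: keep subcomponent 1's read direction v, overwrite its
# write direction u so that it now pushes the second output the other way.
edited = list(components)
_, v_read = edited[1]
edited[1] = ([0.0, -1.0], v_read)
W_after = assemble(edited)

x_unrelated = [1.0, 0.0, 0.0]  # orthogonal to v_read, so the edit is invisible
x_triggering = [0.0, 1.0, 0.0]  # overlaps v_read, so the output changes
```

No gradient step is involved: the edit is a direct write to one factor of the decomposition, followed by re-summing the terms.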

Datasets (1)

The Pile
Training corpus used for the 67M-parameter model tested with VPD.

Findings (3)

Attention computations distribute across heads via parameter subcomponents with interpretable roles
Mechanistic discovery about how attention mechanisms decompose into interpretable parameter components.
VPD achieves better sparsity-reconstruction tradeoff than transcoders on 67M model
Empirical result demonstrating VPD's efficiency advantage in parameter decomposition.
VPD scales to a 4-layer 67M-parameter model trained on The Pile.
Empirical demonstration of VPD on a small transformer, establishing feasibility at this scale.

Claims (6)

VPD decomposes parameters, not activations, flipping the standard SAE / activation-patching paradigm.
Core proposition of the paper: a substrate-level critique of existing interpretability methods.
VPD subcomponents avoid feature splitting, improving interpretability over the SAE approach.
Core interpretative claim that VPD's parameter-based decomposition prevents the feature fragmentation seen in activation-based methods.
VPD subcomponents are sparse, interpretable, and avoid feature splitting.
Assertion about the qualitative advantages of VPD's rank-one decomposition.
VPD achieves a better sparsity-reconstruction tradeoff than transcoders.
Quantitative advantage claimed for VPD over a prior activation-decomposition method.
VPD enables manual model editing through direct parameter manipulation.
Applied capability claim: VPD enables surgical changes to model behaviour at the parameter level.
Adversarial ablations enforce mechanistic faithfulness.
Methodological claim that the adversarial ablation approach ensures decomposed components causally correspond to computation.

Questions (1)

Do VPD's mechanistic faithfulness and interpretability survive at frontier model scale?
Open research question about whether VPD generalizes beyond the tested 67M-parameter regime.
