Paper Summary: Interpreting Language Model Parameters


TL;DR

Adversarial Parameter Decomposition (VPD) decomposes a language model's weight matrices directly into rank-one subcomponents, recovering ~10,000 interpretable parameter pieces across all 24 matrices of a 67M-parameter language model without supervision. Unlike sparse autoencoders, which operate on activations rather than weights, VPD uses gradient descent jointly over the subcomponents and an auxiliary causal importance network that predicts, per prompt, which subcomponents are causally necessary to reproduce the model's behavior. Applied to attention layers — historically the hardest target for mechanistic interpretability — VPD recovers algorithms distributed across multiple attention heads, including previous-token behavior and syntax-boundary routing, identifying them as pairs of query and key subcomponents acting in concert. The model-editing demonstration is directly diagnostic: locating a single subcomponent responsible for emoticon-eye recognition and replacing its output with the unembedding vector for 'o' produces a model that predicts all emoticons as shocked faces, with off-target effects comparable to fine-tuning. Attribution graphs over the recovered subnetwork for a gendered-pronoun prediction task reveal two interpretable pathways — a femaleness signal propagated from 'princess' and a verb-triggered object-pronoun upweighting from 'lost'. The paper argues this constitutes evidence that language model parameters are not irreducibly complex, and that bottom-up interpretability grounded in the model's own computational structure, rather than human-imposed abstractions, is achievable and scalable.
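A minimal sketch of the two ingredients described above: rank-one pieces that sum exactly to a weight matrix, and a per-input mask that keeps only the pieces deemed necessary. The SVD is only a stand-in for VPD's learned decomposition, `causal_importance` is a hand-written placeholder for the auxiliary network, and all names and sizes are illustrative assumptions rather than anything taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 6, 4
W = rng.normal(size=(d_out, d_in))  # toy weight matrix (assumption)

# Stand-in decomposition: the SVD yields rank-one pieces s_c * u_c v_c^T that
# sum exactly to W. VPD instead learns its pieces by gradient descent.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
components = [S[c] * np.outer(U[:, c], Vt[c]) for c in range(len(S))]

assert np.allclose(sum(components), W)                       # exact sum
assert all(np.linalg.matrix_rank(C) == 1 for C in components)

def masked_forward(x, mask):
    """Apply W with each rank-one piece scaled by mask[c] in [0, 1]."""
    W_masked = sum(m * C for m, C in zip(mask, components))
    return W_masked @ x

def causal_importance(x, threshold=0.5):
    """Placeholder for the auxiliary network: keep the pieces whose
    contribution to the output on this input is relatively large."""
    norms = np.array([np.linalg.norm(C @ x) for C in components])
    return (norms >= threshold * norms.max()).astype(float)

x = rng.normal(size=d_in)
mask = causal_importance(x)
full = masked_forward(x, np.ones(len(components)))
sparse = masked_forward(x, mask)
print("components kept:", int(mask.sum()), "of", len(components))
print("output error after ablation:", np.linalg.norm(full - sparse))
```

With every mask entry set to 1 the masked layer reproduces `W @ x` exactly, which is the point of the sum constraint; the interesting question is how many entries can be set to 0 on a given input before the output degrades.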

What to take away

1. VPD (Adversarial Parameter Decomposition) decomposes all 24 weight matrices of a 67M-parameter language model into approximately 10,000 rank-one subcomponents via unsupervised gradient descent.
2. The rank-one subcomponents of each weight matrix are constrained to sum exactly to that matrix, so the decomposition is a faithful partition of the model's parameters rather than an approximation.
3. An auxiliary causal importance network is trained jointly with the subcomponents to predict, per prompt, the minimum subset of subcomponents causally necessary to reproduce the target model's output.
4. Adversarial stress-testing actively searches for prompt combinations that break the causal importance network's predictions of which subcomponents are irrelevant, hardening the reliability of the causal labels.
5. VPD recovers attention algorithms distributed across multiple attention heads — including previous-token behavior and syntax-boundary routing — as pairs of query and key subcomponents, a capability that prior mechanistic interpretability methods could not achieve automatically.
6. A single identified subcomponent handling emoticon-eye recognition (e.g., ';', ':', '=') was edited by replacing its output with the model's unembedding vector for 'o', producing a model predicting all emoticons as shocked faces with off-target effects comparable to fine-tuning.
7. The subcomponent at index L2.MLP.down:3382 has a reported activation density of 0.00% (presumably a very small nonzero value that rounds to zero at two decimals), yet it reliably fires on punctuation characters such as ':', ';', and '=' that precede text-based emoticons, illustrating extreme specialization within the decomposition.
8. An attribution graph over the subnetwork for predicting 'her' in 'the princess lost her crown' reveals two interpretable pathways: one routing a femaleness signal from 'princess' via attention and one upweighting object pronouns from the verb 'lost'.
9. An open question the paper raises is whether VPD's rank-one constraint and causal importance framework will remain sufficient to recover interpretable structure when scaled to models with billions rather than millions of parameters.
10. To replicate the decomposition pipeline, a researcher would jointly train rank-one subcomponents summing to original weights and an auxiliary causal importance network using gradient descent, then adversarially perturb causally-unimportant subcomponents to stress-test the selection.
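The stress-testing step in items 4 and 10 can be illustrated on a single linear layer. This is a sketch under loudly stated assumptions, not the paper's procedure: it follows the wording of item 10 (perturbing causally-unimportant subcomponents), the adversary is plain projected gradient ascent, each mask value is allowed to range over [g_c, 1] where g_c is the predicted causal importance, and the "behavior" is just the layer's output on one input. The arrays `U`, `V`, `g` and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_comp, d_out, d_in = 8, 5, 5
U = rng.normal(size=(n_comp, d_out))   # output directions u_c (toy)
V = rng.normal(size=(n_comp, d_in))    # input directions v_c (toy)
x = rng.normal(size=d_in)

# Contribution of each rank-one piece to the output: a_c = u_c * (v_c . x)
A = U * (V @ x)[:, None]
target = A.sum(axis=0)                 # output with every piece switched on

# Hypothetical causal-importance predictions: 1 = needed, 0 = "irrelevant".
g = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=float)

def loss(mask):
    """Squared error between the masked output and the full output."""
    return float(np.sum((mask @ A - target) ** 2))

def adversarial_mask(start, g, steps=200, lr=0.05):
    """Projected gradient ascent on the loss over masks in the box [g, 1]."""
    mask = start.copy()
    for _ in range(steps):
        residual = mask @ A - target
        grad = 2 * A @ residual          # d loss / d mask_c
        mask = np.clip(mask + lr * grad, g, 1.0)
    return mask

start = g + (1 - g) * rng.uniform(size=g.shape)  # random mask inside the box
adv = adversarial_mask(start, g)
print("loss at random mask:     ", loss(start))
print("loss at adversarial mask:", loss(adv))
```

In this toy the pieces labelled irrelevant do contribute to the output, so the adversary finds a mask with a large loss; that is the kind of false negative the stress test is meant to expose. If the labels were correct, no mask inside the box would move the output much.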

Peer brief — for seminar discussion

Goodfire's VPD paper attacks a problem that activation-centric interpretability sidesteps entirely: what computational structure actually lives inside a language model's weight matrices, as opposed to in its runtime activations. The method it introduces, Adversarial Parameter Decomposition (VPD), decomposes each weight matrix into rank-one subcomponents that must sum exactly to the original matrix. The subcomponents are trained via gradient descent to be causally important as rarely as possible, and they are jointly optimized with an auxiliary causal importance network that predicts, per prompt, which subcomponents are necessary. An adversarial loop then stress-tests those causal labels by searching for prompt combinations that expose false negatives. The entire pipeline is applied to a 67M-parameter language model, decomposing all 24 weight matrices into roughly 10,000 subcomponents.

The load-bearing finding is that these subcomponents are interpretable and functionally specific. The emoticon-eye subcomponent (L2.MLP.down:3382, reported at 0.00% activation density) fires exclusively on punctuation like ':', ';', and '=' that precedes emoticons. Editing this single subcomponent to output the unembedding vector for 'o' shifts the model to predict shocked-face emoticons universally, with off-target collateral damage comparable to fine-tuning. This is a direct demonstration that VPD isolates real computational machinery rather than epiphenomenal structure. On attention layers, VPD recovers previous-token and syntax-boundary-routing algorithms distributed across multiple heads, something prior automatic methods reportedly could not do. Attribution graphs for the pronoun prediction in 'the princess lost her crown' expose two interpretable pathways in the recovered subnetwork. The implication is that parameter space contains latent, decomposable algorithmic structure, and that bottom-up interpretability (explaining computation in the network's own terms) is tractable.

The authors predict future versions of VPD will scale to larger models while remaining architecturally identical in spirit, and frame the long-run goal as understanding networks well enough to design them intentionally.

A critical reader would push back on the rank-one constraint as a strong architectural prior. Requiring subcomponents to be rank-one guarantees mechanistic simplicity by construction, but it is not obvious that the model's true functional units respect this constraint; the decomposition may be recovering interpretable projections of higher-rank structures rather than the structures themselves. An alternative approach, using sparse dictionary learning directly on weight matrices rather than on activations and without the rank-one restriction, could test whether the interpretability results survive relaxing that assumption. The paper also demonstrates editing and attribution only on one 67M-parameter model trained on IRC/forum text, leaving open whether the ~10,000-component count or the semantic specificity of components generalizes to instruction-tuned or RLHF-trained models at the scale where safety-relevant behaviors actually reside.
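The editing experiment can be mimicked on a toy down-projection to make the mechanics concrete. Everything below is an illustrative assumption: the four-token vocabulary, the identity unembedding, the three hand-picked rank-one pieces, and the choice to rescale the new output direction to the norm of the old one. It shows only the general idea of swapping one piece's output direction for an unembedding vector, not the paper's actual edit.

```python
import numpy as np

vocab = [")", "(", "o", "D"]
W_U = np.eye(4)  # toy unembedding: row t is the residual direction read as token t

# Rank-one pieces u_c v_c^T of a toy down-projection (d_model=4, d_hidden=3).
us = np.array([[2.0, 1.0, 0.0, 0.5],   # piece 0: hypothetical "emoticon-eye" piece
               [0.0, 0.0, 0.0, 1.0],
               [0.3, 0.0, 0.2, 0.0]])
vs = np.eye(3)                          # piece c reads hidden unit c

def build(us, vs):
    """Sum the rank-one pieces back into a full weight matrix."""
    return sum(np.outer(u, v) for u, v in zip(us, vs))

def predict(W_down, h):
    """Return the top token for hidden activation h."""
    logits = W_U @ (W_down @ h)
    return vocab[int(np.argmax(logits))]

W_down = build(us, vs)

# Edit: replace piece 0's output direction with the unembedding vector for 'o',
# rescaled to the old direction's norm (the rescaling is an assumption).
us_edit = us.copy()
o_dir = W_U[vocab.index("o")]
us_edit[0] = np.linalg.norm(us[0]) * o_dir / np.linalg.norm(o_dir)
W_edit = build(us_edit, vs)

h_eye = np.array([1.0, 0.1, 0.2])    # input on which piece 0 is active
h_other = np.array([0.0, 1.0, 0.5])  # input on which piece 0 is silent
print(predict(W_down, h_eye), "->", predict(W_edit, h_eye))      # ) -> o
print(predict(W_down, h_other), "->", predict(W_edit, h_other))  # D -> D
```

Here the off-target effect is exactly zero because the toy pieces read disjoint hidden units. The summary reports nonzero off-target effects comparable to fine-tuning for the real model, where pieces are not so cleanly separated.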

Findings (12)

Subnetwork for predicting 'her' vs 'his' in 'the princess lost her crown' involves femaleness signal routing via attention and syntactic role detection
Detailed case study demonstrating how VPD subnetworks can be traced to reveal multiple interpretable computational pathways for a single prediction.
Attribution graph tracing information flow across parameter subcomponents for specific model predictions (e.g., 'her' vs 'his' pronoun selection)
Shows how VPD-identified subnetworks can be analyzed to reveal interpretable pathways of computation (e.g., gender signal routing, syntactic role detection).
Editing the emoticon eye subcomponent to output the unembedding vector for 'o' causes the model to predict shocked faces for all emoticons
Direct parameter subcomponent overwrite produces a clean behavioral change without training.
Direct model editing via parameter subcomponent modification—emoticon eye recognition altered to predict shocked faces with no retraining
Demonstrated that VPD-discovered subcomponents encode true computational machinery by enabling targeted, predictable behavior changes without gradient-based training.
Attribution graph for 'the princess lost her crown' reveals a femaleness signal pathway from 'princess' through attention
One component of the minimal subnetwork for predicting 'her', discovered via VPD attribution graph.
Subcomponent L2.MLP.down:3382 (density 0.00%) predicts emoticon continuations after colon, semicolon, or equals
Specific discovered subcomponent that activates on punctuation like ' :', ' ;', ' =', ':-' and predicts the rest of emoticons/emojis.
A pair of query and key subcomponents distributed across attention heads performs previous-token behavior
VPD recovers an attention algorithm for attending to the previous token, distributed across multiple heads.
A pair of query and key subcomponents distributed across attention heads performs syntax-boundary routing
VPD recovers an attention algorithm for routing across syntactic boundaries, distributed across heads.
Decomposition of all 24 weight matrices in a 67M-parameter LM yields ~10,000 parameter subcomponents
Quantitative result of VPD application; the network's 24 matrices decompose into approximately 10,000 rank-one subcomponents.
Identification of algorithms implemented in attention layers, distributed across attention heads
VPD successfully recovered interpretable attention algorithms (previous-token behavior, syntax-boundary routing) in weight space without requiring manual decomposition across heads.
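One way to see why a query/key subcomponent pair can be spread over several heads is to write out the attention score it contributes. The sketch below assumes the common layout in which heads are contiguous column slices of the query and key matrices, uses a row-vector convention, and draws all vectors at random; the names `u_q`, `v_q`, `u_k`, `v_k` are hypothetical. Under those assumptions, a rank-one query piece and a rank-one key piece contribute (x_i . u_q)(x_j . u_k)(v_q[h] . v_k[h]) / sqrt(d_head) to the pre-softmax score of head h, so the per-head weight is set by how the two output vectors overlap within each head's slice.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_head = 16, 4, 8

# One hypothetical query piece and one key piece, each rank one.
u_q, v_q = rng.normal(size=d_model), rng.normal(size=n_heads * d_head)
u_k, v_k = rng.normal(size=d_model), rng.normal(size=n_heads * d_head)
Wq_piece = np.outer(u_q, v_q)   # shape (d_model, n_heads * d_head)
Wk_piece = np.outer(u_k, v_k)

# Residual-stream vectors at a query position i and a key position j (toy).
x_i, x_j = rng.normal(size=d_model), rng.normal(size=d_model)

def per_head_scores(Wq, Wk, x_query, x_key):
    """Pre-softmax attention score in each head for one query/key pair."""
    q = (x_query @ Wq).reshape(n_heads, d_head)
    k = (x_key @ Wk).reshape(n_heads, d_head)
    return (q * k).sum(axis=1) / np.sqrt(d_head)

scores = per_head_scores(Wq_piece, Wk_piece, x_i, x_j)

# Closed form: the pair's weight in head h is v_q[h] . v_k[h].
head_weights = (v_q.reshape(n_heads, d_head) * v_k.reshape(n_heads, d_head)).sum(axis=1)
closed = (x_i @ u_q) * (x_j @ u_k) * head_weights / np.sqrt(d_head)

assert np.allclose(scores, closed)
print("per-head weight of this query/key pair:", np.round(head_weights, 2))
```

Nothing in the rank-one form ties a piece to a single head: unless `v_q` and `v_k` happen to be zero outside one slice, the same pair shapes the scores of several heads at once.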

Claims (16)

Activation-based interpretability does not immediately explain the computations that gave rise to activations; understanding parameters is necessary for deeper insight
Motivates shift from studying model activations ('thoughts') to understanding parameters ('the computations themselves').
A good parameter subcomponent is causally important only for specific roles and can be removed from the model without hurting performance on irrelevant prompts
Definitional principle guiding VPD: subcomponents should encode narrow, targeted computational roles rather than distributed, multi-purpose functionality.
Neurons can correspond to interpretable functional roles but interpretations in terms of individual neurons are unlikely to be the most parsimonious
Claim from footnote 3, acknowledging neuron-level interpretability while arguing subcomponents are better.
The field of interpretability has focused mainly on understanding model activations, not the computations themselves
Motivation for VPD's parameter-focused approach.
Bottom-up interpretability explains computation in the model's own terms rather than imposing top-down abstractions
VPD is positioned as advancing a paradigm shift from top-down mechanistic interpretability (activation-based) to parameter-centric, data-driven discovery.
Sparse autoencoders don't provide a comprehensive solution because they decode activations, not parameters
Critique of activation-based interpretability methods.
The ability to make precise edits demonstrates that VPD identifies real computational machinery
Claim that editing success validates VPD's decomposition.
Language models implement algorithms humans have tried and failed to write by hand for decades
Opening interpretive claim about the remarkable nature of language models.
The rank-one constraint on each subcomponent enforces mechanistic simplicity
Core design principle of VPD: each parameter subcomponent is constrained to be a simple rank-one matrix to enable isolated understanding and combination.
VPD identifies real, computational structure in neural network parameters
Central claim that VPD successfully uncovers genuine mechanisms.

Hypotheses (1)

Language models contain interpretable computational structure encoded in their parameter weights, not irreducibly impenetrable complexity
Core empirical hypothesis of the paper, supported by successful VPD decomposition yielding ~10,000 interpretable subcomponents across 24 weight matrices.

Questions (3)

How can mechanistic interpretability methods automatically identify attention computations that span multiple attention heads?
Long-standing bottleneck in mechanistic interpretability that VPD addresses by working natively on attention weight matrices.
Can we use these parameter subcomponents to perform clean, targeted changes?
Implicit question driving the editing experiment.
Are the resulting parameter subcomponents actually interpretable objects?
First question posed after applying VPD, investigating whether the subcomponents make sense.

Original abstract

This paper introduces Adversarial Parameter Decomposition (VPD), a technique for interpreting language model parameters by decomposing weight matrices into simple, understandable rank-one components. By splitting the model's parameters into interpretable pieces, the authors can identify algorithms implemented in attention layers, edit model behavior without retraining, and recover small subnetworks responsible for specific behaviors. The approach represents a bottom-up form of interpretability that explains computation in the model's own terms rather than imposing external abstractions.
