claim

active

claim:vpd-decomposes-parameters-not-activations-flipping-the-standard-sae-activation-patching-paradigm

VPD decomposes parameters, not activations, flipping the standard SAE / activation-patching paradigm.

Core proposition of the paper: a substrate-level critique of existing interpretability methods.
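
As a toy illustration of what decomposing parameters (rather than activations) means, the sketch below writes a weight matrix as a sum of rank-one subcomponents and ablates one of them directly in parameter space. The subcomponent vectors, the mask values, and the helper names (`outer`, `masked_sum`, `matvec`) are hypothetical and hand-picked for the example. This is not the paper's training procedure, and the adversarial ablation scheme is not reproduced here.

```python
# Toy sketch only: hand-picked rank-one subcomponents, not a learned VPD decomposition.

def outer(u, v):
    """Return the rank-one matrix u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def masked_sum(matrices, masks):
    """Return sum_c masks[c] * matrices[c], computed elementwise."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(m * mat[i][j] for m, mat in zip(masks, matrices)) for j in range(cols)]
        for i in range(rows)
    ]

def matvec(W, x):
    """Return the matrix-vector product W x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

# Hypothetical subcomponents (u_c, v_c), so that W = sum_c u_c v_c^T.
subcomponents = [
    ([1.0, 0.0], [1.0, 0.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0, 0.0]),
    ([1.0, 1.0], [0.0, 0.0, 1.0]),
]
rank_one = [outer(u, v) for u, v in subcomponents]

x = [2.0, 3.0, 5.0]

# Full model: every subcomponent is active (mask = 1).
W_full = masked_sum(rank_one, [1.0, 1.0, 1.0])
y_full = matvec(W_full, x)

# Parameter-space ablation: switch off the third subcomponent (mask = 0).
# The intervention changes the weights; no activation is overwritten.
W_ablated = masked_sum(rank_one, [1.0, 1.0, 0.0])
y_ablated = matvec(W_ablated, x)

print(y_full)     # [7.0, 8.0]
print(y_ablated)  # [2.0, 3.0]
```

Activation patching would instead leave `W_full` untouched and overwrite entries of an intermediate activation such as `y_full`. Here the edit is applied to the weights, and the change in output follows from it.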

Source paper

extracted_from

(2026) · Bushnaq, Lucius · Braun, Dan · Clive-Griffin, Oliver · Bussmann, Bart +4

method

Sparse Autoencoders (SAE)
cites
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
Activation patching
cites
Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Decompose parameters, not activations · quote · 0.797
Core slogan encapsulating the paradigm shift of VPD.
VPD enables manual model editing through direct parameter manipulation. · claim · 0.794
Applied capability claim: VPD enables surgical changes to model behaviour at the parameter level.
VPD subcomponents avoid feature splitting, improving interpretability over the SAE approach · claim · 0.778
Core interpretative claim that VPD's parameter-based decomposition prevents the feature fragmentation seen in activation-based methods.
VPD (adVersarial Parameter Decomposition) · concept · 0.774
Core methodological framework introduced in this paper; decomposes weight matrices into rank-one interpretable subcomponents using adversarial ablations.
VPD scales to a 4-layer 67M-parameter model trained on The Pile. · finding · 0.772
Empirical demonstration of VPD on a mid-scale transformer, establishing feasibility.
VPD subcomponents are sparse, interpretable, and avoid feature splitting. · claim · 0.769
Assertion about the qualitative advantages of VPD's rank-one decomposition.
The VPD-based edit has off-target effects as low as those of uninterpretable fine-tuning methods · finding · 0.763
Performance comparison showing subcomponent editing is comparable to fine-tuning in preserving off-target behavior.
Parameter Decomposition (vs Activation) · concept · 0.763