claim
active
claim:vpd-decomposes-parameters-not-activations-flipping-the-standard-sae-activation-patching-paradigm
VPD decomposes parameters, not activations, flipping the standard SAE / activation-patching paradigm.
Core proposition of the paper: a substrate-level critique of existing interpretability methods.
Source paper
extracted_from · (2026) · Bushnaq, Lucius · Braun, Dan · Clive-Griffin, Oliver · Bussmann, Bart +4
Neighborhood — ranked by edge-count
Methods (2)
method
- Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
- Activation patching · cites · Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Core slogan encapsulating the paradigm shift of VPD.
- Applied capability claim: VPD enables surgical changes to model behaviour at the parameter level.
- Core interpretative claim that VPD's parameter-based decomposition prevents the feature fragmentation seen in activation-based methods.
- Core methodological framework introduced in this paper; decomposes weight matrices into rank-one interpretable subcomponents using adversarial ablations.
- Empirical demonstration of VPD on a mid-scale transformer, establishing feasibility.
- Assertion about the qualitative advantages of VPD's rank-one decomposition.
- The VPD-based edit has similarly low off-target effects as uninterpretable fine-tuning methods · finding · 0.763 · Performance comparison showing subcomponent editing is comparable to fine-tuning in preserving off-target behavior.
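
Illustrative sketch (not from the paper): the entries above describe decomposing weight matrices into rank-one subcomponents and editing behaviour at the parameter level. The snippet below shows only the general idea of a rank-one split of a weight matrix followed by a parameter-space ablation. It uses a plain SVD as a stand-in; VPD's actual procedure, including its adversarial ablations, is not reproduced here. The names `rank_one_components`, `ablate`, `W`, and `x` are hypothetical.

```python
import numpy as np


def rank_one_components(W):
    """Split W into rank-one matrices via SVD (a stand-in, not VPD itself)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Component i is s_i * u_i v_i^T; summing all of them reconstructs W.
    return [S[i] * np.outer(U[:, i], Vt[i]) for i in range(len(S))]


def ablate(W, components, idx):
    """Return a copy of W with one rank-one component removed."""
    return W - components[idx]


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # hypothetical layer weights
x = rng.normal(size=3)        # hypothetical input activation

components = rank_one_components(W)
W_edit = ablate(W, components, idx=0)

# The components sum back to the original weights.
reconstruction_ok = np.allclose(sum(components), W)
# Editing the weights shifts the output by exactly the ablated
# component's contribution, without touching the activation x.
shift_ok = np.allclose(W_edit @ x, W @ x - components[0] @ x)
```

The intervention here acts on `W` rather than on `x` or on the layer output, which is the contrast with activation patching that this claim draws.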