Mechanistic editing through surgical parameter intervention

Direct modification of model subcomponents (MLPs, embeddings, unembedding vectors), each constrained to rank one, to alter outputs predictably without retraining.

4 members.

Drawn from 1 source

The papers/notes whose extracted claims & findings make up this cluster.

Paper Summary: Interpreting Language Model Parameters (4 members)

Bridges (2)

Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.

Few-shot anchoring & latent structure (4 shared)
Targeted neural network weight surgery (2 shared)

Findings (3)

Direct model editing via parameter subcomponent modification: emoticon eye recognition altered to predict shocked faces with no retraining
Demonstrated that VPD-discovered subcomponents encode true computational machinery by enabling targeted, predictable behavior changes without gradient-based training.

Editing the emoticon eye subcomponent to output the unembedding vector for 'o' causes the model to predict shocked faces for all emoticons
Directly overwriting a parameter subcomponent produces a clean behavioral change without training.

Subcomponent L2.MLP.down:3382 (density 0.00%) predicts emoticon continuations after a colon, semicolon, or equals sign
A specific discovered subcomponent that activates on punctuation such as ' :', ' ;', ' =', and ':-' and predicts the rest of the emoticon or emoji.
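
The edit described in these findings can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the matrix shapes, the toy weights, the token id `O_TOKEN`, the norm-matching rescale, and the helper `edit_rank_one` are all hypothetical, chosen only to show what swapping the output direction of one rank-one subcomponent does to the logits.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp, vocab = 16, 32, 10

# Toy stand-ins (all hypothetical): an unembedding matrix and a token id for 'o'.
W_U = rng.normal(size=(vocab, d_model))  # row t is the unembedding vector of token t
O_TOKEN = 3

# One rank-one subcomponent u v^T: v reads from the MLP hidden activations,
# u writes into the residual stream.
u = rng.normal(size=d_model)
v = rng.normal(size=d_mlp)

# Assumption: the down-projection is this subcomponent plus everything else.
rest = rng.normal(size=(d_model, d_mlp))
W_down = rest + np.outer(u, v)


def edit_rank_one(W, u_old, v_read, u_new):
    """Swap one subcomponent's output direction; the rest of W is untouched."""
    return W - np.outer(u_old, v_read) + np.outer(u_new, v_read)


# New output direction: the 'o' unembedding vector, rescaled to the old norm
# (the rescaling is an illustrative choice, not something the source specifies).
u_new = W_U[O_TOKEN] * (np.linalg.norm(u) / np.linalg.norm(W_U[O_TOKEN]))
W_edited = edit_rank_one(W_down, u, v, u_new)

# The edit only affects hidden states that the subcomponent reads from.
h = v.copy()                       # hidden activation aligned with the read direction
delta = W_edited @ h - W_down @ h  # equals (u_new - u) * (v @ h)
logit_shift = W_U @ delta          # change in every token's logit
```

In this toy setup `logit_shift[O_TOKEN]` is positive, so whenever the subcomponent fires the model is pushed toward the 'o' token, and no gradient step is involved.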

Claims (1)

Rank-one matrix decomposition constraint enforcing mechanistic simplicity
Core design principle of VPD: each parameter subcomponent is constrained to be a simple rank-one matrix to enable isolated understanding and combination.
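
As a hedged illustration of what the rank-one constraint implies (the symbols below are assumed notation, not taken from the source): a rank-one subcomponent is the outer product of one output direction and one input direction, so it reads its input along a single direction and writes along a single direction.

```latex
S_c = u_c v_c^{\top}, \qquad
u_c \in \mathbb{R}^{d_{\text{out}}},\quad v_c \in \mathbb{R}^{d_{\text{in}}}, \qquad
S_c\, x = (v_c^{\top} x)\, u_c
```

Under this reading, replacing $u_c$ changes what the subcomponent writes without changing when it activates, which is one way to see why such subcomponents can be understood and edited in isolation.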