question
active
question:can-we-use-these-parameter-subcomponents-to-perform-clean-targeted-changes
Can we use these parameter subcomponents to perform clean, targeted changes?
Implicit question driving the editing experiment.
Source paper
extracted_from
Neighborhood (ranked by edge count)
Findings (1)
finding
- Direct parameter subcomponent overwrite produces a clean behavioral change without training.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Interpretive claim that the subcomponents correspond to real functional units.
- One of the simple rank-one matrices resulting from VPD that sums with others to reconstruct the original model weights and has a specific functional role.
- Definitional principle guiding VPD: subcomponents should encode narrow, targeted computational roles rather than distributed, multi-purpose functionality.
- First question posed after applying VPD, investigating whether the subcomponents make sense.
- Technique to alter model behavior by directly editing a parameter subcomponent without training, demonstrated by changing an emoticon eye subcomponent.
- Demonstrated that VPD-discovered subcomponents encode true computational machinery by enabling targeted, predictable behavior changes without gradient-based training.
- Attention computations distribute across heads via parameter subcomponents with interpretable roles · finding · 0.736 · Mechanistic discovery about how attention mechanisms decompose into interpretable parameter components.
- Emphasizes that anchoring is a binding process without weight updates.
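
The entries above describe subcomponents as rank-one matrices that sum to the original weights, and an edit that overwrites one subcomponent without training. A minimal sketch of that idea on a toy matrix follows. It is not the paper's procedure: the shapes, the random factors `U` and `V`, and the helper names `reconstruct` and `overwrite_subcomponent` are all hypothetical, chosen only to illustrate why such an edit can be "clean".

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: C rank-one subcomponents u_c v_c^T (not real VPD output).
d_out, d_in, C = 4, 5, 3
U = rng.normal(size=(C, d_out))  # output-side vectors u_c
V = rng.normal(size=(C, d_in))   # input-side vectors v_c


def reconstruct(U, V):
    """Sum the rank-one subcomponents u_c v_c^T into one weight matrix."""
    return np.einsum("co,ci->oi", U, V)


def overwrite_subcomponent(U, V, idx, new_u):
    """Return copies of (U, V) with the output vector of subcomponent idx replaced."""
    U_new = U.copy()
    U_new[idx] = new_u
    return U_new, V.copy()


W = reconstruct(U, V)

# Direct edit: replace what subcomponent 1 writes, with no gradient step.
new_u = rng.normal(size=d_out)
U_edit, V_edit = overwrite_subcomponent(U, V, 1, new_u)
W_edit = reconstruct(U_edit, V_edit)

# The weight change is itself a single rank-one term.
delta = W_edit - W
expected_delta = np.outer(new_u - U[1], V[1])

# An input orthogonal to v_1 never activates the edited subcomponent,
# so the edited and original weights agree on it.
x = rng.normal(size=d_in)
x -= (x @ V[1]) / (V[1] @ V[1]) * V[1]
unchanged = np.allclose(W_edit @ x, W @ x)
```

In this toy setting the edit is targeted in a precise sense: only inputs with a component along `V[1]` see different outputs. Whether subcomponents of a real model are this well separated is the empirical question this entry asks.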