method
active
method:model-editing-via-direct-subcomponent-overwriteModel editing via direct subcomponent overwrite
Technique to alter model behavior by directly editing a parameter subcomponent without training, demonstrated by changing an emoticon eye subcomponent.
Neighborhood — ranked by edge-count
Papers (1)
paper
- Interpreting Language Model Parametersintroduces
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Technique for modifying model knowledge or behavior via targeted interventions, e.g., ROME by Meng et al.
- Demonstrated that VPD-discovered subcomponents encode true computational machinery by enabling targeted, predictable behavior changes without gradient-based training.
- Application enabled by VPD: direct manipulation of weight matrices for interpretable model modification.
- Implicit question driving the editing experiment.
- Ability to surgically alter model behavior through direct parameter changes rather than activation interventions.
- Interpretive claim that the subcomponents correspond to real functional units.
- One of the simple rank-one matrices resulting from VPD that sums with others to reconstruct the original model weights and has a specific functional role.
- All models exhibit above-baseline representation of the think word when instructed to think about itfinding0.718In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.