question

active

question:are-the-resulting-parameter-subcomponents-actually-interpretable-objects

are the resulting parameter subcomponents actually interpretable objects?

The first question posed after applying VPD: do the resulting subcomponents correspond to meaningful, interpretable objects?

Source paper

extracted_from

cimcWhitepaper

Neighborhood — ranked by edge-count

Findings (1)

finding

Subcomponent L2.MLP.down:3382 (density 0.00%) predicts emoticon continuations after a colon, semicolon, or equals sign
answered_by
A specific discovered subcomponent that activates on punctuation such as ' :', ' ;', ' =', and ':-' and predicts the remainder of an emoticon or emoji.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Parameter subcomponent · concept · 0.788
One of the simple rank-one matrices produced by VPD. The subcomponents sum to reconstruct the original model weights, and each has a specific functional role.
Attention computations distribute across heads via parameter subcomponents with interpretable roles · finding · 0.783
Mechanistic discovery about how attention mechanisms decompose into interpretable parameter components.
Parameter subcomponents cleanly isolate true mechanisms of the model · claim · 0.777
Interpretive claim that the subcomponents correspond to real functional units.
VPD subcomponents are sparse, interpretable, and avoid feature splitting · claim · 0.763
Assertion about the qualitative advantages of VPD's rank-one decomposition.
A good parameter subcomponent is causally important only for specific roles and can be removed from the model without hurting performance on irrelevant prompts · claim · 0.761
Definitional principle guiding VPD: subcomponents should encode narrow, targeted computational roles rather than distributed, multi-purpose functionality.
can we use these parameter subcomponents to perform clean, targeted changes? · question · 0.756
Implicit question driving the editing experiment.
Activation-based interpretability does not immediately explain the computations that gave rise to activations; understanding parameters is necessary for deeper insight · claim · 0.750
Motivates shift from studying model activations ('thoughts') to understanding parameters ('the computations themselves').
Interpretable predictions can help resolve variants of uncertain significance · claim · 0.746
Motivating claim that mechanistic explanations add clinical value for VUS.
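
Illustrative sketch (not part of the source paper): two neighbors above describe subcomponents as rank-one matrices that sum to the original weights, and as pieces that can be removed without affecting irrelevant prompts. The toy example below shows only that arithmetic. It is not VPD: the vectors, the 2×3 weight shape, and the helper names (`outer`, `reconstruct`, `matvec`) are hypothetical and hand-picked so that each component reads a different input direction.

```python
# Toy illustration only. The vectors are made up; this does not implement VPD.

def outer(u, v):
    """Rank-one matrix u v^T, stored as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def add(a, b):
    """Elementwise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def matvec(m, x):
    """Matrix-vector product m @ x."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]

def reconstruct(components, n_out=2, n_in=3):
    """Sum the rank-one pieces u v^T into a single weight matrix."""
    total = [[0.0] * n_in for _ in range(n_out)]
    for u, v in components:
        total = add(total, outer(u, v))
    return total

# Hypothetical subcomponents: each pair (u, v) contributes u v^T to the weights.
subcomponents = [
    ([1.0, 0.0], [1.0, 0.0, 0.0]),  # reads input direction 0
    ([0.0, 2.0], [0.0, 1.0, 0.0]),  # reads input direction 1
    ([1.0, 1.0], [0.0, 0.0, 1.0]),  # reads input direction 2
]

W = reconstruct(subcomponents)              # all pieces summed
W_ablated = reconstruct(subcomponents[:2])  # third piece removed

relevant = [0.0, 0.0, 1.0]     # input the removed piece responds to
irrelevant = [3.0, -1.0, 0.0]  # input orthogonal to the removed piece's v

out_full_irrelevant = matvec(W, irrelevant)
out_ablated_irrelevant = matvec(W_ablated, irrelevant)
out_full_relevant = matvec(W, relevant)
out_ablated_relevant = matvec(W_ablated, relevant)

print(out_full_irrelevant, out_ablated_irrelevant)  # unchanged by the ablation
print(out_full_relevant, out_ablated_relevant)      # changed by the ablation
```

In this constructed case the ablation leaves the output on the irrelevant input untouched only because that input is exactly orthogonal to the removed component's read direction. Whether subcomponents of a real model behave this cleanly is the empirical question this entity records.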