method
active
method:adversarial-search-for-causally-unimportant-subcomponents
Adversarial search for causally unimportant subcomponents
Procedure in VPD that actively searches for combinations that break the prediction of which subcomponents are unimportant, stress-testing the decomposition.
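The idea of searching for combinations that break an unimportance prediction can be illustrated with a toy sketch. It is not the procedure from the paper: the saturating `toy_model`, the component names, and the brute-force `adversarial_search` are all hypothetical, chosen only to show why testing subcomponents one at a time can miss a jointly important combination.

```python
from itertools import combinations


def toy_model(active):
    """Hypothetical toy model: a saturating sum of subcomponent contributions.

    Components "a" and "b" are redundant, so removing either one alone
    leaves the output unchanged.
    """
    contributions = {"a": 1.0, "b": 1.0, "c": 0.0}
    return min(1.0, sum(contributions[name] for name in active))


def adversarial_search(model, all_components, predicted_unimportant):
    """Brute-force search over subsets of the predicted-unimportant set.

    Returns the first ablated subset that achieves the largest change in
    output relative to the unablated baseline. Exponential in the set size,
    so this is an illustration only.
    """
    baseline = model(set(all_components))
    worst_subset, worst_gap = (), 0.0
    for size in range(1, len(predicted_unimportant) + 1):
        for subset in combinations(sorted(predicted_unimportant), size):
            gap = abs(model(set(all_components) - set(subset)) - baseline)
            if gap > worst_gap:
                worst_subset, worst_gap = subset, gap
    return worst_subset, worst_gap


components = ["a", "b", "c"]
# Single ablations change nothing, so a naive test marks all three unimportant.
predicted_unimportant = ["a", "b", "c"]
worst_subset, worst_gap = adversarial_search(
    toy_model, components, predicted_unimportant
)
print(worst_subset, worst_gap)
```

In this toy setting the search reports the pair `("a", "b")` as a counterexample to the prediction. A practical method would need something cheaper than exhaustive enumeration.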
Neighborhood — ranked by edge-count
Papers (1)
paper
- Interpreting Language Model Parameters (introduces)
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Competitive multi-agent setting with conflicting incentives and direct opposition via bidding and bluffing.
- The atomic pieces into which weight matrices are decomposed by VPD; each rank-one component is interpretable.
- Motivated by the finding that lexical entailment decomposes into word identities.
- Technique used in VPD to enforce mechanistic faithfulness of parameter decompositions.
- Contiguous subspace of the aligned latent vector encoding behaviorally relevant information for a specific causal variable.
- Definitional principle guiding VPD: subcomponents should encode narrow, targeted computational roles rather than distributed, multi-purpose functionality.
- One of the simple rank-one matrices resulting from VPD that sums with others to reconstruct the original model weights and has a specific functional role.
- The formal method used to establish that the identified circuit causally mediates the model's cyclic reasoning behavior.