Adversarial search for causally unimportant subcomponents

A procedure in VPD that actively searches for combinations of subcomponents that break the prediction of which subcomponents are causally unimportant, thereby stress-testing the decomposition.
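To make the idea concrete, here is a minimal toy sketch. It is not the procedure from the paper: the one-layer ReLU model, the brute-force enumeration over subsets, and the names `model_output` and `adversarial_search` are all assumptions made for illustration. It shows why combinations need to be searched: two rank-one subcomponents can each look unimportant when ablated alone, yet change the output when ablated together.

```python
from itertools import combinations


def model_output(components, x, ablated=frozenset()):
    """Toy model (assumption): ReLU applied to the sum of rank-one maps u * (v . x).

    Each component is a pair (u, v) standing for the rank-one matrix u v^T.
    Components whose index is in `ablated` are left out of the sum.
    """
    pre = [0.0] * len(components[0][0])
    for idx, (u, v) in enumerate(components):
        if idx in ablated:
            continue
        dot = sum(vi * xi for vi, xi in zip(v, x))
        pre = [p + ui * dot for p, ui in zip(pre, u)]
    return [max(0.0, p) for p in pre]


def adversarial_search(components, predicted_unimportant, x):
    """Brute-force search (assumption) for the worst ablation subset.

    Returns the subset of `predicted_unimportant` whose removal changes the
    output most, together with that maximum absolute change.
    """
    baseline = model_output(components, x)
    worst_subset, worst_err = (), 0.0
    for k in range(1, len(predicted_unimportant) + 1):
        for subset in combinations(predicted_unimportant, k):
            out = model_output(components, x, frozenset(subset))
            err = max(abs(a - b) for a, b in zip(out, baseline))
            if err > worst_err:
                worst_subset, worst_err = subset, err
    return worst_subset, worst_err


# Hypothetical data: component 0 contributes +1, components 1 and 2 each
# contribute -2, so the ReLU output is 0 unless both 1 and 2 are removed.
components = [([1.0], [1.0]), ([-2.0], [1.0]), ([-2.0], [1.0])]
x = [1.0]

worst_subset, worst_err = adversarial_search(components, [1, 2], x)
print(worst_subset, worst_err)  # (1, 2) 1.0
```

Ablating component 1 or component 2 alone leaves the output at 0, so a one-at-a-time check would confirm the "unimportant" prediction. Only the joint ablation exposes the error. Exhaustive enumeration grows exponentially with the number of candidates, so a real search would need a cheaper strategy than the one sketched here.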

Neighborhood (ranked by edge count)

Papers (1)

Interpreting Language Model Parameters · paper · introduces

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

adversarial interaction · concept · 0.761
Competitive multi-agent setting with conflicting incentives and direct opposition via bidding and bluffing.

Rank-one subcomponents · concept · 0.743
The atomic pieces into which weight matrices are decomposed by VPD; each rank-one component is interpretable.

Investigating the causal substructure of neural representations is necessary to avoid misidentifying data structures of simpler representations as abstract concepts · claim · 0.743
Motivated by the finding that lexical entailment decomposes into word identities.

Adversarial ablation · method · 0.736
Technique used in VPD to enforce mechanistic faithfulness of parameter decompositions.

Causally Relevant Latent Subspace · concept · 0.733
Contiguous subspace of the aligned latent vector encoding behaviorally relevant information for a specific causal variable.

A good parameter subcomponent is causally important only for specific roles and can be removed from the model without hurting performance on irrelevant prompts · claim · 0.724
Definitional principle guiding VPD: subcomponents should encode narrow, targeted computational roles rather than distributed, multi-purpose functionality.

Parameter subcomponent · concept · 0.723
One of the simple rank-one matrices resulting from VPD that sums with others to reconstruct the original model weights and has a specific functional role.

Causal abstraction analysis · method · 0.722
The formal method used to establish that the identified circuit causally mediates the model's cyclic reasoning behavior.