Bottom-up interpretability

An interpretability paradigm that explains computation in the model's own terms, rather than imposing top-down abstractions; VPD aims to realize this.

Neighborhood — ranked by edge-count

Claims (1)

claim

VPD is a meaningful step toward bottom-up interpretability
associated_with
Positioning of VPD as advancing the paradigm of explaining computation in the model's terms.

Methods (1)

method

Adversarial Parameter Decomposition (VPD)
implements
Core technique introduced in this paper for decomposing neural network weight matrices into mechanistically simple, interpretable rank-one subcomponents.
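
As a rough illustration of what "rank-one subcomponents" means in the method description above, the sketch below writes a small weight matrix as a sum of outer products u_i v_i^T. It is not the VPD procedure itself, which this page does not spell out. The vectors, the helper names (`outer`, `add`, `reconstruct`), and the 2x3 shape are all assumptions made up for the example.

```python
# Illustrative only: a weight matrix expressed as a sum of rank-one terms u_i v_i^T.
# This is NOT the VPD algorithm; all values and helper names are hypothetical.

def outer(u, v):
    """Return the outer product u v^T as a list of rows (a rank-one matrix)."""
    return [[ui * vj for vj in v] for ui in u]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def reconstruct(components, rows, cols):
    """Sum the rank-one subcomponents back into a full weight matrix."""
    W = [[0.0] * cols for _ in range(rows)]
    for u, v in components:
        W = add(W, outer(u, v))
    return W

# Two made-up rank-one subcomponents of a 2x3 weight matrix.
components = [
    ([1.0, 0.0], [2.0, 0.0, 1.0]),
    ([0.0, 1.0], [0.0, 3.0, 0.0]),
]

W = reconstruct(components, rows=2, cols=3)
print(W)
```

Each `(u, v)` pair here can be inspected on its own, which is the property that makes rank-one pieces attractive as units of analysis. How a method selects or trains such pieces is a separate question that this sketch does not address.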

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Bottom-up interpretability explains computation in the model's own terms rather than imposing top-down abstractions · claim · 0.859
VPD is positioned as advancing a paradigm shift from top-down mechanistic interpretability (activation-based) to parameter-centric, data-driven discovery.
interpretability · concept · 0.814
The capability to explain model predictions; a central theme of the paper, with disruption profiles as vehicle.
Bottom-up Selection · concept · 0.801
bottom-up sensory information · concept · 0.792
Sensory input providing evidence that is integrated with priors.
Automated Interpretability · framework · 0.790
Method using large language models (Claude) to generate and test explanations of features at scale
Interpretability Illusion · concept · 0.780
Cases where subspace interventions change model behaviour through parallel pathways rather than the target feature
Interpretability as Natural Science · framework · 0.774
Proposed paradigm for evaluating interpretability work through empirical falsifiability rather than benchmarks or user studies
Circuit Interpretability · concept · 0.773
Advantage of DiffLogic CA over NCA — learned rules are pure binary logic circuits that can be visualized and analyzed
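
The "Related by similarity" filter (cosine ≥ 0.65, no typed edge) could be computed roughly as in the sketch below. This is an assumption about how such a list might be built, not a description of the system behind this page. The function names, the toy two-dimensional embeddings, and the entity labels are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that have no typed
    edge to the target, sorted by descending score."""
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in typed_edges:
            continue  # skip the entity itself and anything already linked
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings; real ones would be high-dimensional.
embeddings = {
    "A": [1.0, 0.0],
    "B": [1.0, 0.5],   # similar to A, no typed edge -> listed
    "C": [0.0, 1.0],   # orthogonal to A -> filtered out by the threshold
    "D": [1.0, 0.1],   # similar to A but already linked -> excluded
}
result = related_by_similarity("A", embeddings, typed_edges={"D"})
print(result)
```

Entities that pass this kind of filter are only candidates: a high cosine score suggests a possible missing edge or duplicate, but does not establish one.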