concept
active
concept:bottom-up-interpretability
Bottom-up interpretability
An interpretability paradigm that explains computation in the model's own terms, rather than imposing top-down abstractions; VPD aims to realize this.
Neighborhood — ranked by edge-count
Claims (1)
claim
- VPD is a meaningful step toward bottom-up interpretability · associated_with · Positioning of VPD as advancing the paradigm of explaining computation in the model's terms.
Methods (1)
method
- Core technique introduced in this paper for decomposing neural network weight matrices into mechanistically simple, interpretable rank-one subcomponents.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- VPD is positioned as advancing a paradigm shift from top-down mechanistic interpretability (activation-based) to parameter-centric, data-driven discovery.
- The capability to explain model predictions; a central theme of the paper, with disruption profiles as vehicle.
- Sensory input providing evidence that is integrated with priors.
- Method using large language models (Claude) to generate and test explanations of features at scale.
- Cases where subspace interventions change model behaviour through parallel pathways rather than the target feature.
- Proposed paradigm for evaluating interpretability work through empirical falsifiability rather than benchmarks or user studies.
- Advantage of DiffLogic CA over NCA: learned rules are pure binary logic circuits that can be visualized and analyzed.