question

active

question:can-an-interpretable-symbolic-algorithm-be-used-to-faithfully-explain-a-complex-neural-network-model

Can an interpretable symbolic algorithm be used to faithfully explain a complex neural network model?

Framing question for the paper's research program.

Source paper

extracted_from

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations

(2023) · Atticus Geiger · Zhengxuan Wu · Christopher Potts · Thomas Icard +1

Neighborhood — ranked by edge-count

Claims (1)

claim

DAS overcomes the localist limitation of prior causal abstraction by allowing individual neurons to play multiple roles via non-standard bases
gates
Central claim motivating DAS over prior methods.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Neural Network Interpretabilityconcept0.797
The field aimed at understanding what neural networks have learned; characterized as pre-paradigmatic in this paper
Interpretability tools can reveal what 'feeling alive' looks like inside a neural network model.claim0.785
Sparse interpretability in neural networksconcept0.765
VPD achieves sparse, interpretable parameter subcomponents with improved sparsity-reconstruction tradeoff.
"the 'code' these algorithms are implemented by is 'neural code', which is written, somehow, as enormous, inscrutable matrices of parameters"quote0.760
Load-bearing framing of the core interpretability problem: neural networks encode algorithms in parameter matrices rather than human-readable code.
Ultimately, we would like to understand neural networks well enough to be able to intentionally design them.quote0.755
Vision statement in the conclusion.
Neural networks, trained with different objectives on different data and modalities, are converging to a shared statistical model of reality in their representation spaces.quote0.750
The paper's central thesis statement, presented prominently after the abstract
Different neural network models trained on different objectives and modalities are converging to a shared statistical model of reality in their representation spaceshypothesis0.750
The central hypothesis of the paper; the platonic representation hypothesis itself
The feed-forward network truly implements a symbolic, tree-structured algorithm for hierarchical equality, with abstract equality relations not decomposable into input identitiesclaim0.750
DAS reveals that the neural network encodes abstract relational structure rather than raw input identities.