claim
active
claim:individual-floating-point-number-weights-in-neural-networks-become-meaningful-once-you-understand-the-features-they-connectIndividual floating-point number weights in neural networks become meaningful once you understand the features they connect.
Interpretive claim that circuits render raw weights interpretable as algorithmic structures
Source paper
extracted_from(2020) · Chris Olah · Nick Cammarata · Ludwig Schubert · Gabriel Goh +2
Neighborhood — ranked by edge-count
Findings (1)
finding
- Demonstrates that meaningful algorithms can be read directly off floating-point weights in a neural network
Quotes (1)
quote
- Load-bearing claim about the tractability of circuit analysis; central thesis of Claim 2
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Central motivating question for the circuits research program
- Superposition hypothesis: neural networks represent more features than dimensions using almost-orthogonal directions.hypothesis0.787Explanation for why dictionary learning can recover many more features than dimensions.
- Linear representation hypothesis: neural networks represent meaningful concepts as directions in their activation spaces.hypothesis0.783Foundation for interpreting features as linear directions.
- Second of three speculative claims asserting that subgraphs of neural networks are tractable and meaningful objects of study
- Primary empirical claim of the paper
- Vision statement in the conclusion.
- The paper's central thesis statement, presented prominently after the abstract