claim
active
Results collectively provide strong evidence that some version of the superposition hypothesis and linear representation hypothesis is true
The authors' overall conclusion, drawn from four lines of evidence: the number of interpretable features, the correspondence between activation level and intensity, sensible logit weights, and interference weights
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- Superposition Hypothesis · supports · Core theoretical framework: neural networks represent more features than neurons by encoding features as directions in superposition
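The geometric idea behind this framework can be illustrated with a minimal sketch. It is not the source paper's method. It assumes features are random unit directions, and the names `random_unit_vectors`, `max_abs_cosine`, `N_DIMS`, and `N_FEATURES` are hypothetical, chosen only for this illustration. The sketch packs four times more feature directions than dimensions into one space, measures the worst pairwise interference, and checks that a single active feature can still be identified by a linear readout.

```python
import math
import random


def random_unit_vectors(n_features, n_dims, seed=0):
    """Draw n_features random directions in n_dims dimensions, each scaled to unit length."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(n_features):
        v = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v])
    return vectors


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def max_abs_cosine(vectors):
    """Largest |cosine| between any two distinct unit vectors (their worst-case interference)."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            worst = max(worst, abs(dot(vectors[i], vectors[j])))
    return worst


# Assumed toy sizes: 256 feature directions in a 64-dimensional space.
N_DIMS = 64
N_FEATURES = 256

features = random_unit_vectors(N_FEATURES, N_DIMS)
interference = max_abs_cosine(features)

# Activate one feature and read every feature back with a dot product.
active = 17
activation = features[active]
readouts = [dot(activation, f) for f in features]
recovered = max(range(N_FEATURES), key=lambda k: readouts[k])

print(f"{N_FEATURES} features in {N_DIMS} dims, max |cosine| = {interference:.3f}")
print(f"active feature {active}, recovered feature {recovered}")
```

The readout of the active feature is its squared norm (1.0), while every other readout is a cosine with magnitude below 1, which is why the linear readout still picks out the right feature despite the interference. Real models are not expected to use random directions; the random draw is only a convenient stand-in for "almost-orthogonal".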
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
- Superposition hypothesis: neural networks represent more features than dimensions using almost-orthogonal directions. · hypothesis · 0.776 · Explanation for why dictionary learning can recover many more features than dimensions.
- Interpretive synthesis of DIM and cone intervention successes
- Hypothesis that information may be encoded in arbitrary non-linear subspaces of a neural network
- Theoretical open question about the geometry of truth in LLMs raised in Discussion
- Claim about broader applicability of the scaling argument
- Future work direction identified in conclusion for enabling reliable truth assessment methods.
- Historical framing of how representation assumptions have evolved in causal interpretability