claim
active
claim:assuming-linear-representations-enables-identifying-the-location-of-certain-variables-in-a-dnn-but-many-insights-fail-to-generalise-when-more-powerful-non-linear-maps-are-used
Assuming linear representations enables identifying the location of certain variables in a DNN, but many insights fail to generalise when more powerful non-linear maps are used
Interpretive claim about what linear DAS results actually tell us
Source paper
extracted_from (2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Linear representation hypothesis: neural networks represent meaningful concepts as directions in their activation spaces. · hypothesis · 0.815 · Foundation for interpreting features as linear directions.
- Authors' tentative hypothesis from Fig. 4; they acknowledge that they cannot formalise this intuition
- Corroborating result on additional task confirming main paper findings
- Recent work identifying cases where LLM features are not one-dimensionally linear, a caveat to the linearity hypothesis.
- Establishes that the observed linear structure is not merely a representation of text probability
- The idea that features are encoded as directions in activation space.
- Hypothesis that information may be encoded in arbitrary non-linear subspaces of a neural network
- Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.