claim

active

claim:assuming-linear-representations-enables-identifying-the-location-of-certain-variables-in-a-dnn-but-many-insights-fail-to-generalise-when-more-powerful-non-linear-maps-are-used

Assuming linear representations enables identifying the location of certain variables in a DNN, but many insights fail to generalise when more powerful non-linear maps are used

Interpretive claim about what linear DAS results actually tell us

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Linear representation hypothesis: neural networks represent meaningful concepts as directions in their activation spaces.hypothesis0.815
Foundation for interpreting features as linear directions.
The fact that ϕ_lin tracks DNN performance more closely than ϕ_nonlin throughout training may support the linear representation hypothesis for IOI task featureshypothesis0.802
Authors' tentative hypothesis from Fig. 4 but they acknowledge they cannot formalise this intuition
Non-linear ϕ_nonlin achieves near-perfect IIA on distributive law task for both And-Or and And-Or-And algorithms, eliminating linear/identity map differencesfinding0.792
Corroborating result on additional task confirming main paper findings
Non-Linear Representations in LLMsconcept0.783
Recent work identifying cases where LLM features are not one-dimensionally linear, a caveat to the linearity hypothesis.
LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on likely performing poorly on anti-correlated datasetsclaim0.780
Establishes that the observed linear structure is not merely a representation of text probability
Linear representationconcept0.780
The idea that features are encoded as directions in activation space.
Non-Linear Representation Hypothesisconcept0.779
Hypothesis that information may be encoded in arbitrary non-linear subspaces of a neural network
Accuracy does not vary linearly with latent reflection directions; instead it follows a more non-linear mapping that requires deeper theoretical treatment.claim0.776
Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.