Non-Linear Representation Dilemma

Core contribution: the impasse in which lifting the linearity constraint on alignment maps makes causal abstraction vacuous, while keeping the constraint risks missing non-linearly encoded features

Neighborhood — ranked by edge-count

Papers (1)

paper

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
introduces

Questions (2)

question

What is the connection between information encoding assumptions and causal abstraction?
gates
Identified as exciting future work direction
What should you do if you want to perform a causal analysis of your DNN?
associated_with
Practical question the paper attempts to answer in its conclusion

Concepts (1)

concept

Probing Complexity–Accuracy Trade-off
analogous_to
Longstanding debate from probing literature about whether complex probes reveal genuine encodings or just memorise; this paper revives it for causal abstraction
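
The trade-off described above can be illustrated with a toy sketch. This is not the paper's method: the 2-D "activations", the XOR-style feature, and the helper names (`fit_linear_probe`, `lift`, `nn_predict`) are all hypothetical choices made for illustration. It shows a linear probe missing a non-linearly encoded feature, a slightly richer probe recovering it, and a highly expressive probe fitting random labels perfectly on its training data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations: 2-D points. The "feature" is the XOR of the
# coordinate signs, which no single linear direction can read out.
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)
X_train, X_test = X[:1000], X[1000:]
y_train, y_test = y[:1000], y[1000:]


def fit_linear_probe(features, labels):
    """Least-squares linear probe with a bias term; returns the weights."""
    A = np.hstack([features, np.ones((len(features), 1))])
    w, *_ = np.linalg.lstsq(A, labels, rcond=None)
    return w


def accuracy(features, labels, w):
    """Fraction of examples whose predicted sign matches the label."""
    A = np.hstack([features, np.ones((len(features), 1))])
    return float(np.mean(np.sign(A @ w) == labels))


def lift(features):
    """Non-linear map (assumed for this toy): append the product x0 * x1."""
    return np.hstack([features, features[:, :1] * features[:, 1:2]])


def nn_predict(train_X, train_y, query):
    """1-nearest-neighbour lookup: an extremely expressive 'probe'."""
    d = ((query[:, None, :] - train_X[None, :, :]) ** 2).sum(-1)
    return train_y[d.argmin(axis=1)]


# Linear probe on raw activations: close to chance on the XOR feature.
linear_acc = accuracy(X_test, y_test, fit_linear_probe(X_train, y_train))

# Same readout after a non-linear lift: the feature becomes decodable.
nonlinear_acc = accuracy(
    lift(X_test), y_test, fit_linear_probe(lift(X_train), y_train)
)

# Random labels carry no information about the activations, yet the
# nearest-neighbour probe reproduces them on the data it was fit to.
y_rand_train = rng.choice([-1.0, 1.0], size=1000)
y_rand_test = rng.choice([-1.0, 1.0], size=1000)
memo_train_acc = float(
    np.mean(nn_predict(X_train, y_rand_train, X_train) == y_rand_train)
)
memo_test_acc = float(
    np.mean(nn_predict(X_train, y_rand_train, X_test) == y_rand_test)
)

print(linear_acc, nonlinear_acc, memo_train_acc, memo_test_acc)
```

In this toy setting, high accuracy from a flexible map does not by itself distinguish a genuine encoding from memorisation, which is the concern the probing debate raises and which this entity carries over to alignment maps in causal abstraction.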

Findings (1)

finding

Theorem 1: Any algorithm is an input-restricted distributed abstraction of any DNN satisfying mild assumptions
supports
Central theoretical result proving unrestricted causal abstraction is trivial

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Non-Linear Representation Hypothesis · concept · 0.824
Hypothesis that information may be encoded in arbitrary non-linear subspaces of a neural network
Non-Linear Representations in LLMs · concept · 0.816
Recent work identifying cases where LLM features are not one-dimensionally linear, a caveat to the linearity hypothesis.
Linear representation · concept · 0.809
The idea that features are encoded as directions in activation space.
Natural Distribution of Representations · concept · 0.768
The distribution of latent representations produced by the model under unperturbed inputs
Linear Representation Hypothesis · framework · 0.765
The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
Linear Representation of Concepts in LLMs · concept · 0.748
The finding that interpretable concepts including character traits are encoded as linear directions in transformer residual streams
Representing non-linearly separable functions requires a network with multiple layers. · claim · 0.746
Architectural requirement from machine learning.
Assuming linear representations enables identifying the location of certain variables in a DNN, but many insights fail to generalise when more powerful non-linear maps are used · claim · 0.746
Interpretive claim about what linear DAS results actually tell us