Linear Representation Hypothesis

The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior

Neighborhood — ranked by edge-count

Papers (5)

paper

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
contradicts · extends
Model Alignment Search
introduces · mentions
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
mentions · supports
Psychological Steering of Large Language Models
cites
From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
implements

Thinkers (1)

thinker

Kiho Park
introduces
Formalized the Linear Representation Hypothesis in the ICML 2024 paper

Methods (6)

method

Activation Steering
implements
Causal intervention technique: edit the NLA explanation, reconstruct it via AR, and use the difference as a steering vector to manipulate model behavior.
Distributed Alignment Search
implements
The core method introduced in this paper: finds alignments between high-level causal variables and distributed neural representations via gradient descent.
1D Distributed Interchange Intervention (1D DII)
implements
Core intervention method used throughout CausalGym; operates on a one-dimensional, non-basis-aligned subspace of activation space
Difference-in-Means
implements
Method for extracting linear directions by subtracting mean activations of contrastive groups; used to define the Assistant Axis
Linear Alignment Map (ϕ_lin)
implements
Alignment map ϕ(h) = W_orth h, where W_orth is an orthogonal matrix; assumes the linear representation hypothesis
Linear Probing
implements
Used to evaluate representation quality across VTAB tasks
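
Several of the methods above (Difference-in-Means, Activation Steering, 1D DII) share one operation: reading or writing the component of an activation along a single direction. The following is a minimal NumPy sketch of that shared idea, not the implementation from any of the listed papers. The function names, the synthetic activations, and the planted direction are all assumptions made for illustration; in particular, `interchange_1d` uses a fixed unit direction where a method such as 1D DII would use a learned one.

```python
import numpy as np


def difference_in_means(pos_acts, neg_acts):
    """Unit direction pointing from the mean of neg_acts to the mean of pos_acts."""
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def steer(hidden, direction, alpha):
    """Shift a hidden state by alpha along a unit direction."""
    return hidden + alpha * direction


def interchange_1d(base, source, direction):
    """Replace base's component along a unit direction with source's component.

    Everything orthogonal to the direction is left as it was in base.
    """
    return base + np.dot(source - base, direction) * direction


# Synthetic stand-in for contrastive activations (hypothetical data):
# two groups that differ only along one planted axis, plus Gaussian noise.
rng = np.random.default_rng(0)
dim = 16
planted = np.zeros(dim)
planted[0] = 1.0
pos = rng.normal(size=(200, dim)) + 2.0 * planted
neg = rng.normal(size=(200, dim)) - 2.0 * planted

direction = difference_in_means(pos, neg)
print("cosine with planted direction:", float(direction @ planted))

hidden = rng.normal(size=dim)
steered = steer(hidden, direction, alpha=3.0)
print("projection before/after steering:",
      float(hidden @ direction), float(steered @ direction))
```

Because the direction is normalized, steering by alpha changes the projection onto that direction by exactly alpha, which is the property a near-linear trait response would rely on if the hypothesis holds.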

Concepts (4)

concept

Non-Linear Representation Hypothesis
extends
Hypothesis that information may be encoded in arbitrary non-linear subspaces of a neural network
Non-Linear Representations in LLMs
contradicts
Recent work identifying cases where LLM features are not one-dimensionally linear, a caveat to the linearity hypothesis.
Privileged Bases Hypothesis
extends
Hypothesis that neurons form privileged bases to encode information; consistent with constructive abstraction
Concept Direction in Representation Space
cites
A vector in activation space aligned with a behavioral concept; core object manipulated by RepE methods

Claims (3)

claim

MDS injections align with the Linear Representation Hypothesis: target trait varies near-linearly with alpha in open-ended generation
supports
Theoretical alignment claim backed by OLS R² analysis showing 96.15% of trends have R² ≥ 0.75
Accuracy does not vary linearly with latent reflection directions; instead it follows a non-linear mapping that requires deeper theoretical treatment.
extends
Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.
Given the linear representation hypothesis and binary linguistic features, 1D DII is sufficiently expressive for controlling model behaviour in CausalGym
extends
Theoretical justification for the methodological choice of 1D DII throughout the benchmark
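
The first claim above rests on fitting an ordinary least squares line to each trait-versus-alpha trend and counting how many fits reach a given R². A hedged sketch of that kind of check follows; the helper names `ols_r2` and `fraction_linear`, the default threshold argument, and the two toy trends are assumptions for illustration, not the authors' analysis code or data.

```python
import numpy as np


def ols_r2(alpha, score):
    """R^2 of a straight-line OLS fit of score against alpha.

    Assumes score is not constant (otherwise the total sum of squares is zero).
    """
    alpha = np.asarray(alpha, dtype=float)
    score = np.asarray(score, dtype=float)
    slope, intercept = np.polyfit(alpha, score, 1)
    residuals = score - (slope * alpha + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((score - score.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def fraction_linear(trends, threshold=0.75):
    """Share of (alpha, score) trends whose OLS fit has R^2 >= threshold."""
    hits = [ols_r2(alpha, score) >= threshold for alpha, score in trends]
    return sum(hits) / len(hits)


# Toy trends (hypothetical): one linear in alpha, one symmetric and non-linear.
alphas = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
toy_trends = [
    (alphas, 0.5 * alphas + 1.0),
    (alphas, alphas ** 2),
]
print("fraction of toy trends passing:", fraction_linear(toy_trends))
```

A high R² only says a line summarizes the trend well over the sampled alpha range; it does not by itself rule out curvature outside that range.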

Frameworks (3)

framework

Representation Engineering
cites
A class of methods that modify how models internally process representations; SOO fine-tuning fits within this framework
Concept Cones
extends
The central framework this paper extends from refusal to propositional truth; identifies multi-dimensional subspaces that causally mediate target behaviors
Intrinsic Self-Correction via Linear Representations
implements
Framework by Lee et al. explaining self-correction via linear latent concept directions, closely related prior work.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Linear representation · concept · 0.857
The idea that features are encoded as directions in activation space.
Linear Representation of Features · concept · 0.821
The central object of study — the idea that a concept like truth is encoded as a direction in the LLM's latent space
Results collectively provide strong evidence that some version of the superposition hypothesis and linear representation hypothesis is true · claim · 0.800
Authors' overall conclusion from number of interpretable features, activation-level correspondence to intensity, sensible logit weights, and interference weights
Linear representation hypothesis: neural networks represent meaningful concepts as directions in their activation spaces · hypothesis · 0.793
Foundation for interpreting features as linear directions.
Linear Representation of Concepts in LLMs · concept · 0.782
The finding that interpretable concepts including character traits are encoded as linear directions in transformer residual streams
Truth may be linearly separable in the model's representation space, but the structure is richer than a single linear axis · claim · 0.768
Interpretive synthesis of DIM and cone intervention successes
Non-Linear Representation Dilemma · concept · 0.765
Core contribution: the impasse where lifting linearity in alignment maps makes causal abstraction vacuous, but keeping it may miss non-linearly encoded features
Assuming linear representations enables identifying the location of certain variables in a DNN, but many insights fail to generalise when more powerful non-linear maps are used · claim · 0.762
Interpretive claim about what linear DAS results actually tell us