framework
active
framework:linear-representation-hypothesis · Linear Representation Hypothesis
The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
Neighborhood — ranked by edge-count
Papers (5)
paper
- Model Alignment Search · introduces · mentions
Thinkers (1)
thinker
- Kiho Park · introduces · Formalized the Linear Representation Hypothesis in the ICML 2024 paper
Methods (6)
method
- Activation Steering · implements · Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
- Distributed Alignment Search · implements · The core method introduced in this paper: finds alignments between high-level causal variables and distributed neural representations via gradient descent.
- Core intervention method used throughout CausalGym; operates on one-dimensional non-basis-aligned subspace of activation space
- Difference-in-Means · implements · Method for extracting linear directions by subtracting mean activations of contrastive groups; used to define the Assistant Axis
- Linear Alignment Map (ϕ_lin) · implements · Alignment map ϕ(h) = W_orth·h using an orthogonal matrix; assumes the linear representation hypothesis
- Linear Probing · implements · Used to evaluate representation quality across VTAB tasks
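
The Difference-in-Means and steering entries above can be illustrated with a minimal sketch. It assumes activations are plain lists of floats; the helper names (`mean_vector`, `difference_in_means`, `steer`, `project`) and the toy data are hypothetical, not taken from any of the linked papers. The steering step is the generic "add a scaled direction" form, not the NLA/AR reconstruction procedure described in the Activation Steering entry.

```python
from statistics import fmean

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def difference_in_means(pos, neg):
    """Direction pointing from the mean of `neg` to the mean of `pos`."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    return [p - n for p, n in zip(mu_pos, mu_neg)]

def steer(h, direction, alpha):
    """Shift one activation vector by alpha * direction."""
    return [x + alpha * d for x, d in zip(h, direction)]

def project(h, direction):
    """Scalar projection of h onto the unit-normalised direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(x * d for x, d in zip(h, direction)) / norm

# Toy 3-d "activations" (made-up data, not from any model).
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]  # concept present
neg = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]  # concept absent

direction = difference_in_means(pos, neg)
h = [0.0, 1.0, 1.0]
steered = steer(h, direction, alpha=2.0)

print(direction)
print(project(h, direction), project(steered, direction))
```

Under the linear representation hypothesis, moving an activation along `direction` should change its projection onto the concept axis while leaving orthogonal components untouched, which is what the two printed projections show for this toy case.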
Concepts (4)
concept
- Hypothesis that information may be encoded in arbitrary non-linear subspaces of a neural network
- Non-Linear Representations in LLMs · contradicts · Recent work identifying cases where LLM features are not one-dimensionally linear, a caveat to the linearity hypothesis.
- Privileged Bases Hypothesis · extends · Hypothesis that neurons form privileged bases to encode information; consistent with constructive abstraction
- A vector in activation space aligned with a behavioral concept; core object manipulated by RepE methods
Claims (3)
claim
- Theoretical alignment claim backed by OLS R² analysis showing 96.15% of trends have R² ≥ 0.75
- Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.
- Theoretical justification for the methodological choice of 1D DII throughout the benchmark
Frameworks (3)
framework
- A class of methods that modify how models internally process representations; SOO fine-tuning fits within this framework
- Concept Cones · extends · The central framework this paper extends from refusal to propositional truth; identifies multi-dimensional subspaces that causally mediate target behaviors
- Framework by Lee et al. explaining self-correction via linear latent concept directions, closely related prior work.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge · Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- The idea that features are encoded as directions in activation space.
- The central object of study — the idea that a concept like truth is encoded as a direction in the LLM's latent space
- Authors' overall conclusion from number of interpretable features, activation-level correspondence to intensity, sensible logit weights, and interference weights
- Linear representation hypothesis: neural networks represent meaningful concepts as directions in their activation spaces. · hypothesis · 0.793 · Foundation for interpreting features as linear directions.
- The finding that interpretable concepts including character traits are encoded as linear directions in transformer residual streams
- Interpretive synthesis of DIM and cone intervention successes
- Core contribution: the impasse where lifting linearity in alignment maps makes causal abstraction vacuous, but keeping it may miss non-linearly encoded features
- Interpretive claim about what linear DAS results actually tell us