Linear representation

The idea that features are encoded as directions in activation space.
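
As a minimal sketch of what "a direction in activation space" can mean in practice: if a feature corresponds to a vector, its strength in a given activation can be read off by projecting onto that vector, and adding a multiple of the vector changes that reading by the same amount. The 3-dimensional space, the `direction` vector, and the helper names below are made up for illustration and do not come from any real model.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Rescale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def feature_strength(activation, direction):
    """Scalar projection of an activation onto the (normalized) feature direction."""
    return dot(activation, unit(direction))

# Hypothetical toy activation space; the direction is an assumption, not a learned one.
direction = [1.0, 1.0, 0.0]
base = [0.5, -0.5, 2.0]  # orthogonal to the direction, so the feature is "off"

# Move the activation 3 units along the feature direction.
steered = [b + 3.0 * d for b, d in zip(base, unit(direction))]

strength_base = feature_strength(base, direction)
strength_steered = feature_strength(steered, direction)
```

Under these assumptions the projection of `steered` exceeds that of `base` by exactly the amount added along the direction, which is the sense in which the representation is linear.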

Neighborhood — ranked by edge-count

paper

thinker

Neel Nanda
studies
External commenter; resolved an apparent counterexample to the linear representation hypothesis

framework

Superposition Hypothesis
supports
Core theoretical framework: neural networks represent more features than they have neurons by encoding features as directions in superposition

method

Sparse Autoencoders (SAE)
implements
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.

concept

Linear Representation of Features
related_to
The central object of study — the idea that a concept like truth is encoded as a direction in the LLM's latent space
Truth Direction
associated_with
A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements
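
One common way to estimate a candidate direction of this kind is the difference between the mean activation on true statements and the mean activation on false ones. The sketch below assumes that technique; the entries above do not say which method the underlying work uses. The 2-dimensional activations are invented toy data, and `difference_of_means` and `score` are hypothetical helper names.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def difference_of_means(true_acts, false_acts):
    """Candidate direction: mean(true activations) minus mean(false activations)."""
    mu_true, mu_false = mean(true_acts), mean(false_acts)
    return [t - f for t, f in zip(mu_true, mu_false)]

def score(activation, direction, midpoint):
    """Signed projection (unnormalized) of the centered activation onto the direction."""
    return sum((a - m) * d for a, m, d in zip(activation, midpoint, direction))

# Invented toy activations, not taken from any model.
true_acts = [[2.0, 0.1], [1.8, -0.2], [2.2, 0.3]]
false_acts = [[-2.0, 0.2], [-1.9, -0.1], [-2.1, 0.0]]

direction = difference_of_means(true_acts, false_acts)
midpoint = mean(true_acts + false_acts)

# A positive score is read as "true", a non-positive score as "false".
predictions = [score(a, direction, midpoint) > 0 for a in true_acts + false_acts]
```

On this toy data the two classes separate cleanly along the estimated direction. On real activations, whether such a direction exists and generalizes is the empirical question the hypothesis poses.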

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
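
A rough sketch of how such a candidate list could be produced, assuming each entity has an embedding vector and that typed edges are stored as a set of neighbor names. The embeddings, the entity names, and `untyped_neighbors` are hypothetical; only the 0.65 threshold comes from the heading above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities at or above the threshold that have no typed edge to the target."""
    results = []
    for name, vec in embeddings.items():
        if name == target or name in typed_edges:
            continue  # skip the entity itself and already-linked neighbors
        sim = cosine(embeddings[target], vec)
        if sim >= threshold:
            results.append((name, round(sim, 3)))
    # Highest similarity first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Hypothetical 2-dimensional embeddings for illustration only.
embeddings = {
    "target": [1.0, 0.0],
    "near": [1.0, 0.5],
    "diagonal": [1.0, 1.0],
    "orthogonal": [0.0, 1.0],
    "already_linked": [1.0, 0.1],
}
candidates = untyped_neighbors("target", embeddings, typed_edges={"already_linked"})
```

High similarity alone does not imply a real relation; embedding proximity can reflect shared wording rather than shared meaning, so each candidate still needs review.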

Linear Representation Hypothesis · framework · 0.857
The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
Linear Representation of Concepts in LLMs · concept · 0.841
The finding that interpretable concepts including character traits are encoded as linear directions in transformer residual streams
Non-Linear Representation Dilemma · concept · 0.809
Core contribution: the impasse where lifting linearity in alignment maps makes causal abstraction vacuous, but keeping it may miss non-linearly encoded features
logical representation of programs · concept · 0.808
The idea that programs can be expressed as logical sentences, enabling direct deductive verification.
linearity · concept · 0.807
The sequential, continuous order of text, often challenged by diagrammatic branching.
Non-Linear Representation Hypothesis · concept · 0.803
Hypothesis that information may be encoded in arbitrary non-linear subspaces of a neural network
Linear Decoding · method · 0.802
Correlative technique measuring the type of information encoded in distributed representations via linear predictability.
Linear Map (a ⊸ b) · framework · 0.798
Semantic domain for linear transformations; denotation as actual linear function; Category instance generated from homomorphism principle.