Linear Representation of Concepts in LLMs

The finding that interpretable concepts, including character traits, are encoded as linear directions in transformer residual streams.

Neighborhood — ranked by edge-count

paper

concept

Non-Linear Representations in LLMs
related_to
Recent work identifying cases where LLM features are not one-dimensionally linear, a caveat to the linearity hypothesis.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Linear representation · concept · 0.841
The idea that features are encoded as directions in activation space.
LLM Internal Representations · concept · 0.836
High-dimensional vectors produced at each transformer layer for each input token; the primary substrate analyzed in this study.
Linear World Models in LLMs · framework · 0.825
Framework from prior work studying whether LLMs encode world models as linear structures in their representations.
LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on the "likely" dataset performing poorly on anti-correlated datasets · claim · 0.817
Establishes that the observed linear structure is not merely a representation of text probability.
Linear Representation of Features · concept · 0.812
The central object of study — the idea that a concept like truth is encoded as a direction in the LLM's latent space
LLMs internalize deeply integrated representations of high-order concepts · claim · 0.807
The authors' interpretive assertion based on their steering results.
As LLMs scale, they develop increasingly general abstractions, with large models linearly representing abstract concepts like truth that capture shared properties of diverse inputs · claim · 0.801
Interpretive claim connecting scale to abstraction level in LLM representations.
Gender Representation in LLMs · concept · 0.796
The case study target in Section 4: localizing gender information in hidden representations of Pythia-6.9B
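
Several entries above describe a concept as a direction in activation space, read out with probes and manipulated by steering. The sketch below illustrates that idea on synthetic data only. It is not the method of any paper listed here: the difference-of-means estimator, the planted direction `true_dir`, the helper names, and the Gaussian "activations" are all assumptions made for illustration. With a real model, the vectors would come from residual-stream activations instead.

```python
import math
import random


def unit(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_direction(pos, neg):
    """Difference-of-means estimate: unit direction plus the class midpoint."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    direction = unit([p - q for p, q in zip(mu_pos, mu_neg)])
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    return direction, midpoint


def score(x, direction, midpoint):
    """Signed projection of x onto the direction, measured from the midpoint."""
    return dot([a - m for a, m in zip(x, midpoint)], direction)


def synthetic_activations(n, true_dir, offset, rng):
    """Hypothetical stand-in for activations: unit Gaussian noise shifted along true_dir."""
    return [[rng.gauss(0.0, 1.0) + offset * t for t in true_dir] for _ in range(n)]


rng = random.Random(0)
dim = 16
true_dir = unit([1.0] + [0.0] * (dim - 1))  # planted concept direction (assumption)
pos = synthetic_activations(200, true_dir, +2.0, rng)  # concept present
neg = synthetic_activations(200, true_dir, -2.0, rng)  # concept absent

direction, midpoint = concept_direction(pos, neg)

# Linear readout: the sign of the projection decides the class.
# Accuracy is measured on the same points used to fit the direction.
correct = sum(score(x, direction, midpoint) > 0 for x in pos)
correct += sum(score(x, direction, midpoint) <= 0 for x in neg)
accuracy = correct / (len(pos) + len(neg))

# Steering: adding alpha * direction moves the projection by alpha,
# because the direction has unit length.
alpha = 3.0
x = neg[0]
steered = [a + alpha * d for a, d in zip(x, direction)]
shift = score(steered, direction, midpoint) - score(x, direction, midpoint)

print(f"cosine with planted direction: {dot(direction, true_dir):.3f}")
print(f"readout accuracy: {accuracy:.3f}")
print(f"projection shift after steering: {shift:.3f}")
```

The same two operations, projecting onto a direction and adding a multiple of it, are what "linear" means in the entries above. Features that are not one-dimensionally linear would not be recovered by a single direction like this one.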