concept
active
concept:linear-representation

Linear representation

The idea that features are encoded as directions in activation space.
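
As a rough illustration of the definition above, the sketch below treats an activation as a weighted sum of feature directions and reads a feature back with a dot product. The directions, the 3-dimensional space, and the helper names (`encode`, `read_feature`) are assumptions made for this example; nothing here comes from a specific model.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

# Hypothetical, orthogonal feature directions in a 3-dimensional activation space.
direction_a = normalize([1.0, 0.0, 0.0])
direction_b = normalize([0.0, 1.0, 0.0])

def encode(strength_a, strength_b):
    """Build an activation as a weighted sum of the two feature directions."""
    return [strength_a * a + strength_b * b
            for a, b in zip(direction_a, direction_b)]

def read_feature(activation, direction):
    """Recover a feature's strength by projecting onto its unit direction."""
    return dot(activation, direction)

activation = encode(2.0, -0.5)
strength_a = read_feature(activation, direction_a)
strength_b = read_feature(activation, direction_b)
```

Because the two assumed directions are orthogonal, each projection returns exactly the strength that was written in; with non-orthogonal directions the readouts would interfere.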

Neighborhood — ranked by edge-count

Thinkers (1)

thinker
  • Neel Nanda
    studies
    External commenter; resolved an apparent counterexample to the linear representation hypothesis

Frameworks (1)

framework
  • Core theoretical framework: neural networks represent more features than neurons by encoding features as directions in superposition
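
A toy sketch of the "more features than neurons" claim, under the assumption of three features sharing a 2-dimensional space. The 120-degree layout and the ReLU clean-up step are illustrative choices for this example, not a description of any particular model.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Three hypothetical feature directions packed into a 2-neuron space,
# spaced 120 degrees apart, so no pair is orthogonal.
angles = [0.0, 2.0 * math.pi / 3.0, 4.0 * math.pi / 3.0]
directions = [[math.cos(t), math.sin(t)] for t in angles]

def encode(strengths):
    """Sum the feature directions, weighted by each feature's strength."""
    return [sum(s * d[i] for s, d in zip(strengths, directions))
            for i in range(2)]

def readout(activation):
    """Project the activation onto every feature direction."""
    return [dot(activation, d) for d in directions]

def relu(values):
    """Clip negative readouts to zero."""
    return [max(0.0, v) for v in values]

activation = encode([1.0, 0.0, 0.0])   # only the first feature is active
raw = readout(activation)              # inactive features show interference
cleaned = relu(raw)                    # the nonlinearity removes it here
```

With only the first feature active, the raw readouts for the other two features are about -0.5 each, which is the interference cost of sharing dimensions; clipping at zero removes it in this sparse case.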

Methods (1)

method
  • Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.

Concepts (2)

concept
  • The central object of study — the idea that a concept like truth is encoded as a direction in the LLM's latent space
  • Truth Direction
    associated_with
    A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements
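
One common way to estimate such a direction is the difference between the mean activation on true statements and the mean activation on false ones. The sketch below assumes that approach and uses made-up 2-dimensional activations; the data and the names `score` and `predict_true` are hypothetical and not taken from the entities listed here.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Made-up activations for statements labeled true and false (illustration only).
true_acts = [[1.0, 0.2], [0.8, -0.1], [1.2, 0.1]]
false_acts = [[-1.0, 0.1], [-0.9, -0.2], [-1.1, 0.3]]

mu_true = mean(true_acts)
mu_false = mean(false_acts)

# Candidate truth direction: difference of the two class means.
direction = [t - f for t, f in zip(mu_true, mu_false)]
# Midpoint between the class means, used as the decision threshold.
midpoint = [(t + f) / 2.0 for t, f in zip(mu_true, mu_false)]

def score(activation):
    """Signed projection of the centered activation onto the direction."""
    return sum((a - m) * d
               for a, m, d in zip(activation, midpoint, direction))

def predict_true(activation):
    """Classify a statement as true when its score is positive."""
    return score(activation) > 0.0
```

On this synthetic data the two classes separate cleanly along the first coordinate, so the sign of the projection recovers every label; real activations would need held-out evaluation before reading the direction as causal.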

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
  • The finding that interpretable concepts including character traits are encoded as linear directions in transformer residual streams
  • Core contribution: the impasse where lifting linearity in alignment maps makes causal abstraction vacuous, but keeping it may miss non-linearly encoded features
  • The idea that programs can be expressed as logical sentences, enabling direct deductive verification.
  • linearity · concept · 0.807
    The sequential, continuous order of text, often challenged by diagrammatic branching.
  • Hypothesis that information may be encoded in arbitrary non-linear subspaces of a neural network
  • Linear Decoding · method · 0.802
    Correlative technique measuring the type of information encoded in distributed representations via linear predictability.
  • Linear Map (a ⊸ b) · framework · 0.798
    Semantic domain for linear transformations; denotation as actual linear function; Category instance generated from homomorphism principle.