framework
active
framework:linear-representation-hypothesis

Linear Representation Hypothesis

The hypothesis that models represent concepts as approximately linear directions in representation space; used to interpret MDS injection behavior.

Neighborhood — ranked by edge-count

Thinkers (1)

thinker
  • Kiho Park
    introduces
    Formalized the Linear Representation Hypothesis in the ICML 2024 paper

Methods (6)

method
  • Causal intervention technique: edit the NLA explanation, reconstruct via AR, and use the difference as a steering vector to manipulate model behavior.
  • The core method introduced in this paper: finds alignments between high-level causal variables and distributed neural representations via gradient descent.
  • Core intervention method used throughout CausalGym; operates on a one-dimensional, non-basis-aligned subspace of activation space
  • Method for extracting linear directions by subtracting mean activations of contrastive groups; used to define the Assistant Axis
  • Alignment map ϕ(h) = W_orth · h, where W_orth is an orthogonal matrix; assumes the Linear Representation Hypothesis
  • Linear Probing
    implements
    Used to evaluate representation quality across VTAB tasks
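
The difference-of-means method listed above (subtracting the mean activations of contrastive groups) can be illustrated with a minimal sketch. This is not the implementation from any of the linked papers. The function names (`difference_of_means`, `project`, `steer`), the synthetic "activations", and the scale `alpha` are all hypothetical, chosen only to show the idea under the assumption that a concept lies along a single linear direction.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(pos, neg):
    """Unit-norm direction pointing from the mean of `neg` to the mean of `pos`."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    diff = [a - b for a, b in zip(mu_pos, mu_neg)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0:
        raise ValueError("group means coincide; no direction is defined")
    return [d / norm for d in diff]


def project(h, direction):
    """Scalar coordinate of activation `h` along `direction`."""
    return sum(a * b for a, b in zip(h, direction))


def steer(h, direction, alpha):
    """Shift activation `h` by `alpha` units along `direction`."""
    return [a + alpha * b for a, b in zip(h, direction)]


# Synthetic stand-in for model activations (an assumption for illustration):
# the concept is planted along the first coordinate, plus small Gaussian noise.
rng = random.Random(0)
dim = 8
concept = [1.0] + [0.0] * (dim - 1)


def sample(sign):
    return [sign * 2.0 * c + rng.gauss(0, 0.1) for c in concept]


pos = [sample(+1) for _ in range(50)]  # examples that express the concept
neg = [sample(-1) for _ in range(50)]  # contrastive examples that do not

direction = difference_of_means(pos, neg)
steered = steer(neg[0], direction, alpha=4.0)
```

Under the linear assumption, projecting onto `direction` separates the two groups, and adding a multiple of `direction` moves a contrastive example toward the concept. If the concept is not encoded linearly, neither property is guaranteed.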

Concepts (4)

concept

Claims (3)

claim

Frameworks (3)

framework
  • A class of methods that modify how models internally process representations; SOO fine-tuning fits within this framework
  • The central framework this paper extends from refusal to propositional truth; identifies multi-dimensional subspaces that causally mediate target behaviors
  • Framework by Lee et al. explaining self-correction via linear latent concept directions, closely related prior work.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.