Concept Direction in Representation Space

A vector in activation space aligned with a behavioral concept; the core object manipulated by representation engineering (RepE) methods

Neighborhood — ranked by edge-count

framework

Linear Representation Hypothesis
cites
The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior

method

MDS Injection
implements
Mean-difference vectors derived from self-statement activations (h_s); best-performing injection method in open-ended generation
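
A minimal sketch of the mean-difference idea described above, in plain Python. It is an illustration under stated assumptions, not the MDS Injection implementation itself: the contrast set (`neg_acts`), the unit-normalization step, the scaling factor `alpha`, and all function names are hypothetical and do not come from the source entry.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_difference_direction(pos_acts, neg_acts):
    """Unit-norm difference between the means of two activation sets.

    pos_acts: activations for inputs expressing the concept (for example,
              self-statement activations h_s).
    neg_acts: activations for a contrast set (an assumption of this sketch).
    """
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("both activation sets have the same mean")
    return [d / norm for d in diff]


def inject(hidden, direction, alpha):
    """Shift one hidden state along the direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy two-dimensional activations, chosen only to make the arithmetic visible.
direction = mean_difference_direction([[2.0, 0.0], [4.0, 0.0]],
                                      [[0.0, 0.0], [0.0, 0.0]])
steered = inject([1.0, 1.0], direction, alpha=2.0)
```

In a real model the same addition would typically be applied to a chosen layer's hidden states during the forward pass; which layer and what strength work best are empirical questions this sketch does not address.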

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

concept representation · concept · 0.820
How a neural network encodes a semantic concept internally, argued to be better captured by manifolds than by atomic features.
Direction (activation space) · concept · 0.777
A linear combination of neurons in a layer; the general form of a neural network feature including both individual neurons and other combinations
Concept · concept · 0.767
Central entity of Jackson's framework: a structure invented to give a coherent account of the immediate consequences of actions; the building block of software design
Concept Algebra · framework · 0.760
Probabilistic framework formalizing concept-specific subspaces for targeted steering in generative models.
concept geometry · concept · 0.759
The spatial/geometric organization of conceptual structure within neural network representations; central to the paper's thesis.
Reflection direction · concept · 0.756
A direction in the model's representation space that governs self-reflection behavior, computed as mean difference between reflection and non-reflection embeddings
Concept Steering · method · 0.754
Latent intervention technique that manipulates sparse features to steer model predictions toward desired concepts.
Representational Alignment · concept · 0.749
Measure of similarity between the similarity structures (kernels) induced by two different representations
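
A list like the one above can be produced by a simple threshold filter over embedding similarities. The sketch below shows one way to do that in plain Python. It is hypothetical: how entities are embedded is not stated in this entry, and the function names, the dictionary layout, and the toy vectors are assumptions made for illustration. Only the 0.65 cutoff and the "no typed edge" condition come from the text.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_candidates(query_vec, embeddings, typed_names, threshold=0.65):
    """Entities at or above the threshold that have no typed edge yet.

    embeddings:  mapping of entity name to embedding vector.
    typed_names: names already linked to the query entity by a typed edge.
    Returns (name, score) pairs sorted by descending score.
    """
    scored = [
        (name, cosine(query_vec, vec))
        for name, vec in embeddings.items()
        if name not in typed_names
    ]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy embeddings; "d" is excluded because it already has a typed edge.
candidates = untyped_candidates(
    [1.0, 0.0],
    {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [2.0, 0.0]},
    typed_names={"d"},
)
```

Each surviving pair is a candidate for either a new typed edge or a merge with an existing duplicate entity; the filter alone cannot tell which.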