concept
active
concept:concept-direction-in-representation-space
Concept Direction in Representation Space
A vector in activation space aligned with a behavioral concept; core object manipulated by RepE methods
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
Methods (1)
method
- MDS Injection (implements): Mean-difference vectors derived from self-statement activations (h_s); best-performing injection method in open-ended generation
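The MDS Injection entry describes a mean-difference vector, so a small sketch may help make the idea concrete. The version below is illustrative only: it assumes the direction is the difference between the mean of the self-statement activations (h_s) and the mean of some contrast set, and that injection adds a scaled copy of that direction to a hidden state. The contrast set `h_baseline`, the scale `alpha`, and all function names are hypothetical and do not come from the source.

```python
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mean_difference_direction(positive: List[Vector],
                              negative: List[Vector]) -> Vector:
    """Direction pointing from the negative-set mean to the positive-set mean."""
    mu_pos = mean(positive)
    mu_neg = mean(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def inject(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the direction by a scale alpha (assumed form)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional activations; real activations would come from a model layer.
h_s = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]         # self-statement activations
h_baseline = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]  # hypothetical contrast set

direction = mean_difference_direction(h_s, h_baseline)
steered = inject([0.5, 0.5, 0.5], direction, alpha=2.0)

print(direction)  # [2.0, 1.0, 0.0]
print(steered)    # [4.5, 2.5, 0.5]
```

In practice the same arithmetic would run on tensors at a chosen layer; plain lists are used here only to keep the sketch dependency-free.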
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- How a neural network encodes a semantic concept internally, argued to be better captured by manifolds than by atomic features.
- A linear combination of neurons in a layer; the general form of a neural network feature including both individual neurons and other combinations
- Central entity of Jackson's framework: a structure invented to give coherent account of immediate consequences of actions; the building block of software design
- Probabilistic framework formalizing concept-specific subspaces for targeted steering in generative models.
- The spatial/geometric organization of conceptual structure within neural network representations; central to the paper's thesis.
- A direction in the model's representation space that governs self-reflection behavior, computed as mean difference between reflection and non-reflection embeddings
- Latent intervention technique that manipulates sparse features to steer model predictions toward desired concepts.
- Measure of similarity between the similarity structures (kernels) induced by two different representations
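The selection rule for this list (cosine ≥ 0.65 and no typed edge) can be sketched as a simple filter. This is a minimal illustration, not the page's actual implementation: the entity names, the 2-dimensional embeddings, and the `typed_edges` set below are all hypothetical, and only the 0.65 threshold is taken from the page.

```python
import math
from typing import Dict, List, Set, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_without_edge(target: str,
                         embeddings: Dict[str, Vector],
                         typed_edges: Set[Tuple[str, str]],
                         threshold: float = 0.65) -> List[Tuple[str, float]]:
    """Entities at or above the threshold that share no typed edge with target."""
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked by a typed edge in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical entity embeddings and typed edges, for illustration only.
embeddings = {
    "concept-direction": [1.0, 0.0],
    "mds-injection": [1.0, 0.1],
    "reflection-direction": [0.8, 0.6],
    "unrelated": [0.0, 1.0],
}
typed_edges = {("mds-injection", "concept-direction")}

neighbors = similar_without_edge("concept-direction", embeddings, typed_edges)
print([name for name, _ in neighbors])  # ['reflection-direction']
```

Here `mds-injection` is highly similar but excluded because it already has a typed edge, and `unrelated` falls below the threshold.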