concept
active
concept:concept-direction-in-representation-space

Concept Direction in Representation Space

A vector in activation space aligned with a behavioral concept; the core object manipulated by representation engineering (RepE) methods.

Neighborhood — ranked by edge-count

Frameworks (1)

framework
  • The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior

Methods (1)

method
  • MDS Injection
    implements
    Mean-difference vectors derived from self-statement activations (h_s); best-performing injection method in open-ended generation
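
The mean-difference construction named above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the source's implementation: it assumes activations have already been collected into two arrays of shape (n_examples, d_model), one for statements expressing the concept and one for a contrast set; it normalizes the direction to unit length; and it injects by adding a scaled copy to each hidden state. The names mean_difference_direction, inject, and alpha are hypothetical, and the data is synthetic.

```python
import numpy as np


def mean_difference_direction(pos_acts, neg_acts):
    """Unit-norm difference of class means; inputs have shape (n_examples, d_model)."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0:
        raise ValueError("class means coincide; no direction to extract")
    # Normalizing is an assumption of this sketch, so alpha has a fixed scale.
    return diff / norm


def inject(hidden, direction, alpha):
    """Add alpha * direction to every hidden state (broadcasts over leading axes)."""
    return hidden + alpha * direction


# Synthetic stand-in for real activations: the concept lives along axis 3.
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[3] = 1.0
pos = rng.normal(size=(200, d_model)) + 2.0 * true_dir
neg = rng.normal(size=(200, d_model)) - 2.0 * true_dir

v = mean_difference_direction(pos, neg)

# Steer five hypothetical hidden states along the extracted direction.
h = rng.normal(size=(5, d_model))
h_steered = inject(h, v, alpha=4.0)

# Because v has unit norm, the projection of each state onto v moves by alpha.
shift = (h_steered - h) @ v
```

In a real model the two activation sets would come from a chosen layer and token position, and alpha would be tuned empirically; neither choice is specified in this entry.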

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • How a neural network encodes a semantic concept internally, argued to be better captured by manifolds than by atomic features.
  • A linear combination of neurons in a layer; the general form of a neural network feature including both individual neurons and other combinations
  • Concept · concept · 0.767
    Central entity of Jackson's framework: a structure invented to give a coherent account of the immediate consequences of actions; the building block of software design
  • Concept Algebra · framework · 0.760
    Probabilistic framework formalizing concept-specific subspaces for targeted steering in generative models.
  • concept geometry · concept · 0.759
    The spatial/geometric organization of conceptual structure within neural network representations; central to the paper's thesis.
  • A direction in the model's representation space that governs self-reflection behavior, computed as mean difference between reflection and non-reflection embeddings
  • Concept Steering · method · 0.754
    Latent intervention technique that manipulates sparse features to steer model predictions toward desired concepts.
  • Measure of similarity between the similarity structures (kernels) induced by two different representations
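
The last entry above describes comparing the similarity structures (kernels) induced by two representations. One widely used measure of this kind is linear centered kernel alignment (CKA); whether the entry refers to CKA specifically is an assumption, since its name is not shown here. The sketch below uses the feature-space form of linear CKA on synthetic data, and the function name linear_cka is hypothetical.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between representations X (n, d1) and Y (n, d2) of the same n inputs."""
    # Center each feature so the induced linear kernels are centered.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Feature-space form: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F).
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))

# A rotated and rescaled copy of X induces the same similarity structure.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
rotated = 3.0 * X @ Q

# An independent random representation shares little structure with X.
unrelated = rng.normal(size=(100, 8))

same_score = linear_cka(X, rotated)
diff_score = linear_cka(X, unrelated)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either representation, so the rotated copy scores 1 up to floating-point error while the unrelated representation scores lower.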