concept
active
concept:direction-activation-space

Direction (activation space)

A linear combination of neurons in a layer. This is the general form of a neural network feature: it covers both individual neurons and any other weighted combination of them.
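
As a minimal sketch of the definition above, assume a layer's activations are a plain list of floats. The names `feature_value`, `activations`, `neuron_2`, and `mixed` are illustrative and do not come from the source. The value of a feature along a direction is then the weighted sum of the neuron activations, and a single neuron is the special case where the direction is one-hot.

```python
def feature_value(activations, direction):
    """Weighted sum of neuron activations: the linear combination a direction defines."""
    if len(activations) != len(direction):
        raise ValueError("activations and direction must have the same length")
    return sum(a * w for a, w in zip(activations, direction))


# Hypothetical 4-neuron layer (values chosen only for illustration).
activations = [0.5, -1.0, 2.0, 0.0]

# An individual neuron corresponds to a one-hot direction.
neuron_2 = [0.0, 0.0, 1.0, 0.0]

# Any other weighting of the neurons is also a direction.
mixed = [0.5, 0.5, 0.0, 0.0]

single = feature_value(activations, neuron_2)  # reads out neuron 2 only
combo = feature_value(activations, mixed)      # mixes neurons 0 and 1
```

Here `single` equals the activation of neuron 2, while `combo` depends on two neurons at once, which is the sense in which a direction generalizes a neuron.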

Neighborhood — ranked by edge-count

Concepts (2)

concept
  • Representation space on which linear probes operate to attribute harmful behaviors to training data.
  • Scalar function of the input corresponding to a direction in the vector space of neuron activations; claimed to be the fundamental unit of neural networks.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
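
The cosine threshold above can be illustrated with a small sketch. It assumes each entity has an embedding vector, which the source does not specify. The function names, the toy 3-dimensional vectors, and the entity names `close`, `weak`, and `orthogonal` are all hypothetical. Only the 0.65 cutoff is taken from this page.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similar_candidates(query, others, threshold=0.65):
    """Return (name, score) pairs at or above the threshold, highest score first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in others.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings, chosen so the arithmetic is easy to check by hand.
query = [1.0, 0.0, 0.0]
others = {
    "close": [4.0, 3.0, 0.0],       # cosine 4/5, above the cutoff
    "weak": [3.0, 4.0, 0.0],        # cosine 3/5, below the cutoff
    "orthogonal": [0.0, 1.0, 0.0],  # cosine 0
}

candidates = similar_candidates(query, others)
```

Only `close` survives the filter in this toy example. Whether a surviving candidate deserves a typed edge or is a duplicate is a separate judgment that the score alone does not settle.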

  • Rich geometric structure carried by neural representations.
  • Activations · concept · 0.785
    Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
  • A vector in activation space aligned with a behavioral concept; the core object manipulated by representation engineering (RepE) methods.
  • Behavior Space · concept · 0.764
    A geometric space of all output token probability distributions, equipped with Hellinger distance, used to visualize model behavior.
  • The general experimental approach of intervening along geometrically defined paths rather than through single-point or linear-direction interventions.
  • Spaces of model activations from which sparse features are retrieved.
  • Intervention method that adds a learned direction vector to residual stream activations to steer model behavior.
  • The low-dimensional geometric structure discovered in neural activation space; contrasted with linear/Euclidean geometry.
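
One of the related entries above describes adding a learned direction vector to residual stream activations. The following is a minimal sketch of that arithmetic only. It assumes activations are plain lists of floats and that the direction has already been learned elsewhere. The function name `steer` and the `scale` coefficient are assumptions for illustration and are not named in the source.

```python
def steer(residual, direction, scale=1.0):
    """Return a new activation vector shifted along `direction` by `scale`."""
    if len(residual) != len(direction):
        raise ValueError("residual and direction must have the same length")
    # Build a new list so the original activations are left unchanged.
    return [r + scale * d for r, d in zip(residual, direction)]


# Hypothetical 3-dimensional residual stream activation and direction.
residual = [1.0, 2.0, 3.0]
direction = [0.0, 1.0, 0.0]

steered = steer(residual, direction, scale=2.0)
```

In this toy case only the component aligned with `direction` changes. In a real model the same addition would be applied inside the forward pass, which this sketch does not attempt to show.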