concept
active
concept:direction-activation-space
Direction (activation space)
A linear combination of the neurons in a layer. This is the general form of a neural network feature: it covers both individual neurons and arbitrary combinations of neurons.
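As a minimal sketch of this definition, the feature value along a direction can be read as the dot product between a layer's activations and the direction's coefficients. The function name `feature_value` and the 4-neuron activations are hypothetical, chosen only for illustration; a single neuron shows up as the special case of a one-hot direction.

```python
from typing import Sequence


def feature_value(activations: Sequence[float], direction: Sequence[float]) -> float:
    """Return the scalar read off along `direction`: a dot product with the activations."""
    if len(activations) != len(direction):
        raise ValueError("activations and direction must have the same length")
    return sum(a * d for a, d in zip(activations, direction))


# Hypothetical activations of a 4-neuron layer for one input.
activations = [0.5, -1.0, 2.0, 0.0]

# An individual neuron is a one-hot (basis) direction.
neuron_2 = [0.0, 0.0, 1.0, 0.0]

# A general direction mixes several neurons.
mixed = [0.5, 0.5, 0.0, 0.0]

single_neuron_feature = feature_value(activations, neuron_2)
mixed_feature = feature_value(activations, mixed)
```

Under this reading, "individual neurons and other combinations" differ only in the coefficient vector passed as `direction`.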
Neighborhood — ranked by edge-count
Concepts (2)
- Activation space · related_to · Representation space on which linear probes operate to attribute harmful behaviors to training data.
- Feature (neural network) · associated_with · Scalar function of the input corresponding to a direction in the vector space of neuron activations; claimed to be the fundamental unit of neural networks.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Rich geometric structure carried by neural representations.
- Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
- A vector in activation space aligned with a behavioral concept; core object manipulated by RepE methods
- A geometric space of all output token probability distributions, equipped with Hellinger distance, used to visualize model behavior.
- The general experimental approach of intervening along geometrically-defined paths rather than single-point or linear-direction interventions
- Spaces of model activations from which sparse features are retrieved.
- Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
- The low-dimensional geometric structure discovered in neural activation space; contrasted with linear/Euclidean geometry.
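The selection rule for this list (cosine ≥ 0.65, no typed edge) could be sketched as follows. This assumes that "cosine" means cosine similarity between entity embedding vectors; the embeddings, entity names, and the helper `similar_without_edge` are all hypothetical and are not taken from the system that produced this page.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; defined as 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    typed_edges: Set[str],
    threshold: float = 0.65,
) -> List[str]:
    """Names of candidates at or above the threshold that have no typed edge, most similar first."""
    scored = [
        (cosine_similarity(query, vec), name)
        for name, vec in candidates.items()
        if name not in typed_edges
    ]
    kept = [(score, name) for score, name in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in kept]


# Hypothetical 2-d embeddings, for illustration only.
query_embedding = [1.0, 0.0]
candidate_embeddings = {
    "a": [1.0, 0.5],  # similar, no typed edge: kept
    "b": [0.0, 1.0],  # orthogonal: below threshold
    "c": [1.0, 1.0],  # similar, but already has a typed edge: excluded
}
neighbors = similar_without_edge(query_embedding, candidate_embeddings, typed_edges={"c"})
```

Entities returned this way are only candidates; whether one deserves a new typed edge or is a duplicate still needs review.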