Direction (activation space)

A linear combination of the neurons in a layer. This is the general form of a neural network feature: it covers both individual neurons and other combinations of neurons.
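As a minimal sketch of this definition, the snippet below reads out a layer's activations along a direction with a dot product. The function name `project`, the 4-neuron layer, and all numeric values are hypothetical and chosen only for illustration. A single neuron appears as the special case of a one-hot direction.

```python
def project(activations, direction):
    """Return the scalar readout of `activations` along `direction` (a dot product)."""
    if len(activations) != len(direction):
        raise ValueError("activations and direction must have the same length")
    return sum(a * d for a, d in zip(activations, direction))


# Hypothetical activations of a 4-neuron layer for one input.
activations = [0.5, -1.0, 2.0, 0.0]

# An individual neuron is a one-hot (basis) direction.
neuron_2 = [0.0, 0.0, 1.0, 0.0]

# A general direction mixes several neurons.
mixed = [0.5, 0.5, 0.0, 0.0]

neuron_readout = project(activations, neuron_2)  # same value as activations[2]
mixed_readout = project(activations, mixed)      # 0.5 * 0.5 + 0.5 * (-1.0)
```

Here `neuron_readout` simply recovers the third neuron's activation, while `mixed_readout` is a weighted sum of the first two neurons, which is what makes a direction more general than a single neuron.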

Neighborhood — ranked by edge-count

concept

Activation space
related_to
Representation space on which linear probes operate to attribute harmful behaviors to training data.
Feature (neural network)
associated_with
A scalar function of the input corresponding to a direction in the vector space of neuron activations; claimed to be the fundamental unit of neural networks.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

geometry of activation space · concept · 0.807
Rich geometric structure carried by neural representations.
Activations · concept · 0.785
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
Concept Direction in Representation Space · concept · 0.777
A vector in activation space aligned with a behavioral concept; core object manipulated by RepE methods.
Behavior Space · concept · 0.764
A geometric space of all output token probability distributions, equipped with Hellinger distance, used to visualize model behavior.
Path-Based Activation Intervention · method · 0.760
The general experimental approach of intervening along geometrically defined paths rather than single-point or linear-direction interventions.
Sparse Activation Spaces · concept · 0.758
Spaces of model activations from which sparse features are retrieved.
Activation Addition · method · 0.757
Intervention method that adds a learned direction vector to residual stream activations to steer model behavior.
Activation Manifold · concept · 0.751
The low-dimensional geometric structure discovered in neural activation space; contrasted with linear/Euclidean geometry.
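
The neighborhood filter can be sketched as follows, assuming the scores in the list above are cosine similarities between entity embedding vectors. The embedding model is not specified here, so the vectors, the entity names, and the helper names `cosine` and `neighbors` are all hypothetical; only the 0.65 threshold comes from this page.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def neighbors(query, candidates, threshold=0.65):
    """Return (name, score) pairs at or above `threshold`, highest score first."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-dimensional embeddings; real embeddings would be much larger.
query = [1.0, 0.0, 0.0]
candidates = {
    "close": [0.9, 0.1, 0.0],       # nearly parallel to the query
    "borderline": [0.6, 0.8, 0.0],  # cosine of about 0.6, below the threshold
    "orthogonal": [0.0, 1.0, 0.0],  # cosine of 0
}
result = neighbors(query, candidates)
```

With these toy vectors only `"close"` passes the threshold. Cosine similarity ignores vector length, so two entities count as neighbors when their embeddings point in similar directions regardless of magnitude.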