Sparse Activation Spaces

Spaces of model activations from which sparse features are retrieved.

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
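The selection rule above (cosine similarity of at least 0.65, with no typed edge to the current entity) can be sketched as follows. This is a minimal illustration, not the page's actual pipeline: the function name `untyped_neighbors`, its parameters, and the choice to sort results by descending similarity are all assumptions, and embeddings are assumed to be non-zero vectors.

```python
import numpy as np

def untyped_neighbors(query, candidates, typed_edges, threshold=0.65):
    """Return (name, cosine) pairs at or above threshold that lack a typed edge.

    query       -- 1-D embedding of the current entity (hypothetical input)
    candidates  -- dict mapping entity name to a 1-D embedding
    typed_edges -- set of names already linked to the entity by a typed relation
    """
    q = np.asarray(query, dtype=float)
    q = q / np.linalg.norm(q)  # assumes a non-zero embedding
    results = []
    for name, vec in candidates.items():
        if name in typed_edges:
            continue  # already related by a typed edge, so not a candidate
        v = np.asarray(vec, dtype=float)
        score = float(q @ (v / np.linalg.norm(v)))
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first (an assumed ordering).
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities that pass this filter are only candidates; whether one becomes a new typed edge or is merged as a duplicate is a separate decision that the filter does not make.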

Activation space · concept · 0.828
Representation space on which linear probes operate to attribute harmful behaviors to training data.
geometry of activation space · concept · 0.772
Rich geometric structure carried by neural representations.
Direction (activation space) · concept · 0.758
A linear combination of neurons in a layer; the general form of a neural network feature, which covers both individual neurons and other combinations.
Sparse Autoencoder · framework · 0.743
Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence.
Sparse circuits · concept · 0.741
A goal in mechanistic interpretability to identify sparse computational subgraphs; VPD promotes sparse parameter circuits.
Behavior Space · concept · 0.740
A geometric space of all output token probability distributions, equipped with Hellinger distance, used to visualize model behavior.
Sparse Feature Dictionary · concept · 0.733
The set of sparse, interpretable features extracted from model embeddings via SAEs.
Sparse Autoencoder Training on Layer-40 Activations · method · 0.730
SAEs trained on 100M+ tokens to compress per-token layer-40 activations into 64 active features out of 100K+ for interpretability analysis.
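
To make "64 active features out of 100K+" concrete, the sketch below keeps the largest entries of a feature vector and zeroes the rest. The entry does not say how sparsity is enforced, so the top-k rule, the ReLU step, and the name `topk_sparse_code` are assumptions made purely for illustration.

```python
import numpy as np

def topk_sparse_code(pre_activations, k=64):
    """Keep the k largest non-negative entries of a vector and zero the rest."""
    # ReLU: negative pre-activations never count as active features.
    z = np.maximum(np.asarray(pre_activations, dtype=float), 0.0)
    if k >= z.size:
        return z
    keep = np.argpartition(z, -k)[-k:]  # indices of the k largest values
    code = np.zeros_like(z)
    code[keep] = z[keep]
    return code

# Usage: a random 100,000-dimensional vector reduced to at most 64 active features.
rng = np.random.default_rng(0)
code = topk_sparse_code(rng.normal(size=100_000), k=64)
```

Under this assumed rule, each token is described by at most 64 non-zero coordinates in the dictionary's feature basis, which is what makes individual features practical to inspect.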