concept
active
concept:sparse-activation-spaces
Sparse Activation Spaces
Spaces of model activations from which sparse features are retrieved.
Neighborhood — ranked by edge-count
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one. These are candidates for new edges or unrecognized duplicates.
- Representation space on which linear probes operate to attribute harmful behaviors to training data.
- Rich geometric structure carried by neural representations.
- A linear combination of neurons in a layer; the general form of a neural network feature, covering both individual neurons and other combinations.
- Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence.
- A goal in mechanistic interpretability to identify sparse computational subgraphs; VPD promotes sparse parameter circuits.
- A geometric space of all output token probability distributions, equipped with Hellinger distance, used to visualize model behavior.
- The set of sparse, interpretable features extracted from model embeddings via SAEs.
- SAEs trained on 100M+ tokens to compress per-token layer-40 activations into 64 active features out of 100K+ for interpretability analysis.
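
Two of the neighbors above describe operations that are easy to state concretely: keeping a fixed number of active features per activation vector (64 out of 100K+), and measuring Hellinger distance between output token probability distributions. The sketch below illustrates both in plain Python. It is a minimal illustration and not the method of any listed entity: the function names `top_k_sparsify` and `hellinger_distance` are hypothetical, and enforcing sparsity by top-k selection is an assumption, since the entries do not say how the 64 active features are chosen.

```python
import heapq
import math


def top_k_sparsify(pre_acts, k):
    """Keep the k largest entries of a feature vector and zero the rest.

    Assumption: sparsity is enforced by top-k selection. The source only
    states that 64 of 100K+ features are active, not how they are picked.
    """
    if not 0 < k <= len(pre_acts):
        raise ValueError("k must be between 1 and len(pre_acts)")
    # Indices of the k largest values.
    keep = heapq.nlargest(k, range(len(pre_acts)), key=pre_acts.__getitem__)
    sparse = [0.0] * len(pre_acts)
    for i in keep:
        sparse[i] = pre_acts[i]
    return sparse


def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions.

    H(p, q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2), which lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )


# Toy usage: 2 active features out of 5 instead of 64 out of 100K+.
sparse = top_k_sparsify([0.1, 2.0, -1.0, 0.5, 3.0], k=2)

# Identical distributions are at distance 0; disjoint ones are at distance 1.
same = hellinger_distance([0.5, 0.5], [0.5, 0.5])
disjoint = hellinger_distance([1.0, 0.0], [0.0, 1.0])
```

At realistic sizes an array library would replace the pure-Python loops; the logic stays the same.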