concept
active
concept:sparse-activation-spaces

Sparse Activation Spaces

Spaces of model activations from which sparse features are retrieved.

Neighborhood — ranked by edge-count

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Activation space · concept · 0.828
    Representation space on which linear probes operate to attribute harmful behaviors to training data.
  • Rich geometric structure carried by neural representations.
  • A linear combination of neurons in a layer; the general form of a neural network feature, covering both individual neurons and other combinations.
  • Sparse Autoencoder · framework · 0.743
    Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence.
  • Sparse circuits · concept · 0.741
    A goal in mechanistic interpretability to identify sparse computational subgraphs; VPD promotes sparse parameter circuits.
  • Behavior Space · concept · 0.740
    A geometric space of all output token probability distributions, equipped with Hellinger distance, used to visualize model behavior.
  • The set of sparse, interpretable features extracted from model embeddings via SAEs.
  • SAEs trained on 100M+ tokens to compress token-level layer-40 activations into 64 active features out of 100K+, for interpretability analysis.
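
The "64 active features out of 100K+" in the last entry describes a fixed sparsity budget per activation vector. The entry does not say how the active features are selected, so the sketch below assumes a simple top-k rule (keep the k largest feature activations, zero the rest) purely for illustration. The function name `top_k_sparse` and the toy vector are hypothetical, and the encoder that would produce the dense feature activations is omitted.

```python
import heapq


def top_k_sparse(activations, k):
    """Keep the k largest values in `activations` and set the rest to 0.0.

    Assumption: top-k selection stands in for whatever sparsity mechanism
    the SAE actually uses; the source only states the budget (64 of 100K+).
    """
    if k <= 0:
        return [0.0] * len(activations)
    # Indices of the k largest activations (ties resolved by heapq.nlargest).
    keep = set(
        heapq.nlargest(k, range(len(activations)), key=lambda i: activations[i])
    )
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]


# Toy example: 6 dense feature activations, budget of 2 active features.
dense = [0.1, 2.0, -0.5, 1.5, 0.0, 0.7]
sparse = top_k_sparse(dense, 2)
print(sparse)  # [0.0, 2.0, 0.0, 1.5, 0.0, 0.0]
```

With k = 64 and a feature dictionary of 100K+ entries, the same rule would leave well under 0.1% of features non-zero for each token.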