concept
active
concept:sparse-feature-dictionary

Sparse Feature Dictionary

The set of sparse, interpretable features extracted from model embeddings using sparse autoencoders (SAEs)

Neighborhood — ranked by edge-count

Frameworks (1)

framework
  • The central mechanistic interpretability tool applied across all three EEG transformers to extract sparse feature dictionaries

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • General method for finding overcomplete sparse decompositions; the paper uses sparse autoencoders as an approximation
  • Feature Sparsity · concept · 0.814
    Property that features activate on only a small fraction of inputs; enables compressed sensing and is what allows superposition to work
  • Mechanistic finding by Bricken et al. 2023 about how LLMs store features; cited as operational justification for pattern-repository assumption
  • Primary method introduced: trains a one-hidden-layer MLP with L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
  • Used in Anthropic's welfare assessment to identify co-activating features for performative behavior and hidden emotional struggle
  • Feature Density · concept · 0.797
    Fraction of training tokens on which a given feature has nonzero activation; used as proxy metric for autoencoder quality
  • Coding scheme where qualities are represented by few neurons with continuous similarity relations.
  • Sparse Autoencoder · framework · 0.773
    Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
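
Illustrative sketch

Several neighbors above describe the same mechanics: a one-hidden-layer autoencoder with an L1 sparsity penalty that decomposes activations into an overcomplete feature dictionary, and feature density as the fraction of inputs on which a feature is nonzero. The following is a minimal NumPy sketch of those two ideas, not the configuration used in any cited work. The ReLU encoder, the exact loss form, the shapes, the random stand-in data, and all function and variable names are assumptions made for illustration.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """One-hidden-layer autoencoder: ReLU encoder, linear decoder (assumed form)."""
    f = np.maximum(0.0, x @ W_enc + b_enc)  # feature activations, shape (n, m)
    x_hat = f @ W_dec + b_dec               # reconstruction, shape (n, d)
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff):
    """Mean squared reconstruction error plus an L1 penalty on activations."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1_coeff * sparsity

def feature_density(f):
    """Fraction of inputs (rows) on which each feature is nonzero."""
    return np.mean(f > 0, axis=0)

# Random stand-in for model activations; m > d makes the dictionary overcomplete.
rng = np.random.default_rng(0)
n, d, m = 256, 16, 64
x = rng.normal(size=(n, d))
W_enc = rng.normal(scale=0.1, size=(d, m))
b_enc = np.full(m, -0.5)  # negative bias pushes more activations to zero
W_dec = rng.normal(scale=0.1, size=(m, d))
b_dec = np.zeros(d)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat, l1_coeff=1e-3)
density = feature_density(f)
```

Here the weights are untrained, so the densities only demonstrate the computation. In a trained autoencoder, the rows of `W_dec` would play the role of the dictionary's feature directions.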