framework
active
framework:sparse-autoencoder-for-dictionary-learning

Sparse Autoencoder for Dictionary Learning

Primary method introduced: trains a one-hidden-layer MLP with an L1 sparsity penalty to decompose model activations into an overcomplete feature dictionary
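
The objective described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the original implementation: the ReLU nonlinearity, the squared-error reconstruction term, the `l1_coeff` value, and all function and variable names are assumptions made for the example.

```python
import random

def matvec(W, v):
    # W is a list of rows; returns the matrix-vector product W @ v.
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """Loss for a single activation vector x of length d_model.

    W_enc has d_dict rows of length d_model (one encoder vector per feature).
    W_dec has d_model rows of length d_dict (its columns are dictionary vectors).
    """
    pre = [h + b for h, b in zip(matvec(W_enc, x), b_enc)]
    f = [max(0.0, h) for h in pre]                     # non-negative feature activations (ReLU, assumed)
    x_hat = [r + b for r, b in zip(matvec(W_dec, f), b_dec)]
    recon = sum((a - r) ** 2 for a, r in zip(x, x_hat))  # squared reconstruction error
    sparsity = sum(abs(a) for a in f)                    # L1 penalty on feature activations
    return recon + l1_coeff * sparsity, f, x_hat

def rand_matrix(rows, cols, scale=0.1):
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d_model, d_dict = 4, 16          # overcomplete: more features than input dimensions
x = [random.gauss(0.0, 1.0) for _ in range(d_model)]
W_enc = rand_matrix(d_dict, d_model)
W_dec = rand_matrix(d_model, d_dict)
loss, f, x_hat = sae_loss(x, W_enc, [0.0] * d_dict, W_dec, [0.0] * d_model)
```

In practice the parameters would be trained by gradient descent on this loss over many activation vectors; the sketch only shows how the reconstruction and sparsity terms combine.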

Neighborhood — ranked by edge-count

Thinkers (3)

thinker
  • Toy models of superposition.
  • Co-author of VPD paper; mechanistic interpretability researcher affiliated with Goodfire.
  • Co-author; de-risked residual-stream SAEs, ran feature completeness analysis.

Methods (1)

method
  • General method for finding overcomplete sparse decompositions; the paper uses sparse autoencoders as an approximation

Concepts (2)

concept
  • Pre-Encoder Bias
    associated_with
    Architectural modification that subtracts a learned bias from the autoencoder inputs before encoding; the bias is initialized to the geometric median of the dataset and improves autoencoder performance
  • Untied Decoder Weights
    associated_with
    Autoencoder design choice to learn separate encoder and decoder weights, increasing representational capacity by allowing encoder vectors to distinguish similar features
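
As a rough illustration of the pre-encoder bias, the sketch below computes a geometric median and subtracts it from an input. The use of Weiszfeld's iteration, the iteration count, the tolerance, and the names `geometric_median` and `center` are assumptions for this example; the source only states that the bias is initialized to the geometric median of the dataset.

```python
import math

def geometric_median(points, iters=200, eps=1e-9):
    """Approximate the point minimizing the sum of Euclidean distances to `points`
    using Weiszfeld's iteration (an assumed choice of algorithm)."""
    dim = len(points[0])
    # Start from the arithmetic mean.
    m = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    for _ in range(iters):
        num = [0.0] * dim
        den = 0.0
        for p in points:
            # Inverse-distance weight, clamped to avoid division by zero.
            w = 1.0 / max(math.dist(p, m), eps)
            den += w
            for i in range(dim):
                num[i] += w * p[i]
        new_m = [n / den for n in num]
        converged = math.dist(new_m, m) < eps
        m = new_m
        if converged:
            break
    return m

def center(x, b_pre):
    # Subtract the pre-encoder bias from an input before it reaches the encoder.
    return [a - b for a, b in zip(x, b_pre)]

# Four corners of a square plus one distant outlier.
data = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [100.0, 100.0]]
b_pre = geometric_median(data)       # initial value for the learned bias
centered = center(data[0], b_pre)    # what the encoder would see for the first point
```

Unlike the mean, the geometric median stays close to the bulk of the data when an outlier is present, which is one plausible reason to prefer it as an initialization. After initialization the bias would be trained along with the separate encoder and decoder weights.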

Claims (1)

claim

Frameworks (3)

framework
  • Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
  • Core theoretical framework: neural networks represent more features than neurons by encoding features as directions in superposition
  • Mathematical framework enabling recovery of high-dimensional sparse vectors from low-dimensional projections; theoretically underpins the sparse autoencoder approach

Artifacts (1)

artifact

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.