paper
referenced-only
2023
paper:bricken-towards-monosemanticity-decomposing-lang-2023

Towards monosemanticity: Decomposing language models with dictionary learning

Methods (5)

  • Activation Interval Sampling
    Dividing each feature's activation spectrum into 11 evenly spaced intervals and sampling uniformly from each to evaluate monosemanticity across activation levels
  • Attribution Similarity
    Correlating attribution vectors (feature activation × logit weight of next token) across model pairs to measure functional universality
  • Feature Interpretability Rubric
    14-point scoring rubric for human evaluation of feature interpretability covering confidence, activation consistency, logit consistency, and specificity
  • Masked Cosine Similarity
    Cosine similarity between feature activations restricted to tokens where one of the features fires; used to identify feature splitting relationships
  • Neuron Resampling
    Periodically reinitializing dead autoencoder neurons using high-loss data points to improve feature coverage
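
Two of the methods above, Activation Interval Sampling and Masked Cosine Similarity, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's code: the function names, the choice to bucket only positive activations between zero and the maximum, and the choice to mask on the first feature only are all hypothetical.

```python
import math
import random


def sample_by_activation_interval(activations, n_intervals=11, per_interval=1, seed=0):
    """Bucket token indices by activation level, then sample from each bucket.

    Assumption: the range (0, max activation] is split into `n_intervals`
    equal-width bins and only tokens with positive activation are considered.
    """
    rng = random.Random(seed)
    active = [(i, a) for i, a in enumerate(activations) if a > 0]
    if not active:
        return []
    max_act = max(a for _, a in active)
    buckets = [[] for _ in range(n_intervals)]
    for i, a in active:
        # Clamp so the maximum activation lands in the last bucket.
        k = min(int(a / max_act * n_intervals), n_intervals - 1)
        buckets[k].append(i)
    # One list of sampled token indices per interval (possibly empty).
    return [rng.sample(b, min(per_interval, len(b))) for b in buckets]


def masked_cosine_similarity(acts_a, acts_b):
    """Cosine similarity of two features over tokens where feature A fires.

    Assumption: "fires" means activation > 0, and the mask comes from A only.
    """
    pairs = [(a, b) for a, b in zip(acts_a, acts_b) if a > 0]
    dot = sum(a * b for a, b in pairs)
    norm_a = math.sqrt(sum(a * a for a, _ in pairs))
    norm_b = math.sqrt(sum(b * b for _, b in pairs))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy activations for two features over four tokens.
feature_a = [0.0, 1.0, 2.0, 0.0]
feature_b = [5.0, 1.0, 2.0, 0.0]
sim_ab = masked_cosine_similarity(feature_a, feature_b)  # masked where A fires
sim_ba = masked_cosine_similarity(feature_b, feature_a)  # masked where B fires
samples = sample_by_activation_interval([float(v) for v in range(1, 111)])
```

Because the mask is taken from one feature, the measure is asymmetric: here A looks nearly identical to B wherever A fires, while B also fires on a token where A is silent. That asymmetry is one plausible reason the measure suits feature splitting, where a narrower feature fires on a subset of a broader one.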

Frameworks (2)

  • Disentanglement
    Related research agenda seeking representations that separate conceptually distinct factors; contrasted with the superposition-based approach
  • Sparse Autoencoder for Dictionary Learning
    Primary method introduced: trains a one-hidden-layer MLP with an L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
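
The sparse autoencoder objective described above can be sketched as a forward pass plus loss in plain Python. This is a minimal sketch, not the authors' implementation: the names `sae_loss`, `W_enc`, `W_dec`, `b_enc`, `b_dec`, and `l1_coeff` are hypothetical, and subtracting the decoder bias before encoding is an assumption about the architecture.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Reconstruction error plus L1 sparsity penalty for one activation vector.

    Shapes: W_enc is n_features x d, W_dec is d x n_features, with
    n_features > d for an overcomplete dictionary.
    """
    # Assumption: the input is centered by the decoder bias before encoding.
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre_acts = [h + b for h, b in zip(matvec(W_enc, centered), b_enc)]
    features = [max(0.0, h) for h in pre_acts]  # ReLU hidden layer
    x_hat = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(f) for f in features)  # sparsity penalty on feature activations
    return mse + l1_coeff * l1, features, x_hat


# Hypothetical toy setup: 2-dimensional activations, 3 dictionary features.
x = [1.0, 0.0]
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
loss, features, x_hat = sae_loss(x, W_enc, [0.0, 0.0, 0.0], W_dec, [0.0, 0.0], l1_coeff=0.1)
```

In this toy case only one of the three features is active, the reconstruction is exact, and the whole loss comes from the L1 term, which is the trade-off the penalty coefficient controls.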

Datasets (2)

  • 8 Billion MLP Activation Samples
    Dataset of transformer MLP activations used to train sparse autoencoders; collected from 40M contexts
  • The Pile
    Training corpus used for the 67M-parameter model tested with VPD
