question
active
What metrics can reliably tell us if dictionary learning has successfully extracted high-quality features?
Central methodological gap: current metrics (loss, density histograms, manual inspection) are inadequate
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Motivation for using sparsity-based dictionary learning on language models
- What is the 'correct number of features' for dictionary learning, and is this question well-posed? · question · 0.811 · Open question about whether there is a true discrete feature count or a continuous splitting process
- Scaling laws for dictionary learning are unknown and needed to assess feasibility on frontier models
- Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations
- Feature presence depends on concept frequency in training data, with a threshold scaling inversely with alive features.
- Controls for dataset structure, showing trained model activations have richer structure than data distribution alone
- SAE features can be found without pre-specified concepts, and feature steering often outperforms few-shot probe vectors.
- Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure