hypothesis
active
hypothesis:the-examples-of-features-found-in-language-models-suggest-they-are-highly-sparse-variables-consistent-with-dictionary-learning-being-applicable
The examples of features found in language models suggest they are highly sparse variables, consistent with dictionary learning being applicable
Motivation for using sparsity-based dictionary learning on language models
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations
- Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.830 · Core methodology paper for SAE-based interpretable feature extraction
- Scaling laws for dictionary learning are unknown and needed to assess feasibility on frontier models
- Controls for dataset structure, showing trained model activations have richer structure than data distribution alone
- what metrics can reliably tell us if dictionary learning has successfully extracted high quality features? · question · 0.812 · Central methodological gap: current metrics (loss, density histograms, manual inspection) are inadequate
- Rationale for using simpler sparse autoencoders rather than NP-hard compressed sensing algorithms
- Large language models develop surprisingly coherent yet often rigid internal preferences as they scale · finding · 0.801 · Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
- what is the 'correct number of features' for dictionary learning, and is this question well-posed? · question · 0.801 · Open question about whether there is a true discrete feature count or a continuous splitting process