finding
active
finding:dictionary-learning-on-model-with-randomly-shuffled-weights-produces-mainly-single-token-and-poorly-interpretable-features
Dictionary learning on a model with randomly shuffled weights produces mainly single-token and poorly interpretable features
Serves as a control for dataset structure: it shows that activations from the trained model have richer structure than the data distribution alone would produce
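A minimal sketch of how such a shuffled-weights control could be built, assuming the model's parameters are available as a dict of NumPy arrays. The helper name `shuffle_weights`, the parameter names, and the choice to permute entries within each tensor are illustrative assumptions; this card does not state how the source paper performed the shuffle.

```python
import numpy as np

def shuffle_weights(params, seed=0):
    """Return a copy of `params` with the entries of each array randomly permuted.

    Assumption: shuffling is done independently within each tensor, which keeps
    each tensor's shape and value distribution but destroys learned structure.
    """
    rng = np.random.default_rng(seed)
    shuffled = {}
    for name, w in params.items():
        # Generator.permutation returns a shuffled copy, so `w` is left untouched.
        flat = rng.permutation(w.ravel())
        shuffled[name] = flat.reshape(w.shape)
    return shuffled

# Hypothetical toy parameters standing in for a trained model.
params = {
    "mlp.w_in": np.arange(12, dtype=np.float32).reshape(3, 4),
    "mlp.b_in": np.ones(4, dtype=np.float32),
}
control = shuffle_weights(params)
```

Dictionary learning would then be run on activations collected from the model loaded with `control` instead of `params`, using the same dataset, so that any difference in feature quality is attributable to the trained weights rather than to the data.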
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Learned features reflect the functionality of the model and not just the data distribution, as evidenced by interpretable downstream effects not used in dictionary learning · associated_with · supports · Authors argue that features are model properties because logit effects and ablations are consistent with feature interpretations
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Motivation for using sparsity-based dictionary learning on language models
- Scaling laws for dictionary learning are unknown and needed to assess feasibility on frontier models
- What metrics can reliably tell us if dictionary learning has successfully extracted high-quality features? · question · 0.770 · Central methodological gap: current metrics (loss, density histograms, manual inspection) are inadequate
- Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (Bricken et al., 2023) · concept · 0.765 · Foundational SAE mechanistic interpretability paper
- Theoretical hypothesis about the mechanism underlying LLM error detection and reflection.
- Strong test of the induction head hypothesis using uniformly sampled random tokens repeated three times
- Selective pressure toward convergence via task generality
- SAE features can be found without pre-specified concepts, and feature steering often outperforms few-shot probe vectors.