claim
active
claim:dictionary-learning-offers-advantages-over-linear-probes-amortization-of-cost-and-unsupervised-discovery-of-abstractions
Dictionary learning offers advantages over linear probes: amortization of cost and unsupervised discovery of abstractions.
SAE features can be found without pre-specified concepts, and feature steering often outperforms few-shot probe vectors.
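To make the claim concrete, here is a minimal sketch of how a trained sparse autoencoder (SAE) dictionary is used. It assumes a standard ReLU encoder and linear decoder; the weights, dimensions, and function names are hypothetical toy values chosen for illustration, not taken from the source paper. Once the dictionary exists, every feature is read off any activation with one encoder pass (the amortization), and a feature's decoder row can serve as a steering direction without collecting labeled examples for a probe.

```python
# Toy SAE forward pass and feature steering (illustrative assumptions only).

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sae_encode(x, w_enc, b_enc):
    """Feature activations f = ReLU(W_enc x + b_enc)."""
    pre = matvec(w_enc, x)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]

def sae_decode(f, w_dec, b_dec):
    """Reconstruction x_hat = sum_i f[i] * w_dec[i] + b_dec.

    Row i of w_dec is the direction that feature i writes into activation space.
    """
    return [
        b_dec[j] + sum(f[i] * w_dec[i][j] for i in range(len(f)))
        for j in range(len(b_dec))
    ]

def steer(x, w_dec, feature, strength):
    """Add a multiple of one feature's decoder direction to the activation."""
    return [xj + strength * dj for xj, dj in zip(x, w_dec[feature])]

# Hypothetical dictionary: 3 features over a 2-dimensional activation space.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [-0.5, -0.5]]
B_DEC = [0.0, 0.0]

x = [2.0, 0.0]
features = sae_encode(x, W_ENC, B_ENC)   # only feature 0 is active
x_hat = sae_decode(features, W_DEC, B_DEC)
steered = steer(x, W_DEC, feature=1, strength=3.0)
```

A linear probe, by contrast, would need a labeled dataset and a separate training run for each concept of interest. The sketch does not reproduce the reported comparison in the low-data regime.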
Source paper
extracted_from
Neighborhood (ranked by edge count)
Findings (1)
finding
- Empirical comparison showing advantage of SAE features in low-data regime.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one. They are candidates for new edges or unrecognized duplicates.
- Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (Bricken et al., 2023) · concept · 0.779 · Foundational SAE mechanistic interpretability paper
- Motivation for using sparsity-based dictionary learning on language models
- Explains why cities+neg_cities and larger_than+smaller_than training sets yield better OOD accuracy
- what metrics can reliably tell us if dictionary learning has successfully extracted high quality features? · question · 0.763 · Central methodological gap: current metrics (loss, density histograms, manual inspection) are inadequate
- Rationale for using simpler sparse autoencoders rather than NP-hard compressed sensing algorithms
- Features smeared across layers cannot be fully disentangled by SAE on a single residual stream.
- Load-bearing motivation for multimodal approach; frames the cognitive advantage of joint modalities.
- Clarifies what unsupervised learning does.
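
The similarity neighborhood above could be computed along the following lines. This is a sketch under stated assumptions: the embeddings, entity names, and the `typed_edges` mapping are hypothetical, and only the 0.65 cosine threshold and the "no typed edge" filter come from the listing itself.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs with cosine >= threshold and no typed edge
    to the target, sorted by descending score."""
    linked = typed_edges.get(target, set())
    out = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything already related by a typed edge
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            out.append((name, score))
    return sorted(out, key=lambda pair: pair[1], reverse=True)

# Hypothetical 2-dimensional embeddings for illustration.
EMBEDDINGS = {
    "claim": [1.0, 0.0],
    "paper_a": [1.0, 1.0],   # similar enough, no typed edge: kept
    "paper_b": [0.0, 1.0],   # orthogonal to the claim: dropped
    "finding": [1.0, 0.0],   # identical direction but already linked: dropped
}
TYPED_EDGES = {"claim": {"finding"}}

neighbors = similarity_neighbors("claim", EMBEDDINGS, TYPED_EDGES)
```

Entities that pass both filters are the ones worth reviewing as potential new edges or duplicates.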