claim

active

claim:dictionary-learning-offers-advantages-over-linear-probes-amortization-of-cost-and-unsupervised-discovery-of-abstractions

Dictionary learning offers advantages over linear probes: amortization of cost and unsupervised discovery of abstractions.

SAE features can be found without pre-specified concepts, and feature steering often outperforms few-shot probe vectors.

Source paper

extracted_from

Scaling monosemanticity: Ex-tracting interpretable features from claude 3 sonnet

Neighborhood — ranked by edge-count

Findings (1)

finding

Feature steering was effective in 5 out of 7 cases where few-shot probe steering vectors failed to produce meaningful behavior change.
supports
Empirical comparison showing advantage of SAE features in low-data regime.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (Bricken et al., 2023)concept0.779
Foundational SAE mechanistic interpretability paper
The examples of features found in language models suggest they are highly sparse variables, consistent with dictionary learning being applicablehypothesis0.770
Motivation for using sparsity-based dictionary learning on language models
Training probes on statements and their opposites improves generalization by mitigating non-truth features with opposite-sign correlationsclaim0.765
Explains why cities+neg_cities and larger_than+smaller_than training sets yield better OOD accuracy
what metrics can reliably tell us if dictionary learning has successfully extracted high quality features?question0.763
Central methodological gap: current metrics (loss, density histograms, manual inspection) are inadequate
Sparse autoencoders are preferable to stronger iterative dictionary learning methods because they cannot recover features the model itself cannot accessclaim0.760
Rationale for using simpler sparse autoencoders rather than NP-hard compressed sensing algorithms
Cross-layer superposition is a fundamental challenge for dictionary learning.claim0.758
Features smeared across layers cannot be fully disentangled by SAE on a single residual stream.
Imagine reading a textbook with no figures or tables. Our ability to knowledge acquisition is greatly strengthened by jointly modeling diverse data modalities, such as vision, language, and audio.quote0.757
Load-bearing motivation for multimodal approach; frames the cognitive advantage of joint modalities.
Unsupervised learning builds a low-dimensional model of the input data.claim0.753
Clarifies what unsupervised learning does.