question

active

question:what-metrics-can-reliably-tell-us-if-dictionary-learning-has-successfully-extracted-high-quality-features

What metrics can reliably tell us if dictionary learning has successfully extracted high-quality features?

Central methodological gap: current metrics (loss, density histograms, manual inspection) are inadequate

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence
gates
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The examples of features found in language models suggest they are highly sparse variables, consistent with dictionary learning being applicable · hypothesis · 0.812
Motivation for using sparsity-based dictionary learning on language models
What is the 'correct number of features' for dictionary learning, and is this question well-posed? · question · 0.811
Open question about whether there is a true discrete feature count or a continuous splitting process
As the subject model scales, how does the ideal expansion factor and required training data for dictionary learning change? · question · 0.802
Scaling laws for dictionary learning are unknown and needed to assess feasibility on frontier models
Learned features reflect the functionality of the model and not just the data distribution, as evidenced by interpretable downstream effects not used in dictionary learning · claim · 0.794
Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations
There appears to be a systematic relationship between the frequency of concepts and the dictionary size needed to resolve features for them. · claim · 0.780
Feature presence depends on concept frequency in training data, with a threshold scaling inversely with alive features.
Dictionary learning on a model with randomly shuffled weights produces mainly single-token and poorly interpretable features · finding · 0.770
Controls for dataset structure, showing trained model activations have richer structure than data distribution alone
Dictionary learning offers advantages over linear probes: amortization of cost and unsupervised discovery of abstractions. · claim · 0.763
SAE features can be found without pre-specified concepts, and feature steering often outperforms few-shot probe vectors.
LLMs trained only on language data have rich enough knowledge of visual structures that decent visual representations can be trained on images generated solely by querying the LLM · finding · 0.747
Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure