question

active

question:what-is-the-correct-number-of-features-for-dictionary-learning-and-is-this-question-well-posed

what is the 'correct number of features' for dictionary learning, and is this question well-posed?

Open question about whether there is a true discrete feature count or a continuous splitting process

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Feature splitting means dictionaries with fewer features provide coarser summaries of model features while larger dictionaries reveal finer-grained distinctions, with no uniquely 'correct' number of features
gates
Authors argue the absence of a fixed feature count is a property of the superposition geometry, not a failure of the method

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
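A minimal sketch of how a neighborhood like this one could be computed, assuming each entity has an embedding vector and typed edges are stored as pairs. The names `cosine`, `similar_untyped`, `embeddings`, and `typed_edges` are hypothetical and the vectors are made-up toy data; only the 0.65 cutoff and the "no typed edge" filter come from this page.

```python
import math

def cosine(u, v):
    # Cosine similarity of two equal-length, non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similar_untyped(target, embeddings, typed_edges, threshold=0.65):
    # Keep entities whose similarity to `target` meets the threshold
    # and that have no typed edge to it in either direction.
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Rank by descending similarity, as in the list on this page.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy data (assumed, not from the graph).
embeddings = {
    "q": [1.0, 0.0],
    "a": [0.9, 0.1],   # similar, no typed edge: kept
    "b": [0.0, 1.0],   # orthogonal: below threshold
    "c": [1.0, 0.2],   # similar, but already linked by a typed edge
}
typed_edges = {("c", "q")}
result = similar_untyped("q", embeddings, typed_edges)
```

Under these assumptions only `"a"` would be listed: `"b"` falls below the cutoff and `"c"` is excluded because it already has a typed relation to the target.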

what metrics can reliably tell us if dictionary learning has successfully extracted high quality features? · question · 0.811
Central methodological gap: current metrics (loss, density histograms, manual inspection) are inadequate
The examples of features found in language models suggest they are highly sparse variables, consistent with dictionary learning being applicable · hypothesis · 0.801
Motivation for using sparsity-based dictionary learning on language models
as the subject model scales, how does the ideal expansion factor and required training data for dictionary learning change? · question · 0.794
Scaling laws for dictionary learning are unknown and needed to assess feasibility on frontier models
There appears to be a systematic relationship between the frequency of concepts and the dictionary size needed to resolve features for them. · claim · 0.784
Whether a feature is present depends on how frequent the concept is in the training data, with a threshold that scales inversely with the number of alive features.
Learned features reflect the functionality of the model and not just the data distribution, as evidenced by interpretable downstream effects not used in dictionary learning · claim · 0.778
Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations
Optimal number of features scales faster than optimal number of training steps with compute budget. · finding · 0.772
Allocation result from scaling laws.
Dictionary learning on model with randomly shuffled weights produces mainly single-token and poorly interpretable features · finding · 0.749
Controls for dataset structure, showing trained model activations have richer structure than data distribution alone
How can we decide if our selection of examples is complete? · question · 0.740
Central question motivating attribute exploration.