question

active

question:as-the-subject-model-scales-how-does-the-ideal-expansion-factor-and-required-training-data-for-dictionary-learning-change

As the subject model scales, how do the ideal expansion factor and the required training data for dictionary learning change?

Scaling laws for dictionary learning are unknown, yet they are needed to assess its feasibility on frontier models

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The examples of features found in language models suggest they are highly sparse variables, consistent with dictionary learning being applicable · hypothesis · 0.822
Motivation for using sparsity-based dictionary learning on language models
Learned features reflect the functionality of the model and not just the data distribution, as evidenced by interpretable downstream effects not used in dictionary learning · claim · 0.815
Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations
What metrics can reliably tell us if dictionary learning has successfully extracted high quality features? · question · 0.802
Central methodological gap: current metrics (loss, density histograms, manual inspection) are inadequate
What is the 'correct number of features' for dictionary learning, and is this question well-posed? · question · 0.794
Open question about whether there is a true discrete feature count or a continuous splitting process
Dictionary learning on a model with randomly shuffled weights produces mainly single-token and poorly interpretable features · finding · 0.787
Controls for dataset structure, showing trained model activations have richer structure than data distribution alone
The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions · hypothesis · 0.769
Hypothesis explaining negative correlation between reflection rate and accuracy without implying reflection is harmful
How does different post-training data shift a model's position along persona dimensions? · question · 0.766
Future work direction: using persona space to study effects of training data on model character
What makes learning systems smart is that the parameters they adjust and the data to which they fit are not in the same space. · claim · 0.764
Distillation of why learning generalises.