Scaling laws can be used to guide the training of sparse autoencoders.

Compute-optimal hyperparameters follow predictable power-law relationships.
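A power-law relationship of the form y = a * x^b is a straight line in log-log space, so its exponent can be estimated with an ordinary least-squares fit on the logarithms. The sketch below is illustrative only: the helper name `fit_power_law`, the compute budgets, and the coefficients 3.0 and 0.4 are synthetic assumptions chosen for the demo, not values reported in the source paper.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on (log x, log y). Hypothetical helper."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the exponent b.
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx
    # Intercept in log space is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Synthetic data generated from a known power law (NOT measurements from the paper).
compute = [1e15, 1e16, 1e17, 1e18]
optimal_features = [3.0 * c ** 0.4 for c in compute]

a, b = fit_power_law(compute, optimal_features)

# Extrapolate the fitted law to a larger, hypothetical compute budget.
predicted = a * 1e19 ** b
```

With real sweep results, `compute` and `optimal_features` would be replaced by the measured budget and the best-performing hyperparameter value at each budget; the fit is only trustworthy if the points are close to linear in log-log space.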

Source paper

extracted_from

finding

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sparse autoencoders produce interpretable features for large models · claim · 0.820
Central claim of the paper: the method scales to state-of-the-art transformers.
Training the sparse autoencoder on more data makes features subjectively sharper and more interpretable · claim · 0.817
Empirical principle discovered during autoencoder training; led to using 8 billion training points.
Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.800
Core methodology paper for SAE-based interpretable feature extraction.
Sparse autoencoders are preferable to stronger iterative dictionary learning methods because they cannot recover features the model itself cannot access · claim · 0.793
Rationale for using simpler sparse autoencoders rather than NP-hard compressed sensing algorithms.
We hypothesize that sparse autoencoders or similar methods will work on frontier large language models, though significant computational challenges remain · hypothesis · 0.788
Forward-looking prediction about scalability of the method to larger models.
Inverse Scaling Law · concept · 0.785
Hypothesis cited in the paper suggesting that deceptive capabilities may scale with model size.
Sparse autoencoders don't provide a comprehensive solution because they decode activations, not parameters · claim · 0.784
Critique of activation-based interpretability methods.
Scaling laws analysis for SAE hyperparameters · method · 0.777
Sweeping the number of features and the number of training steps to find compute-optimal SAE configurations.
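The neighbor list above is selected by two conditions: embedding cosine similarity of at least 0.65 and the absence of a typed edge to this entity. A minimal sketch of that filter follows, assuming entities are stored as plain embedding vectors and typed edges as ID pairs; the function names, entity IDs, and 2-D toy vectors are hypothetical and do not reflect the actual implementation behind this page.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (id, score) pairs above the threshold with no typed edge to the target."""
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already linked to the target, in either direction.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy data: hypothetical IDs and 2-D vectors for illustration only.
embeddings = {
    "finding": [1.0, 0.0],
    "source_paper": [1.0, 0.1],   # similar, but already linked by a typed edge
    "near_untyped": [0.8, 0.6],   # similar and unlinked
    "far": [0.0, 1.0],            # unlinked but dissimilar
}
typed_edges = {("finding", "source_paper")}

result = untyped_neighbors("finding", embeddings, typed_edges)
```

Here only `near_untyped` survives both conditions: `source_paper` is excluded by its existing edge and `far` falls below the threshold.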