hypothesis

active

hypothesis:the-examples-of-features-found-in-language-models-suggest-they-are-highly-sparse-variables-consistent-with-dictionary-learning-being-applicable

The examples of features found in language models suggest they are highly sparse variables, consistent with dictionary learning being applicable

Motivation for using sparsity-based dictionary learning on language models

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Learned features reflect the functionality of the model and not just the data distribution, as evidenced by interpretable downstream effects not used in dictionary learning · claim · 0.842
Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations
Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.830
Core methodology paper for SAE-based interpretable feature extraction
As the subject model scales, how do the ideal expansion factor and required training data for dictionary learning change? · question · 0.822
Scaling laws for dictionary learning are unknown and needed to assess feasibility on frontier models
Dictionary learning on a model with randomly shuffled weights produces mainly single-token and poorly interpretable features · finding · 0.818
Controls for dataset structure, showing trained model activations have richer structure than data distribution alone
What metrics can reliably tell us whether dictionary learning has successfully extracted high-quality features? · question · 0.812
Central methodological gap: current metrics (loss, density histograms, manual inspection) are inadequate
Sparse autoencoders are preferable to stronger iterative dictionary learning methods because they cannot recover features the model itself cannot access · claim · 0.808
Rationale for using simpler sparse autoencoders rather than NP-hard compressed sensing algorithms
Large language models develop surprisingly coherent yet often rigid internal preferences as they scale · finding · 0.801
Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
What is the 'correct number of features' for dictionary learning, and is this question well-posed? · question · 0.801
Open question about whether there is a true discrete feature count or a continuous splitting process