claim
active
claim:learned-features-reflect-the-functionality-of-the-model-and-not-just-the-data-distribution-as-evidenced-by-interpretable-downstream-effects-not-used-in-dictionary-learning
Learned features reflect the functionality of the model and not just the data distribution, as evidenced by interpretable downstream effects not used in dictionary learning
The authors argue that features are properties of the model because their logit effects and ablation results are consistent with the feature interpretations.
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Findings (3)
finding
- Dictionary learning on a model with randomly shuffled weights produces mainly single-token and poorly interpretable features · associated_with · supports · Controls for dataset structure, showing that trained model activations have richer structure than the data distribution alone
- Causal validation of base64 feature function via pinned feature sampling
- Causal validation that the Arabic feature has the predicted downstream effect on generation
Methods (3)
method
- Logit Weight Analysis · supports · Computing each feature's linear effect on output token logits via path expansion through the MLP output weights and unembedding matrix
- Clamping a feature's value to zero to measure its causal effect on model output.
- Pinned Feature Sampling · supports · Setting a feature's value to its maximum observed value and sampling from the model to validate causal interpretations
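
The three methods above can be illustrated on a toy linear model. The following is a minimal sketch under stated assumptions, not the source paper's implementation: every name, shape, and value is hypothetical, and only the direct linear path (feature decoder direction, then MLP output weights, then unembedding) is modeled. A real pinned-sampling run would sample text from the full model rather than inspect logits.

```python
# Toy sketch only: all matrices, names, and values here are hypothetical.

def matvec(vec, mat):
    """Row vector times matrix (vec @ mat), using plain lists."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]

def logit_weights(decoder_dir, w_out, w_unembed):
    """Logit weight analysis: a feature's linear effect on each output
    logit, i.e. decoder_dir @ W_out @ W_U."""
    return matvec(matvec(decoder_dir, w_out), w_unembed)

def clamp_feature(feature_acts, index, value):
    """Return a copy of the activations with one feature set to `value`.
    value=0.0 is an ablation; value=max observed activation is pinning."""
    clamped = list(feature_acts)
    clamped[index] = value
    return clamped

def logits_from_features(feature_acts, decoder, w_out, w_unembed):
    """Logits along the direct path from feature activations."""
    mlp_acts = matvec(feature_acts, decoder)  # reconstructed MLP activations
    return matvec(matvec(mlp_acts, w_out), w_unembed)

# Hypothetical toy model: 2 features, 3 MLP neurons, d_model 2, vocab 3.
decoder = [[1, 0, 0],
           [0, 1, 0]]
w_out = [[1, 0],
         [0, 1],
         [0, 0]]
w_unembed = [[2, 0, -1],
             [0, 3, 0]]

acts = [0.5, 1.0]          # assumed feature activations on one token
max_observed = 4.0         # assumed maximum activation of feature 0

weights_f0 = logit_weights(decoder[0], w_out, w_unembed)
baseline = logits_from_features(acts, decoder, w_out, w_unembed)
ablated = logits_from_features(clamp_feature(acts, 0, 0.0),
                               decoder, w_out, w_unembed)
pinned = logits_from_features(clamp_feature(acts, 0, max_observed),
                              decoder, w_out, w_unembed)

# In this purely linear toy, ablating feature 0 removes exactly
# (activation * logit weight) from every logit.
ablation_effect = [b - a for b, a in zip(baseline, ablated)]
```

In this linear toy the ablation effect equals the feature's activation times its logit weights, which is the kind of agreement between logit weights and interventions that the claim relies on. In a real transformer the match is only approximate, since other paths and nonlinearities also contribute.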
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Motivation for using sparsity-based dictionary learning on language models
- Scaling laws for dictionary learning are unknown and needed to assess feasibility on frontier models
- what metrics can reliably tell us if dictionary learning has successfully extracted high-quality features? · question · 0.794 · Central methodological gap: current metrics (loss, density histograms, manual inspection) are inadequate
- what is the 'correct number of features' for dictionary learning, and is this question well-posed? · question · 0.778 · Open question about whether there is a true discrete feature count or a continuous splitting process
- One of the updates about prosaic ML simulation.
- All models exhibit above-baseline representation of the think word when instructed to think about it · finding · 0.769 · In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.
- Describes scaffolding method and the model's meta-learning loop.