finding

active

finding:dictionary-learning-on-model-with-randomly-shuffled-weights-produces-mainly-single-token-and-poorly-interpretable-features

Dictionary learning on a model with randomly shuffled weights produces mainly single-token and poorly interpretable features

Controls for dataset structure, showing that a trained model's activations have richer structure than the data distribution alone would produce
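The control described above can be illustrated with a minimal sketch. It assumes that "randomly shuffled weights" means permuting the entries within each weight tensor, which keeps the distribution of weight values but destroys learned structure. The function name `shuffle_weights`, the parameter name `mlp.w_in`, and the nested-list representation are hypothetical and not taken from the source paper.

```python
import random

def shuffle_weights(weights, seed=0):
    """Return a copy of `weights` with the entries of each matrix permuted.

    `weights` maps a parameter name to a matrix stored as a list of rows.
    Assumption: shuffling happens independently within each matrix.
    """
    rng = random.Random(seed)
    shuffled = {}
    for name, matrix in weights.items():
        n_cols = len(matrix[0])
        # Flatten, permute, then restore the original shape.
        flat = [value for row in matrix for value in row]
        rng.shuffle(flat)
        shuffled[name] = [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]
    return shuffled

# Toy example: one 2x3 weight matrix (hypothetical values).
toy = {"mlp.w_in": [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]}
control = shuffle_weights(toy)
```

Activations collected from such a control model would then go through the same dictionary learning pipeline as the trained model, so that differences in feature quality can be attributed to the learned weights rather than to the dataset.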

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Learned features reflect the functionality of the model and not just the data distribution, as evidenced by interpretable downstream effects not used in dictionary learning
associated_with · supports
Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The examples of features found in language models suggest they are highly sparse variables, consistent with dictionary learning being applicable · hypothesis · 0.818
Motivation for using sparsity-based dictionary learning on language models
As the subject model scales, how do the ideal expansion factor and required training data for dictionary learning change? · question · 0.787
Scaling laws for dictionary learning are unknown and needed to assess feasibility on frontier models
What metrics can reliably tell us if dictionary learning has successfully extracted high-quality features? · question · 0.770
Central methodological gap: current metrics (loss, density histograms, manual inspection) are inadequate
Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (Bricken et al., 2023) · concept · 0.765
Foundational SAE mechanistic interpretability paper
LLMs implicitly learn a distribution of 'consistent reasoning paths', and inconsistent reasoning forms statistical outliers with low probability under this distribution. · hypothesis · 0.758
Theoretical hypothesis about the mechanism underlying LLM error detection and reflection.
Induction heads in two-layer models successfully perform in-context learning on completely random repeated token sequences far outside training distribution · finding · 0.757
Strong test of the induction head hypothesis using uniformly sampled random tokens repeated three times
There are fewer representations competent for N tasks than M < N tasks, so training more general models should yield fewer possible solutions · hypothesis · 0.755
Selective pressure toward convergence via task generality
Dictionary learning offers advantages over linear probes: amortization of cost and unsupervised discovery of abstractions. · claim · 0.749
SAE features can be found without pre-specified concepts, and feature steering often outperforms few-shot probe vectors.
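
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) could be sketched as follows. This only illustrates the filter; it is not the implementation behind this page. The entity names, embeddings, and the `untyped_neighbors` helper are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """List entities similar to `target` that share no typed edge with it."""
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked to the target in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda hit: -hit[1])

# Hypothetical 2-d embeddings and one existing typed edge.
embeddings = {
    "finding": [1.0, 0.0],
    "hypothesis": [0.8, 0.6],
    "claim": [1.0, 0.1],
    "other": [0.0, 1.0],
}
typed_edges = {("finding", "claim")}
candidates = untyped_neighbors("finding", embeddings, typed_edges)
```

In this toy setup, `claim` is skipped because it already has a typed edge to the target, and `other` falls below the threshold, so only `hypothesis` remains as a candidate for a new edge.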