question

active

question:will-these-methods-work-for-large-models

will these methods work for large models?

Motivating question for the paper, addressed by scaling SAEs to Claude 3 Sonnet.

Source paper

extracted_from

Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet

Neighborhood — ranked by edge-count

Claims (1)

claim

Sparse autoencoders produce interpretable features for large models.
answered_by · gates
Central claim of the paper: the method scales to state-of-the-art transformers.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

We hypothesize that sparse autoencoders or similar methods will work on frontier large language models, though significant computational challenges remain · hypothesis · 0.791
Forward-looking prediction about scalability of the method to larger models
Features can be used to steer large models. · claim · 0.783
Clamping feature activations causally alters model behavior in interpretable ways.
Does DAS scale with large foundation models? · question · 0.762
Practical scalability question addressed in Appendix D.
Representation engineering for large-language models: Survey and research challenges (Bartoszcze et al., 2025) · concept · 0.745
Survey of representation engineering methods cited as related work
Bigger models are more likely to converge to a shared representation than smaller models · hypothesis · 0.743
Selective pressure toward convergence via model capacity
Larger models should amplify bias less than smaller models, with model biases more accurately reflecting data biases rather than exacerbating them · claim · 0.737
Implication of PRH for AI fairness and bias
we have shown a mathematical relationship between the two models · quote · 0.732
Core claim distinguishing this paper's contribution from looser representational similarity arguments.
Today's Large Language Models have become so good at playing Turing's game that it often takes experts to demonstrate the present limits of their ability to simulate human-like intelligence. · claim · 0.729
Paper's assessment of current LLM capabilities relative to Turing Test
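
A minimal sketch of how a "related by similarity" list like the one above could be produced, assuming each entity has an embedding vector and typed edges are stored as unordered pairs. The function names, the `embeddings` and `typed_edges` structures, and the toy vectors are hypothetical illustrations, not this graph's actual implementation; only the 0.65 cosine threshold and the "no typed edge" filter come from the listing itself.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with cosine >= threshold and no typed edge to target.

    embeddings:  dict mapping entity id -> vector (assumed structure)
    typed_edges: set of frozenset({id_a, id_b}) pairs (assumed structure)
    """
    results = []
    for entity_id, vector in embeddings.items():
        if entity_id == target:
            continue  # skip the entity itself
        if frozenset((target, entity_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            results.append((entity_id, round(score, 3)))
    # Highest similarity first, matching the descending order of the listing
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Toy data: "a" is identical to "q" but already has a typed edge, "b" is orthogonal
embeddings = {"q": [1.0, 0.0], "a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
typed_edges = {frozenset(("q", "a"))}
candidates = related_by_similarity("q", embeddings, typed_edges)
```

With this toy data only "c" survives both filters, which mirrors how the listing surfaces candidates for new edges or unrecognized duplicates.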