quote
active
quote:in-some-sense-this-is-the-simplest-language-model-we-profoundly-don-t-understand-and-so-it-makes-a-natural-target-for-our-paper
In some sense, this is the simplest language model we profoundly don't understand. And so it makes a natural target for our paper.
Articulates why a one-layer transformer with an MLP is the appropriate starting target for mechanistic interpretability.
Source paper
extracted_from (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Opening sentence setting the stage for the importance of interpretability.
- Demonstrated transformers on mathematical understanding and logic; cited to motivate transformer versatility.
- Claim about the limited utility of natural language surface features.
- Antra's earlier definitive statement of the tricameral model.
- Paper's assessment of current LLM capabilities relative to the Turing Test.
- Large language models develop surprisingly coherent yet often rigid internal preferences as they scale · finding · 0.793 · Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures.
- Reflection on the eleven principles class, affirming that even a minimal form language can yield strong results.
- States that the form language delimits what buildings can be created.