framework
active
framework:transformer-architecture
transformer architecture
Neural network architecture based on attention, commonly used in large language models
Neighborhood — ranked by edge-count
Papers (2)
paper
Thinkers (1)
thinker
- Ashish Vaswani · introduces · Lead author of 'Attention Is All You Need', the paper that introduced the transformer architecture
Concepts (1)
concept
- Large Language Models (LLMs) · implements · Transformer-based models like GPT-4, LaMDA, PaLM; assessed for GWT indicators.
Frameworks (3)
framework
- Main framework: uses scaling of free energy under domain wall formation to determine whether local interactions can sustain ordered phases based on graph topology alone
- Autoregressive models · extends · Second model system studied; used to show why flat autoregressive LLMs struggle with long-range coherence.
- AR(ω) model · implements · Stochastic process model predicting the next token from a context window of length ω; mapped to a local Hamiltonian
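The AR(ω) entry above describes a process in which the next token depends only on the previous ω tokens. A minimal sketch of that idea follows, assuming a simple count-based estimate of the conditional distribution. The names `fit_ar` and `sample_next` are hypothetical and for illustration only; this is not the construction used in the source paper, and it does not attempt the mapping to a local Hamiltonian.

```python
import random
from collections import Counter, defaultdict


def fit_ar(tokens, omega):
    """Count which token follows each context of exactly `omega` tokens.

    Assumption: the conditional distribution is estimated by raw counts.
    """
    if omega < 1:
        raise ValueError("omega must be at least 1")
    counts = defaultdict(Counter)
    for i in range(omega, len(tokens)):
        context = tuple(tokens[i - omega:i])
        counts[context][tokens[i]] += 1
    return counts


def sample_next(counts, history, omega, rng):
    """Sample the next token given only the last `omega` tokens of `history`.

    Returns None when the context was never seen during fitting.
    """
    context = tuple(history[-omega:])
    options = counts.get(context)
    if not options:
        return None
    candidates = list(options)
    weights = [options[t] for t in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy usage: a strictly alternating sequence with a window of two tokens.
counts = fit_ar(list("abababab"), omega=2)
next_token = sample_next(counts, list("ab"), omega=2, rng=random.Random(0))
```

Anything earlier than the last ω tokens is invisible to `sample_next`, which is the locality property the entry refers to.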
Artifacts (1)
artifact
- Simulators (LessWrong post) · mentions · The paper being extracted.
Findings (1)
finding
- Application to transformer language models
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Base architecture of reasoning LLMs studied, with attention and MLP blocks per layer
- Prior Anthropic paper enabling circuit-level analysis of attention-only transformers; motivates current MLP decomposition
- Core abstraction in Fruit: pure function mapping signals to signals; enables compositional GUI definitions.
- Foundational mechanistic interpretability paper on transformer circuit analysis
- The varied neural network architectures used in the RL experiments to test whether the alignment phenomenon generalizes across architectures.
- Alexander's projected future architecture using ultramodern materials and process-based techniques to achieve living structure unlike 20th-century mechanical repetition.
- A transformer with no attention layers; shown to model bigram statistics via T = W_U W_E
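The last entry above states that a transformer with no attention layers models bigram statistics via T = W_U W_E. A small sketch of why: without attention, the logits for the next token depend only on the current token, so the whole model reduces to one vocabulary-by-vocabulary table. The matrix values and sizes below are made up for illustration, and the shape convention (W_E as d_model × vocab, W_U as vocab × d_model) is an assumption.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy sizes: vocabulary of 3 tokens, model width 2.
# W_E is (d_model x vocab): column j is the embedding of token j.
W_E = [[1.0, 0.0, -1.0],
       [0.0, 1.0, 0.5]]
# W_U is (vocab x d_model): row i scores next token i.
W_U = [[0.0, 2.0],
       [1.0, 0.0],
       [-1.0, 1.0]]

# T[i][j] is the logit of next token i given current token j.
T = matmul(W_U, W_E)


def next_token_distribution(current_token):
    """Next-token probabilities, which depend on the current token only."""
    logits = [row[current_token] for row in T]
    return softmax(logits)
```

Because `next_token_distribution` takes a single token index and nothing else, it can express only bigram-style statistics, no matter how the two matrices are chosen.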