Efficient Transformers: A Survey
DOI: 10.1145/3530811
Original abstract
Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision, and reinforcement learning. In the field of natural language processing, for example, Transformers have become an indispensable staple in the modern deep learning stack. Recently, a dizzying number of “X-former” models have been proposed (Reformer, Linformer, Performer, and Longformer, to name a few) that improve upon the original Transformer architecture, many of which make improvements around computational and memory efficiency. With the aim of helping the avid researcher navigate this flurry, this article characterizes a large and thoughtful selection of recent efficiency-flavored “X-former” models, providing an organized and comprehensive overview of existing work and models across multiple domains.
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Rows surfaced by multiple sources are weighted higher.
- How Do Transformers "Do" Physics? Investigating the Simple Harmonic Oscillator. Subhash Kantamneni, Ziming Liu, and Max Tegmark. 2024. ≈ 67%
- How Transformers Get Rich: Approximation and Dynamics Analysis. Mingze Wang, Ruoxi Yu, Weinan E, Lei Wu. 2025. ≈ 66%
- Transformer-Based Autonomous Driving Models and Deployment-Oriented Compression: A Survey. Juan Zhong, Yuhang Shi, Zukang Xu, Xi Chen. 2026. ≈ 65%
- Diversity of Transformer Layers: One Aspect of Parameter Scaling Laws. Hidetaka Kamigaito, Ying Zhang, Jingun Kwon, Katsuhiko Hayashi, Manabu Okumura, Taro Watanabe. 2025. ≈ 65%
- What do Transformers Know about Government? Jue Hou, Anisia Katinskaia, Lari Kotilainen, Sathianpong Trangcasanchai, Anh-Duc Vu, Roman Yangarber. 2024. ≈ 64%
- Can Transformers Do Enumerative Geometry? Baran Hashemi, Roderic G. Corominas, Alessandro Giacchetto. 2025. ≈ 64%
- Can Transformers Learn to Solve Problems Recursively? Shizhuo Dylan Zhang, Curt Tigges, Stella Biderman, Maxim Raginsky, Talia Ringer. 2023. ≈ 64%
- Transformers Struggle to Learn to Search. Abulhair Saparov, Srushti Pawar, Shreyas Pimpalgaonkar, Nitish Joshi, Richard Yuanzhe Pang, Vishakh Padmakumar, Seyed Mehran Kazemi, Najoung Kim, He He. 2025. ≈ 64%
- A Mathematical Framework for Transformer Circuits. In corpus. 2021. ≈ 61%
- Relating transformers to models and neural representations of the hippocampal formation. In corpus. 2021. ≈ 60%
- Simulators (LessWrong). In corpus. ≈ 58%
- Model Alignment Search. In corpus. 2025. ≈ 58%
- Interpreting Language Model Parameters. In corpus. 2026. ≈ 57%
- Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders. In corpus. 2026. ≈ 57%
- Technical Dimensions of Programming Systems. In corpus. 2023. ≈ 56%
Cited by (1)
- Interpreting Language Model Parameters
VPD (adVersarial Parameter Decomposition) decomposes weight matrices directly into rank-one interpretable subcomponents rather than decomposing activations as sparse autoencoders (SAEs) do, flipping t