Efficient Transformers: A Survey

By Yi Tay · Mostafa Dehghani · Dara Bahri · Donald Metzler

DOI 10.1145/3530811 OpenAlex W3085139254

Abstract

Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision, and reinforcement learning. In the field of natural language processing, for example, Transformers have become an indispensable staple in the modern deep learning stack. Recently, a dizzying number of “X-former” models have been proposed (Reformer, Linformer, Performer, and Longformer, to name a few) that improve upon the original Transformer architecture, many of them targeting computational and memory efficiency. With the aim of helping the avid researcher navigate this flurry, this article characterizes a large and thoughtful selection of recent efficiency-flavored “X-former” models, providing an organized and comprehensive overview of existing work and models across multiple domains.
