paper:doi-10-48550-arxiv-1910-10683
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Original abstract
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.
- Transfer Learning for Speech Recognition on a Budget. Julius Kunze, Louis Kirsch, Ilia Kurenkov, Andreas Krug, Jens Johannsmeier and Sebastian Stober. 2017. ≈ 71%
- READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling. Thong Nguyen, Xiaobao Wu, Xinshuai Dong, Khoi Le, Zhiyuan Hu, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan. 2026. ≈ 71%
- Probing Word Translations in the Transformer and Trading Decoder for Encoder Layers. Hongfei Xu, Josef van Genabith, Qiuhui Liu and Deyi Xiong. 2021. ≈ 71%
- To transfer or not transfer: Unified transferability metric and analysis. Qianshan Zhan and Xiao-Jun Zeng. 2026. ≈ 70%
- Can Transformers Learn to Solve Problems Recursively? Shizhuo Dylan Zhang, Curt Tigges, Stella Biderman, Maxim Raginsky, Talia Ringer. 2023. ≈ 70%
- Transformer See, Transformer Do: Copying as an Intermediate Step in Learning Analogical Reasoning. Philipp Hellwig, Willem Zuidema, Claire E. Stevenson, Martha Lewis. 2026. ≈ 69%
- Birth of a Transformer: A Memory Viewpoint. Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, Leon Bottou. 2023. ≈ 69%
- Transfer Learning for Improving Speech Emotion Classification Accuracy. Siddique Latif, Rajib Rana, Shahzad Younis, Junaid Qadir, and Julien Epps. 2020. ≈ 69%
- Diversity of Transformer Layers: One Aspect of Parameter Scaling Laws. Hidetaka Kamigaito, Ying Zhang, Jingun Kwon, Katsuhiko Hayashi, Manabu Okumura, Taro Watanabe. 2025. ≈ 69%
- How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability. Shawn Im, Changdae Oh, Zhen Fang, Sharon Li. 2026. ≈ 68%
- Relating transformers to models and neural representations of the hippocampal formation [in corpus]. 2021. ≈ 68%
- A Mathematical Framework for Transformer Circuits [in corpus]. 2021. ≈ 65%
- Learning without neurons in physical systems [in corpus]. 2022. ≈ 63%
- Interpreting Language Model Parameters [in corpus]. 2026. ≈ 63%
- The Platonic Representation Hypothesis [in corpus]. 2024. ≈ 63%
- Model Alignment Search [in corpus]. 2025. ≈ 62%
- Simulators (LessWrong) [in corpus]. ≈ 62%
Cited by (3)
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
Propositional truth in LLMs is not encoded as a single linear direction but as a multi-dimensional subspace that can be characterized by concept cones—sets of all nonnegative linear combinations of or
- Multimodal Chain-of-Thought Reasoning in Language Models
Incorporating visual features into chain-of-thought rationale generation—rather than answer generation alone—breaks the hallucination bottleneck that causes sub-100B language models to fail at multimo
- Interpreting Language Model Parameters
VPD (adVersarial Parameter Decomposition) decomposes weight matrices directly into rank-one interpretable subcomponents rather than decomposing activations as sparse autoencoders (SAEs) do, flipping t