Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

By Colin Raffel · Noam Shazeer · Adam P. Roberts · Katherine Lee · Sharan Narang · Michael Matena +3 more

DOI 10.48550/arxiv.1910.10683 arXiv 1910.10683 OpenAlex W2981852735

Original abstract

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
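
The text-to-text framing mentioned in the abstract can be illustrated with a minimal sketch. The abstract itself does not spell out the format, so the task prefixes and example strings below are assumptions modeled on the convention described in the paper, and the names `TextToTextExample` and `to_text_to_text` are hypothetical helpers introduced only for illustration, not part of the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextToTextExample:
    """One training pair: both the input and the target are plain strings."""
    source: str
    target: str


def to_text_to_text(task_prefix: str, text: str, target: str) -> TextToTextExample:
    """Cast a (task, input, label) triple into a single string-to-string pair.

    The task is signalled only by the prefix prepended to the input, so one
    model and one training objective can serve every task.
    """
    return TextToTextExample(source=f"{task_prefix}: {text}", target=target)


# Assumed prefixes and strings, for illustration only.
examples = [
    # Translation: the target is the translated sentence.
    to_text_to_text("translate English to German", "That is good.", "Das ist gut."),
    # Classification: the class label is emitted as a literal word.
    to_text_to_text("cola sentence", "The course is jumping well.", "not acceptable"),
    # Summarization: the target is the summary text (placeholder strings here).
    to_text_to_text("summarize", "<article text>", "<summary text>"),
]

for ex in examples:
    print(f"{ex.source!r} -> {ex.target!r}")
```

Because every task reduces to mapping one string to another, classification, translation, and summarization differ only in the data fed to the model, which is what makes the systematic comparison described in the abstract possible under a single setup.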

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

An Introduction to Transformers
Richard E. Turner
2026
≈ 76%
Learning Transformer Programs
Dan Friedman, Alexander Wettig, Danqi Chen
2023
≈ 74%
Transfer Learning for Speech Recognition on a Budget
Julius Kunze, Louis Kirsch, Ilia Kurenkov, Andreas Krug, Jens Johannsmeier, Sebastian Stober
2017
≈ 71%
Transformers converge to invariant algorithmic cores
Joshua S. Schiffman
2026
≈ 71%
READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling
Thong Nguyen, Xiaobao Wu, Xinshuai Dong, Khoi Le, Zhiyuan Hu, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan
2026
≈ 71%
Probing Word Translations in the Transformer and Trading Decoder for Encoder Layers
Hongfei Xu, Josef van Genabith, Qiuhui Liu, Deyi Xiong
2021
≈ 71%
Learning Rate Transfer in Normalized Transformers
Boris Shigida, Boris Hanin, Andrey Gromov
2026
≈ 70%
Algorithmic Capabilities of Random Transformers
Ziqian Zhong, Jacob Andreas
2024
≈ 70%
To transfer or not transfer: Unified transferability metric and analysis
Qianshan Zhan and Xiao-Jun Zeng
2026
≈ 70%
Can Transformers Learn to Solve Problems Recursively?
Shizhuo Dylan Zhang, Curt Tigges, Stella Biderman, Maxim Raginsky, Talia Ringer
2023
≈ 70%
Transformer See, Transformer Do: Copying as an Intermediate Step in Learning Analogical Reasoning
Philipp Hellwig, Willem Zuidema, Claire E. Stevenson, Martha Lewis
2026
≈ 69%
Birth of a Transformer: A Memory Viewpoint
Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, Leon Bottou
2023
≈ 69%
Transfer Learning for Improving Speech Emotion Classification Accuracy
Siddique Latif, Rajib Rana, Shahzad Younis, Junaid Qadir, Julien Epps
2020
≈ 69%
Diversity of Transformer Layers: One Aspect of Parameter Scaling Laws
Hidetaka Kamigaito, Ying Zhang, Jingun Kwon, Katsuhiko Hayashi, Manabu Okumura, Taro Watanabe
2025
≈ 69%
How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability
Shawn Im, Changdae Oh, Zhen Fang, Sharon Li
2026
≈ 68%
Relating transformers to models and neural representations of the hippocampal formation
in corpus
2021
≈ 68%
Janus Information Flow Transformers 2025
in corpus
≈ 67%
A Mathematical Framework for Transformer Circuits
in corpus
2021
≈ 65%
Learning without neurons in physical systems
in corpus
2022
≈ 63%
Interpreting Language Model Parameters
in corpus
2026
≈ 63%
The Platonic Representation Hypothesis
in corpus
2024
≈ 63%
AI: a Bridge toward Diverse Intelligence and Humanity’s Future
in corpus
2024
≈ 62%
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
in corpus
≈ 62%
Unveiling the Latent Directions of Reflection in Large Language Models
in corpus
2025
≈ 62%
Model Alignment Search
in corpus
2025
≈ 62%
Simulators — LessWrong
in corpus
≈ 62%
Technological Approach to Mind Everywhere: An Experimentally-Grounded Framework for Understanding Diverse Bodies and Minds
in corpus
2022
≈ 62%
Steering Along Manifolds to Control Neural Networks
in corpus
≈ 61%

Cited by (3)

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
Propositional truth in LLMs is not encoded as a single linear direction but as a multi-dimensional subspace that can be characterized by concept cones—sets of all nonnegative linear combinations of or
Multimodal Chain-of-Thought Reasoning in Language Models
Incorporating visual features into chain-of-thought rationale generation—rather than answer generation alone—breaks the hallucination bottleneck that causes sub-100B language models to fail at multimo
Interpreting Language Model Parameters
VPD (adVersarial Parameter Decomposition) decomposes weight matrices directly into rank-one interpretable subcomponents rather than decomposing activations as sparse autoencoders (SAEs) do, flipping t