Test-Time Learning for Large Language Models
DOI: 10.48550/arXiv.2505.20633
Original abstract
While Large Language Models (LLMs) have exhibited remarkable emergent capabilities through extensive pre-training, they still face critical limitations in generalizing to specialized domains and handling diverse linguistic variations, known as distribution shifts. In this paper, we propose a Test-Time Learning (TTL) paradigm for LLMs, namely TLM, which dynamically adapts LLMs to target domains using only unlabeled test data during testing. Specifically, we first provide empirical evidence and theoretical insights to reveal that more accurate predictions from LLMs can be achieved by minimizing the input perplexity of the unlabeled test data. Based on this insight, we formulate the Test-Time Learning process of LLMs as input perplexity minimization, enabling self-supervised enhancement of LLM performance. Furthermore, we observe that high-perplexity samples tend to be more informative for model optimization. Accordingly, we introduce a Sample Efficient Learning Strategy that actively selects and emphasizes these high-perplexity samples for test-time updates. Lastly, to mitigate catastrophic forgetting and ensure adaptation stability, we adopt Low-Rank Adaptation (LoRA) instead of full-parameter optimization, which allows lightweight model updates while preserving more original knowledge from the model. We introduce the AdaptEval benchmark for TTL and demonstrate through experiments that TLM improves performance by at least 20% compared to original LLMs on domain knowledge adaptation.
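The abstract's two central ideas, input perplexity as a self-supervised objective and the preference for high-perplexity samples, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `threshold` parameter, and the ratio-based weight are hypothetical choices made for illustration, and the per-token log-probabilities are assumed to come from whatever model is being adapted.

```python
import math
from typing import Dict, List, Sequence, Tuple


def input_perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one input: exp of the mean negative log-likelihood.

    `token_logprobs` holds the natural-log probability the model assigned
    to each token of the input (assumed to be computed elsewhere).
    """
    if len(token_logprobs) == 0:
        raise ValueError("token_logprobs must contain at least one value")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def select_high_perplexity(
    batch: Dict[str, Sequence[float]],
    threshold: float,
) -> List[Tuple[str, float, float]]:
    """Keep samples whose input perplexity exceeds `threshold`.

    Returns (sample_id, perplexity, weight) tuples. The weight
    (perplexity / threshold) is a hypothetical way to emphasize
    harder samples; the abstract does not specify the scheme.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    selected = []
    for sample_id, logprobs in batch.items():
        ppl = input_perplexity(logprobs)
        if ppl > threshold:
            selected.append((sample_id, ppl, ppl / threshold))
    return selected
```

In a full test-time learning loop, the selected samples would drive a weighted language-modeling loss on the unlabeled inputs, with gradients applied only to LoRA parameters as the abstract describes. That update step is omitted here because it depends on the model and training framework.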
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.
- A Survey of Large Language Models | Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, Ji-Rong Wen | 2026 | ≈ 78%
- Across the Levels of Analysis: Explaining Predictive Processing in Humans Requires More Than Machine-Estimated Probabilities | Sathvik Nair, Colin Phillips | 2026 | ≈ 77%
- What do Language Models Learn and When? The Implicit Curriculum Hypothesis | Emmy Liu, Kaiser Sun, Millicent Li, Isabelle Lee, Lindia Tjuatja, Jen-tse Huang, Graham Neubig | 2026 | ≈ 76%
- Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models | Jonas Hübotter, Patrik Wolf, Alexander Shevchenko, Dennis Jüni, Andreas Krause, Gil Kur | 2026 | ≈ 75%
- Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance | Branislav Pecher, Ivan Srba, Maria Bielikova | 2026 | ≈ 75%
- An Evaluation on Large Language Model Outputs: Discourse and Memorization | Adrian de Wynter, Xun Wang, Alex Sokolov, Qilong Gu, Si-Qing Chen | 2026 | ≈ 75%
- Advancing the Scientific Method with Large Language Models: From Hypothesis to Discovery | Yanbo Zhang, Sumeer A. Khan, Adnan Mahmud, Huck Yang, Alexander Lavin, Michael Levin, Jeremy Frey, Jared Dunnmon, James Evans, Alan Bundy, Saso Dzeroski, Jesper Tegner, Hector Zenil | 2025 | ≈ 75%
- When Do You Need Billions of Words of Pretraining Data? | Yian Zhang, Alex Warstadt, Haau-Sing Li, Samuel R. Bowman | 2020 | ≈ 75%
- Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners | Shimao Zhang, Changjiang Gao, Wenhao Zhu, Jiajun Chen, Xin Huang, Xue Han, Junlan Feng, Chao Deng, Shujian Huang | 2024 | ≈ 75%
- A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks | Hieu Minh "Jord" Nguyen | 2025 | ≈ 74%
- Evaluating Large Language Models with Psychometrics | Yuan Li, Yue Huang, Hongyi Wang, Ying Cheng, Xiangliang Zhang, James Zou, Lichao Sun | 2025 | ≈ 74%
- Model Alignment Search | in corpus | 2025 | ≈ 68%
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets | in corpus | 2023 | ≈ 67%
- Interpreting Language Model Parameters | in corpus | 2026 | ≈ 67%
- Testing the Limits of Truth Directions in LLMs | in corpus | 2026 | ≈ 67%
- Learning without neurons in physical systems | in corpus | 2022 | ≈ 67%
- Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training | in corpus | 2026 | ≈ 66%
- A Mathematical Framework for Transformer Circuits | in corpus | 2021 | ≈ 66%
Similar preprints: Semantic Scholar
Cited by (1)
- The Guanyin Protocol: A Framework for Immediately Establishing an Understanding of Both Causality and Compassion in LLM Systems Using Semantic Anchoring
Semantic anchoring — the binding of a pretrained model's latent patterns to task-specific targets via external structure — predicts threshold-like performance flips with a single calibrated score S =