Test-Time Learning for Large Language Models

By Jinwu Hu · Z. Zhang · Guohao Chen · Xutao Wen · Chao Shuai · Wei Luo +3 more

DOI 10.48550/arxiv.2505.20633 arXiv 2505.20633

Original abstract

While Large Language Models (LLMs) have exhibited remarkable emergent capabilities through extensive pre-training, they still face critical limitations in generalizing to specialized domains and handling diverse linguistic variations, known as distribution shifts. In this paper, we propose a Test-Time Learning (TTL) paradigm for LLMs, namely TLM, which dynamically adapts LLMs to target domains using only unlabeled test data during testing. Specifically, we first provide empirical evidence and theoretical insights to reveal that more accurate predictions from LLMs can be achieved by minimizing the input perplexity of the unlabeled test data. Based on this insight, we formulate the Test-Time Learning process of LLMs as input perplexity minimization, enabling self-supervised enhancement of LLM performance. Furthermore, we observe that high-perplexity samples tend to be more informative for model optimization. Accordingly, we introduce a Sample Efficient Learning Strategy that actively selects and emphasizes these high-perplexity samples for test-time updates. Lastly, to mitigate catastrophic forgetting and ensure adaptation stability, we adopt Low-Rank Adaptation (LoRA) instead of full-parameter optimization, which allows lightweight model updates while preserving more original knowledge from the model. We introduce the AdaptEval benchmark for TTL and demonstrate through experiments that TLM improves performance by at least 20% compared to original LLMs on domain knowledge adaptation.

Cited by (1)

The Guanyin Protocol: A Framework for Immediately Establishing an Understanding of Both Causality and Compassion in LLM Systems Using Semantic Anchoring
Semantic anchoring — the binding of a pretrained model's latent patterns to task-specific targets via external structure — predicts threshold-like performance flips with a single calibrated score S =