Dual-Balancing for Multi-Task Learning
DOI: 10.48550/arXiv.2308.12029
TL;DR
Addressing loss-scale and gradient-magnitude imbalance together in multi-task learning yields state-of-the-art ∆p on five benchmarks. DB-MTL (Dual-Balancing Multi-Task Learning) achieves ∆p = +1.15% on NYUv2 (the next-best method, GLS, reaches +0.30%), +8.91% on NYUv2 with SegNet (Aligned-MTL: +8.16%), −58.10% on QM9 (Nash-MTL: −73.92%), and +1.05% on Office-31. DB-MTL combines two parameter-free components: a logarithm transformation on each task loss, which IMTL-L provably reduces to when its learnable scale parameter sits at its exact minimizer, and a maximum-norm gradient normalization that rescales every task gradient to the magnitude of the largest task gradient at each iteration, with gradients estimated by an exponential moving average. On NYUv2, most competing methods (including PCGrad, CAGrad, Nash-MTL, and IMTL-G) improve semantic segmentation and depth estimation over single-task learning but degrade surface normal prediction, a failure mode DB-MTL avoids. Applied on its own, the logarithm transformation also improves six existing gradient balancing methods (PCGrad, GradVac, IMTL-G, CAGrad, Nash-MTL, Aligned-MTL) on NYUv2, and ablation shows that the two components are complementary on all five datasets. The paper argues that loss-scale and gradient-magnitude imbalances are complementary failure modes that neither pure loss-balancing nor pure gradient-balancing methods resolve alone, and that a non-parametric dual correction closes this gap with no computational overhead beyond that of existing gradient balancing methods.
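The two components named in the summary can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `db_mtl_step`, the stabilizing constant `eps`, and the EMA form `beta * previous + (1 - beta) * current` (chosen so that β = 0 means no smoothing) are hypothetical, and gradients are passed in as plain arrays rather than computed by backpropagation.

```python
import numpy as np

def db_mtl_step(losses, grads, ema_grads, beta, eps=1e-8):
    """One hypothetical dual-balancing step on the shared parameters.

    losses    : positive task-loss values, one float per task
    grads     : gradient of each raw task loss w.r.t. the shared parameters
    ema_grads : running gradient estimates carried over from the last step
    beta      : EMA forgetting rate (assumed form; beta = 0 disables smoothing)
    """
    # Loss-scale balancing: the gradient of log(loss) is grad(loss) / loss.
    log_grads = [g / (l + eps) for l, g in zip(losses, grads)]

    # Smooth each task gradient with an exponential moving average.
    new_ema = [beta * e + (1.0 - beta) * g for e, g in zip(ema_grads, log_grads)]

    # Gradient-magnitude balancing: normalize every task gradient to unit
    # norm, then rescale the sum by the largest task-gradient norm.
    norms = np.array([np.linalg.norm(g) for g in new_ema])
    alpha = norms.max()
    update = alpha * sum(g / (n + eps) for g, n in zip(new_ema, norms))
    return update, new_ema

# Two toy tasks whose losses and gradients differ by four orders of magnitude.
losses = [100.0, 0.01]
grads = [np.array([100.0, 0.0]), np.array([0.0, 0.01])]
ema = [np.zeros(2), np.zeros(2)]
update, new_ema = db_mtl_step(losses, grads, ema, beta=0.0)
print(update)
```

In this toy case both tasks end up contributing a direction of the same norm, so the update is approximately `[1, 1]` even though the raw gradients differ by a factor of 10,000.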
What to take away
- 1. DB-MTL achieves ∆p = +1.15±0.16% on NYUv2 with DeepLabV3+/ResNet-50, outperforming all 21 baselines including the second-best GLS (+0.30%) and Nash-MTL (−1.01%).
- 2. On the QM9 11-task molecular property prediction benchmark, DB-MTL achieves ∆p = −58.10±3.89%, improving over the previous best MTL method Nash-MTL (−73.92%), though no MTL method surpasses single-task learning on this dataset.
- 3. On Office-31 with ResNet-18, DB-MTL achieves average test accuracy of 94.09±0.19% and ∆p = +1.05±0.20%, the only method besides IGBv2 (+0.56%) to outperform STL (93.03%).
- 4. The logarithm transformation is provably equivalent to IMTL-L's parametric loss rescaling when IMTL-L's learnable parameter s_t reaches its exact per-iteration minimizer (Proposition 3.1: log(x) = min_s (e^s·x − s − 1)), making DB-MTL's loss component parameter-free by construction.
- 5. Ablation on five datasets (Table 5) shows that combining both components strictly dominates using either alone: on NYUv2, loss-only gives ∆p = +0.06%, gradient-only gives +0.76%, and both give +1.15%; on QM9 the pattern is −74.40%, −65.73%, and −58.10%, respectively.
- 6. The choice of gradient scaling factor αk is critical: setting αk to the maximum task gradient norm outperforms constant scalings (0.01–10), minimum, mean, and median strategies by large margins on NYUv2 (Figure 6), with only the maximum-norm strategy achieving positive ∆p.
- 7. Applying the logarithm transformation to six existing gradient balancing methods (PCGrad, GradVac, IMTL-G, CAGrad, Nash-MTL, Aligned-MTL) consistently improves their ∆p on NYUv2, but DB-MTL still surpasses all of these augmented baselines.
- 8. A replicable methodology choice: gradient estimates use an exponential moving average (EMA) whose forgetting rate β is selected per dataset by grid search over {0.1, 0.5, 0.9, 0.1/k^0.5, 0.5/k^0.5, 0.9/k^0.5}; ablation on Office-31 shows ∆p is robust over the range β ∈ {0.1/k^0.5, …, 0.9/k^0.5} but degrades at β = 0 (no EMA).
- 9. DB-MTL's per-epoch runtime on NYUv2 (NVIDIA RTX 3090) is comparable to other gradient balancing methods and higher than loss-only methods, because per-task gradients must be computed each iteration—an acknowledged limitation shared with GradNorm, CAGrad, Nash-MTL, and others.
- 10. An open question raised by the paper is whether incorporating gradient variance (in addition to gradient magnitudes) would yield further improvements, and whether theoretical convergence guarantees and optimality conditions can be derived for the dual-balancing update rule.
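The identity quoted in takeaway 4 can be verified by hand. For fixed x > 0, let f(s) = e^s·x − s − 1. Then f′(s) = e^s·x − 1 vanishes at s* = −log(x), f″(s) = e^s·x > 0 makes that point a minimum, and f(s*) = 1 + log(x) − 1 = log(x). The short check below confirms this numerically; the function names are hypothetical and only the formula itself comes from the summary.

```python
import math

def imtl_l_objective(s, x):
    """The expression e^s * x - s - 1 from Proposition 3.1."""
    return math.exp(s) * x - s - 1.0

def minimized_objective(x):
    """Evaluate the objective at its closed-form minimizer s* = -log(x)."""
    s_star = -math.log(x)
    return imtl_l_objective(s_star, x)

# Compare the minimized objective with log(x) for a few loss values,
# and record how much worse the objective gets when s is perturbed.
gaps = {}
for x in (0.01, 1.0, 250.0):
    at_minimum = minimized_objective(x)
    perturbed = min(imtl_l_objective(-math.log(x) + d, x) for d in (-0.5, 0.5))
    gaps[x] = (at_minimum - math.log(x), perturbed - math.log(x))
print(gaps)
```

The first entry of each pair is zero up to floating-point error, and the second is strictly positive, which is the sense in which a learnable s_t can only match the logarithm at its exact minimizer.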
Peer brief — for seminar discussion
DB-MTL addresses the task-balancing problem in multi-task learning by combining two lightweight, parameter-free operations: a logarithm transformation applied to each task's loss before backpropagation, and a maximum-norm gradient normalization that rescales all task-specific gradients to the magnitude of the largest gradient norm at each iteration, estimated via exponential moving average. The method is evaluated across five benchmarks—NYUv2 (3-task scene understanding, 795 training images), Cityscapes (2-task, 2975 training images), QM9 (11-task molecular property prediction, 110k training samples), Office-31 (3-task image classification), and Office-Home (4-task image classification)—using DeepLabV3+/ResNet-50, SegNet, ResNet-18, and a graph neural network as shared encoders depending on the domain. Against 21 baselines spanning loss-balancing (UW, DWA, IMTL-L, IGBv2), gradient-balancing (GradNorm, PCGrad, Nash-MTL, Aligned-MTL, MoCo, among others), and hybrid methods (IMTL), DB-MTL achieves the best average relative improvement ∆p on all five datasets: +1.15% on NYUv2 versus GLS's +0.30%, +8.91% on NYUv2/SegNet versus Aligned-MTL's +8.16%, −58.10% on QM9 versus Nash-MTL's −73.92%, and +1.05% on Office-31. The load-bearing finding is that most competing methods improve on the easier tasks (segmentation, depth) while degrading on surface normal prediction on NYUv2—a canonical symptom of task imbalance—whereas DB-MTL is the only method to match single-task learning performance on surface normal prediction while remaining competitive on the other tasks. The paper also demonstrates that the logarithm transformation, as a standalone plug-in, improves six existing gradient balancing methods when added to them, but the full dual approach remains superior. 
For loss balancing, the authors could instead have used IMTL-L's learned rescaling. They prove that IMTL-L recovers the logarithm transformation only when its parameter is at the exact per-iteration minimizer, so IMTL-L is more expensive and performs worse empirically. The paper leaves gradient variance unaddressed and raises, as an open question, whether incorporating it into the normalization strategy would yield further improvement. A critical reader would push back on the reliance on ∆p, a single scalar that averages relative improvements across tasks with very different metrics and scales, as the primary performance criterion: this aggregation can mask large per-task regressions and makes it difficult to assess whether DB-MTL genuinely resolves task conflict or simply rebalances which tasks bear the cost of shared-parameter compromise. The scope is also limited to hard-parameter sharing architectures, leaving open whether dual balancing transfers to soft-sharing or mixture-of-experts designs where gradient interference manifests differently.
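To make the concern about ∆p concrete, the sketch below computes an average relative improvement over single-task learning. It assumes the usual construction (per-metric relative change against STL, sign-flipped for lower-is-better metrics, averaged within each task and then across tasks); the exact averaging and the function names are assumptions here. The only numbers used are the two EW surface-normal metrics quoted in this summary, so the result is not the paper's reported ∆p.

```python
def relative_improvement(mtl, stl, higher_is_better):
    """Percentage change of an MTL metric against STL, signed so that
    positive always means the MTL model is better."""
    sign = 1.0 if higher_is_better else -1.0
    return 100.0 * sign * (mtl - stl) / stl

def delta_p(tasks):
    """tasks maps a task name to a list of (mtl, stl, higher_is_better)."""
    per_task = {
        name: sum(relative_improvement(*m) for m in metrics) / len(metrics)
        for name, metrics in tasks.items()
    }
    overall = sum(per_task.values()) / len(per_task)
    return overall, per_task

# EW vs. STL on NYUv2 surface normal prediction (two metrics only).
normal_metrics = [
    (23.57, 21.99, False),  # mean angle error, lower is better
    (35.04, 39.04, True),   # share of pixels within 11.25 degrees, higher is better
]
overall, per_task = delta_p({"surface_normal": normal_metrics})
print(round(overall, 2))
```

On these two metrics alone the score is roughly −8.7. Once other tasks with positive scores are added to the dictionary, the overall average moves toward zero or above while the per-task entry stays negative, which is why reporting `per_task` alongside the scalar is informative.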
Findings (14)
- The logarithm transformation (loss-scale balancing) consistently outperforms IMTL-L on NYUv2, Cityscapes, Office-31, Office-Home.
Comparison of loss-scale balancing with IMTL-L.
- On NYUv2, EW suffers a drop in surface normal prediction (mean angle error 23.57 vs STL 21.99, within 11.25° 35.04 vs 39.04).
Task balancing issue where surface normal prediction degrades under EW.
- DB-MTL increases gradient cosine similarity faster and keeps it positive on Office-31, reducing gradient conflict vs EW.
Analysis of gradient conflict reduction.
- DB-MTL training losses decrease smoothly and gradient norms are lower than EW on NYUv2, indicating training stability.
Training stability analysis.
- DB-MTL has similar per-epoch running time to gradient balancing methods on NYUv2, slower than loss balancing methods.
Computational efficiency comparison.
- Logarithm transformation improves PCGrad, GradVac, IMTL-G, CAGrad, Nash-MTL, and Aligned-MTL on NYUv2 (Figure 1).
Effectiveness of logarithm transformation as a plug-in for gradient balancing methods.
- The gradient-magnitude balancing method outperforms GradNorm on NYUv2, Cityscapes, Office-31, Office-Home.
Comparison of gradient-magnitude balancing with GradNorm.
- DB-MTL with EMA forgetting rate β in a wide range performs better than without EMA (β=0) on Office-31.
Effect of EMA forgetting rate on performance.
- Setting αk to the maximum gradient norm performs best among tested strategies on NYUv2 (Figure 6).
Sensitivity analysis for gradient normalization scaling factor.
- DB-MTL achieves ∆p = +1.15±0.16% on NYUv2, outperforming all baselines, including state-of-the-art methods.
Primary empirical validation on the scene understanding task.
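The gradient-conflict finding above relies on the cosine similarity between task gradients. A minimal way to measure it is sketched here; the helper names and the choice to average over all task pairs are assumptions, not details taken from the paper.

```python
import numpy as np

def gradient_cosine(g_a, g_b, eps=1e-12):
    """Cosine similarity between two flattened task gradients.
    Negative values indicate conflicting update directions."""
    g_a = np.ravel(g_a)
    g_b = np.ravel(g_b)
    return float(g_a @ g_b / (np.linalg.norm(g_a) * np.linalg.norm(g_b) + eps))

def mean_pairwise_cosine(grads):
    """Average cosine similarity over all unordered pairs of tasks."""
    values = [
        gradient_cosine(grads[i], grads[j])
        for i in range(len(grads))
        for j in range(i + 1, len(grads))
    ]
    return sum(values) / len(values)

# Toy gradients: two aligned tasks and one orthogonal task.
toy_grads = [np.array([1.0, 0.0]), np.array([2.0, 0.0]), np.array([0.0, 3.0])]
print(mean_pairwise_cosine(toy_grads))
```

Logging this quantity once per epoch for each method would give curves of the kind the finding describes, with values staying above zero indicating reduced conflict.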
Claims (14)
- The proposed gradient-magnitude balancing method consistently outperforms GradNorm, as it guarantees equal gradient magnitudes and considers update magnitude.
Advantage over GradNorm.
- Loss-scale balancing and gradient-magnitude balancing are complementary and combining them achieves the best performance.
Ablation conclusion.
- IMTL-L is equivalent to the logarithm transformation when its parameter s_t is the exact minimizer in each iteration.
Mathematical relationship between IMTL-L and log transformation.
- DB-MTL is a simple yet effective method that addresses both loss-scale and gradient-magnitude imbalances.
Core claim of the paper.
- The logarithm transformation is simpler and more effective than IMTL-L because it is parameter-free.
Comparison of loss-scale balancing techniques.
- Setting the scaling factor αk of the aggregated gradient to the maximum gradient norm performs best for task balancing.
Empirical finding on the choice of αk in the gradient normalization strategy.
- The magnitude of the normalized gradients (choice of αk) plays an important role in performance.
Insight about gradient normalization scaling.
- Task balancing requires simultaneous consideration of both loss scales and gradient magnitudes.
Core interpretive position of DB-MTL: the loss and gradient perspectives are complementary.
- The logarithm transformation is simpler and more effective than a learnable loss transformation.
Compared with IMTL-L, it is parameter-free, adds no computational cost, and achieves the same theoretical goal.
- The logarithm transformation also benefits existing gradient balancing methods.
Generalization of the loss transformation.
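The claims about the scaling factor αk can be made concrete with a small comparison of the strategies named in this summary (maximum, minimum, mean, median, constant). The sketch is illustrative only: the function name, the `eps` constant, and the toy gradients are hypothetical, and it omits the EMA smoothing.

```python
import numpy as np

def aggregate_normalized(grads, strategy="max", constant=1.0, eps=1e-8):
    """Sum unit-normalized task gradients and rescale by a chosen alpha."""
    norms = np.array([np.linalg.norm(g) for g in grads])
    unit_sum = sum(g / (n + eps) for g, n in zip(grads, norms))
    alpha = {
        "max": norms.max(),
        "min": norms.min(),
        "mean": norms.mean(),
        "median": float(np.median(norms)),
        "constant": constant,
    }[strategy]
    return alpha * unit_sum

# Toy task gradients with very different magnitudes.
toy_grads = [np.array([3.0, 0.0]), np.array([0.0, 0.5]), np.array([0.1, 0.1])]
step_sizes = {
    name: float(np.linalg.norm(aggregate_normalized(toy_grads, name)))
    for name in ("max", "min", "mean", "median", "constant")
}
print(step_sizes)
```

Every strategy produces the same update direction; only the step length differs, so the choice of αk acts like a per-iteration learning-rate multiplier. That is consistent with the claim that the magnitude of the normalized gradients matters for performance.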
Hypotheses (1)
- When task gradient norms differ greatly, large-norm tasks have not converged while small-norm tasks have nearly converged
Motivates setting αk = max norm to enable further learning on under-converged tasks
Questions (1)
- task balancing problem
Core challenge where disparity in loss and gradient scales among tasks leads to performance compromises
Original abstract
Multi-task learning aims to learn multiple related tasks simultaneously and has achieved great success in various fields. However, the disparity in loss and gradient scales among tasks often leads to performance compromises, and the balancing of tasks remains a significant challenge. In this paper, we propose Dual-Balancing Multi-Task Learning (DB-MTL) to achieve task balancing from both the loss and gradient perspectives. Specifically, DB-MTL achieves loss-scale balancing by performing logarithm transformation on each task loss, and rescales gradient magnitudes by normalizing all task gradients to comparable magnitudes using the maximum gradient norm. Extensive experiments on a number of benchmark datasets demonstrate that DB-MTL consistently performs better than the current state-of-the-art.