Dual-Balancing for Multi-Task Learning

By Baijiong Lin · Weisen Jiang · Feiyang Ye · Yu Zhang · Pengguang Chen · Ying-Cong Chen +3 more
The Hong Kong University of Science and Technology (Guangzhou)

DOI 10.48550/arxiv.2308.12029 · arXiv 2308.12029 · OpenAlex W4386148469

LLM Interpretability & Behavioral Analysis · LLM interpretability & self-awareness · Gradient conflict · Multi-objective optimization · Relative performance improvement (Δp) · Task balancing · Task-sharing parameter · Task-specific parameter · Task weight

TL;DR

Simultaneously addressing both loss-scale and gradient-magnitude imbalance in multi-task learning yields state-of-the-art average relative improvement (∆p) across five benchmarks: DB-MTL (Dual-Balancing Multi-Task Learning) achieves ∆p = +1.15% on NYUv2 (versus +0.30% for the next-best method, GLS), +8.91% on NYUv2 with SegNet (surpassing Aligned-MTL's +8.16%), −58.10% on QM9 (versus Nash-MTL's −73.92%), and +1.05% on Office-31. DB-MTL combines two parameter-free, training-efficient components: a logarithm transformation on each task loss, which is provably equivalent to IMTL-L when IMTL-L's learnable scale parameter sits at its exact per-iteration minimizer, and a maximum-norm gradient normalization, which smooths each task gradient with an exponential moving average and rescales all task gradients to the largest task-gradient norm at each iteration. On NYUv2, most competing methods (including PCGrad, CAGrad, Nash-MTL, and IMTL-G) improve semantic segmentation and depth estimation over single-task learning but degrade surface normal prediction, a failure mode DB-MTL specifically avoids. Applied on its own, the logarithm transformation also improves six existing gradient balancing methods (PCGrad, GradVac, IMTL-G, CAGrad, Nash-MTL, Aligned-MTL) on NYUv2, and ablation shows that combining both components beats either one alone across all five datasets. The paper argues that loss-scale and gradient-magnitude imbalances are complementary failure modes that neither pure loss-balancing nor pure gradient-balancing methods resolve alone, and that a non-parametric dual correction closes this gap without added computational overhead relative to existing gradient balancing methods.
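Why a logarithm transformation balances loss scales can be seen from the chain rule: the gradient of log ℓ is ∇ℓ/ℓ, so multiplying a loss by a constant leaves that gradient unchanged. The sketch below checks this numerically. The one-parameter quadratic loss and all function names are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
import math

def toy_loss(theta, scale):
    # Hypothetical task loss: a scaled quadratic that is always positive.
    return scale * ((theta - 2.0) ** 2 + 1.0)

def grad(f, theta, h=1e-6):
    # Central finite-difference approximation of df/dtheta.
    return (f(theta + h) - f(theta - h)) / (2 * h)

def raw_and_log_grads(theta, scale):
    # Gradient of the raw loss and of its logarithm at the same point.
    raw = grad(lambda t: toy_loss(t, scale), theta)
    logged = grad(lambda t: math.log(toy_loss(t, scale)), theta)
    return raw, logged

# Same loss shape at two very different scales.
raw_small, log_small = raw_and_log_grads(0.5, scale=1.0)
raw_large, log_large = raw_and_log_grads(0.5, scale=1000.0)

print("raw gradients:", raw_small, raw_large)   # differ by the scale factor
print("log gradients:", log_small, log_large)   # agree up to numerical error
```

The raw gradients differ by the factor of 1000 that separates the two losses, while the log-transformed gradients coincide, which is the sense in which the transformation removes loss-scale differences.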

What to take away

1. DB-MTL achieves ∆p = +1.15±0.16% on NYUv2 with DeepLabV3+/ResNet-50, outperforming all 21 baselines including the second-best GLS (+0.30%) and Nash-MTL (−1.01%).
2. On the QM9 11-task molecular property prediction benchmark, DB-MTL achieves ∆p = −58.10±3.89%, improving over the previous best MTL method Nash-MTL (−73.92%), though no MTL method surpasses single-task learning on this dataset.
3. On Office-31 with ResNet-18, DB-MTL achieves average test accuracy of 94.09±0.19% and ∆p = +1.05±0.20%, the only method besides IGBv2 (+0.56%) to outperform STL (93.03%).
4. The logarithm transformation is provably equivalent to IMTL-L's parametric loss rescaling when IMTL-L's learnable parameter s_t reaches its exact per-iteration minimizer (Proposition 3.1: log(x) = min_s (e^s·x − s − 1)), making DB-MTL's loss component parameter-free by construction.
5. Ablation on five datasets (Table 5) shows that combining both components strictly dominates using either alone: on NYUv2, loss-only gives ∆p = +0.06%, gradient-only gives +0.76%, and both give +1.15%; on QM9 the pattern is −74.40%, −65.73%, and −58.10%, respectively.
6. The choice of gradient scaling factor αk is critical: setting αk to the maximum task gradient norm outperforms constant scalings (0.01–10), minimum, mean, and median strategies by large margins on NYUv2 (Figure 6), with only the maximum-norm strategy achieving positive ∆p.
7. Applying the logarithm transformation to six existing gradient balancing methods (PCGrad, GradVac, IMTL-G, CAGrad, Nash-MTL, Aligned-MTL) consistently improves their ∆p on NYUv2, but DB-MTL still surpasses all of these augmented baselines.
8. A replicable methodology choice: gradient estimates use an exponential moving average (EMA) whose forgetting rate β is selected per dataset by grid search over {0.1, 0.5, 0.9, 0.1/k^0.5, 0.5/k^0.5, 0.9/k^0.5}; an ablation on Office-31 shows ∆p is robust over β ∈ {0.1/k^0.5, …, 0.9/k^0.5} but degrades at β = 0 (no EMA).
9. DB-MTL's per-epoch runtime on NYUv2 (NVIDIA RTX 3090) is comparable to other gradient balancing methods and higher than loss-only methods, because per-task gradients must be computed each iteration—an acknowledged limitation shared with GradNorm, CAGrad, Nash-MTL, and others.
10. An open question raised by the paper is whether incorporating gradient variance (in addition to gradient magnitudes) would yield further improvements, and whether theoretical convergence guarantees and optimality conditions can be derived for the dual-balancing update rule.
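The identity quoted for Proposition 3.1 in item 4 can be checked by hand. The short derivation below assumes x > 0 and uses only the expression stated there; it is a verification sketch, not the paper's own proof.

```latex
\begin{align*}
f(s) &= e^{s}x - s - 1, \qquad x > 0,\\
f'(s) &= e^{s}x - 1 = 0 \;\Longrightarrow\; s^{\star} = -\log x,\\
f''(s) &= e^{s}x > 0 \quad \text{(so $s^{\star}$ is the unique global minimizer)},\\
f(s^{\star}) &= e^{-\log x}\,x + \log x - 1 = 1 + \log x - 1 = \log x .
\end{align*}
```

Because the minimizer is available in closed form, substituting it removes the learnable parameter entirely, which is why the logarithm transformation needs no extra parameters.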

Peer brief — for seminar discussion

DB-MTL addresses the task-balancing problem in multi-task learning by combining two lightweight, parameter-free operations: a logarithm transformation applied to each task's loss before backpropagation, and a maximum-norm gradient normalization that rescales all task-specific gradients to the magnitude of the largest gradient norm at each iteration, estimated via exponential moving average. The method is evaluated across five benchmarks—NYUv2 (3-task scene understanding, 795 training images), Cityscapes (2-task, 2975 training images), QM9 (11-task molecular property prediction, 110k training samples), Office-31 (3-task image classification), and Office-Home (4-task image classification)—using DeepLabV3+/ResNet-50, SegNet, ResNet-18, and a graph neural network as shared encoders depending on the domain. Against 21 baselines spanning loss-balancing (UW, DWA, IMTL-L, IGBv2), gradient-balancing (GradNorm, PCGrad, Nash-MTL, Aligned-MTL, MoCo, among others), and hybrid methods (IMTL), DB-MTL achieves the best average relative improvement ∆p on all five datasets: +1.15% on NYUv2 versus GLS's +0.30%, +8.91% on NYUv2/SegNet versus Aligned-MTL's +8.16%, −58.10% on QM9 versus Nash-MTL's −73.92%, and +1.05% on Office-31. The load-bearing finding is that most competing methods improve on the easier tasks (segmentation, depth) while degrading on surface normal prediction on NYUv2—a canonical symptom of task imbalance—whereas DB-MTL is the only method to match single-task learning performance on surface normal prediction while remaining competitive on the other tasks. The paper also demonstrates that the logarithm transformation, as a standalone plug-in, improves six existing gradient balancing methods when added to them, but the full dual approach remains superior. 
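The two operations described above can be sketched as a single aggregation step. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the function name `dual_balancing_step`, the stabilizer `eps`, and the EMA convention (new estimate = β · previous + (1 − β) · current, so that β = 0 means no smoothing) are assumptions made for the example.

```python
import math

def norm(v):
    # Euclidean norm of a gradient stored as a list of floats.
    return math.sqrt(sum(x * x for x in v))

def dual_balancing_step(losses, grads, ema, beta, eps=1e-8):
    """One illustrative dual-balancing aggregation step.

    losses: positive task losses at this iteration
    grads:  per-task gradients of the raw losses w.r.t. shared parameters
    ema:    running gradient estimates from the previous iteration
    beta:   forgetting rate (assumed convention: beta = 0 disables smoothing)
    Returns (aggregated_gradient, new_ema).
    """
    new_ema = []
    for loss, g, prev in zip(losses, grads, ema):
        # Gradient of log(loss) is grad(loss) / loss by the chain rule.
        log_grad = [x / (loss + eps) for x in g]
        # Exponential moving average of the log-loss gradient.
        new_ema.append([beta * p + (1.0 - beta) * x
                        for p, x in zip(prev, log_grad)])
    norms = [norm(g) for g in new_ema]
    alpha = max(norms)  # scale every task to the largest gradient norm
    dim = len(new_ema[0])
    aggregated = [0.0] * dim
    for g, n in zip(new_ema, norms):
        for i in range(dim):
            aggregated[i] += alpha * g[i] / (n + eps)
    return aggregated, new_ema

# Two hypothetical tasks with very different loss and gradient scales.
losses = [100.0, 0.5]
grads = [[300.0, 0.0], [0.0, 0.1]]
ema = [[0.0, 0.0], [0.0, 0.0]]
update, ema = dual_balancing_step(losses, grads, ema, beta=0.0)
print(update)
```

In this toy case the raw gradients differ in norm by a factor of 3000, yet both tasks end up contributing components of equal magnitude to the aggregated update.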
The natural alternative for loss balancing is IMTL-L's learned rescaling. The authors prove that IMTL-L coincides with the logarithm transformation only when its scale parameter sits at the exact per-iteration minimizer, so IMTL-L carries an extra learnable parameter and, in the reported comparisons, performs worse. The paper leaves open whether incorporating gradient variance, currently unaddressed, into the normalization strategy would yield further improvement. A critical reader would push back on the reliance on ∆p as the primary performance criterion: it is a single scalar that averages relative improvements across tasks with very different metrics and scales, so it can mask large per-task regressions and makes it difficult to assess whether DB-MTL genuinely resolves task conflict or simply rebalances which tasks bear the cost of shared-parameter compromise. The scope is also limited to hard-parameter sharing architectures, leaving open whether dual balancing transfers to soft-sharing or mixture-of-experts designs where gradient interference manifests differently.
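The concern about ∆p hiding per-task regressions is easy to reproduce. The sketch below assumes ∆p is computed as a signed relative change against the single-task (STL) value for each metric, averaged within each task and then across tasks; that aggregation order, the function name `delta_p`, and all metric values are assumptions made for illustration, not figures from the paper.

```python
def delta_p(mtl, stl, higher_is_better):
    """Average relative improvement over single-task baselines, in percent.

    mtl, stl:         dict mapping task name -> list of metric values
    higher_is_better: dict mapping task name -> list of bools, one per metric
    Returns (overall, per_task). Assumed order: average within a task first,
    then average across tasks.
    """
    per_task = {}
    for task in mtl:
        rel = []
        for m, s, hib in zip(mtl[task], stl[task], higher_is_better[task]):
            sign = 1.0 if hib else -1.0  # flip sign for error-type metrics
            rel.append(100.0 * sign * (m - s) / s)
        per_task[task] = sum(rel) / len(rel)
    overall = sum(per_task.values()) / len(per_task)
    return overall, per_task

# Hypothetical numbers: task_a gains 10%, task_b (an error metric) loses 8%.
stl = {"task_a": [50.0], "task_b": [20.0]}
mtl = {"task_a": [55.0], "task_b": [21.6]}
hib = {"task_a": [True], "task_b": [False]}

overall, per_task = delta_p(mtl, stl, hib)
print(overall, per_task)
```

The overall score is positive even though one task is clearly worse than its single-task baseline, which is why per-task numbers are worth reading alongside the aggregate.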

Findings (14)

The logarithm transformation (loss-scale balancing) consistently outperforms IMTL-L on NYUv2, Cityscapes, Office-31, Office-Home.
Comparison of loss-scale balancing with IMTL-L.
On NYUv2, EW suffers a drop in surface normal prediction (mean angle error 23.57 vs STL 21.99, within 11.25° 35.04 vs 39.04).
Task balancing issue where surface normal prediction degrades under EW.
DB-MTL increases gradient cosine similarity faster and keeps it positive on Office-31, reducing gradient conflict vs EW.
Analysis of gradient conflict reduction.
DB-MTL training losses decrease smoothly and gradient norms are lower than EW on NYUv2, indicating training stability.
Training stability analysis.
DB-MTL has similar per-epoch running time to gradient balancing methods on NYUv2, slower than loss balancing methods.
Computational efficiency comparison.
Logarithm transformation improves PCGrad, GradVac, IMTL-G, CAGrad, Nash-MTL, and Aligned-MTL on NYUv2 (Figure 1).
Effectiveness of logarithm transformation as a plug-in for gradient balancing methods.
The gradient-magnitude balancing method outperforms GradNorm on NYUv2, Cityscapes, Office-31, Office-Home.
Comparison of gradient-magnitude balancing with GradNorm.
DB-MTL with EMA forgetting rate β in a wide range performs better than without EMA (β=0) on Office-31.
Effect of EMA forgetting rate on performance.
Setting αk to the maximum gradient norm performs best among tested strategies on NYUv2 (Figure 6).
Sensitivity analysis for gradient normalization scaling factor.
DB-MTL achieves ∆p = +1.15±0.16% on NYUv2, outperforming all baselines, including state-of-the-art methods.
Primary empirical validation on the scene understanding task.
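The gradient-conflict finding above is stated in terms of cosine similarity between task gradients. A minimal sketch of that measurement follows, assuming non-zero gradients stored as plain lists; the helper names and the example vectors are hypothetical.

```python
import math

def cosine_similarity(g1, g2):
    # Cosine of the angle between two gradient vectors (both assumed non-zero).
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    return dot / (n1 * n2)

def in_conflict(g1, g2):
    # Two task gradients conflict when they point in opposing directions.
    return cosine_similarity(g1, g2) < 0.0

aligned = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel gradients
opposed = cosine_similarity([1.0, 0.0], [-1.0, 0.5])  # opposing gradients
print(aligned, opposed, in_conflict([1.0, 0.0], [-1.0, 0.5]))
```

Tracking this quantity per pair of tasks over training is one way to observe whether similarity stays positive, as the finding reports for Office-31.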

Claims (14)

The proposed gradient-magnitude balancing method consistently outperforms GradNorm, as it guarantees equal gradient magnitudes and considers update magnitude.
Advantage over GradNorm.
Loss-scale balancing and gradient-magnitude balancing are complementary and combining them achieves the best performance.
Ablation conclusion.
IMTL-L is equivalent to the logarithm transformation when its parameter s_t is the exact minimizer in each iteration.
Mathematical relationship between IMTL-L and log transformation.
DB-MTL is a simple yet effective method that addresses both loss-scale and gradient-magnitude imbalances.
Core claim of the paper.
The logarithm transformation is simpler and more effective than IMTL-L because it is parameter-free.
Comparison of loss-scale balancing techniques.
Setting the scaling factor αk of the aggregated gradient to the maximum gradient norm performs best for task balancing.
Empirical finding on the choice of αk in the gradient normalization strategy.
The magnitude of the normalized gradients (choice of αk) plays an important role in performance.
Insight about gradient normalization scaling.
Task balancing requires simultaneous consideration of both loss scales and gradient magnitudes.
Core interpretive position of DB-MTL: complementarity of the loss and gradient perspectives.
The logarithm transformation is simpler and more effective than a learnable loss transformation.
Compared with IMTL-L, it is parameter-free, adds no computational cost, and achieves the same theoretical goal.
The logarithm transformation also benefits existing gradient balancing methods.
Generalization of the loss transformation.
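The claims about αk concern only the size of the update: once every task gradient is normalized to unit length, each strategy yields the same direction and differs in the scale applied to it. The sketch below compares the strategies named in this summary (maximum, minimum, mean, median, and a constant). The function names, the stabilizer `eps`, and the example gradients are hypothetical.

```python
import math
import statistics

def norm(v):
    # Euclidean norm of a gradient stored as a list of floats.
    return math.sqrt(sum(x * x for x in v))

def scale_factor(norms, strategy):
    # strategy is "max", "min", "mean", "median", or a numeric constant.
    if strategy == "max":
        return max(norms)
    if strategy == "min":
        return min(norms)
    if strategy == "mean":
        return statistics.mean(norms)
    if strategy == "median":
        return statistics.median(norms)
    return float(strategy)

def normalized_update(grads, strategy, eps=1e-8):
    # Sum of unit-normalized task gradients, rescaled by the chosen alpha.
    norms = [norm(g) for g in grads]
    alpha = scale_factor(norms, strategy)
    dim = len(grads[0])
    return [alpha * sum(g[i] / (n + eps) for g, n in zip(grads, norms))
            for i in range(dim)]

# Three hypothetical task gradients with norms 3.0, 0.2, and 0.5.
grads = [[3.0, 0.0], [0.0, 0.2], [0.3, 0.4]]
update_norms = {s: norm(normalized_update(grads, s))
                for s in ("max", "min", "mean", "median", 1.0)}
print(update_norms)
```

Here the maximum-norm strategy produces an update 15 times larger than the minimum-norm strategy while pointing in the same direction, so the choice of αk acts like a per-iteration step-size rule.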

Hypotheses (1)

When task gradient norms differ greatly, large-norm tasks have not converged while small-norm tasks have nearly converged.
Motivates setting αk to the maximum norm to enable further learning on under-converged tasks.

Questions (1)

Task balancing problem
Core challenge where disparity in loss and gradient scales among tasks leads to performance compromises.

Original abstract

Multi-task learning aims to learn multiple related tasks simultaneously and has achieved great success in various fields. However, the disparity in loss and gradient scales among tasks often leads to performance compromises, and the balancing of tasks remains a significant challenge. In this paper, we propose Dual-Balancing Multi-Task Learning (DB-MTL) to achieve task balancing from both the loss and gradient perspectives. Specifically, DB-MTL achieves loss-scale balancing by performing logarithm transformation on each task loss, and rescales gradient magnitudes by normalizing all task gradients to comparable magnitudes using the maximum gradient norm. Extensive experiments on a number of benchmark datasets demonstrate that DB-MTL consistently performs better than the current state-of-the-art.
