DOI: 10.48550/arXiv.2308.12029 (2023)

Dual-Balancing for Multi-Task Learning

TL;DR

Addressing loss-scale and gradient-magnitude imbalance together yields state-of-the-art results on all five multi-task benchmarks evaluated. DB-MTL (Dual-Balancing Multi-Task Learning) achieves ∆p = +1.15% on NYUv2 with DeepLabV3+/ResNet-50 (next best: GLS at +0.30%), +8.91% on NYUv2 with SegNet (Aligned-MTL: +8.16%), −58.10% on QM9 (Nash-MTL: −73.92%), and +1.05% on Office-31. DB-MTL combines two parameter-free, training-efficient components: a logarithm transformation of each task loss, which provably coincides with IMTL-L when IMTL-L's learnable scale parameter sits at its exact minimizer, and a maximum-norm gradient normalization, which smooths each task gradient with an exponential moving average and then rescales all task gradients to the largest task-gradient norm at each iteration. On NYUv2, most competing methods, including PCGrad, CAGrad, Nash-MTL, and IMTL-G, improve semantic segmentation and depth estimation over single-task learning but degrade surface normal prediction, a failure mode DB-MTL avoids. Applied on its own, the logarithm transformation also improves six existing gradient balancing methods (PCGrad, GradVac, IMTL-G, CAGrad, Nash-MTL, Aligned-MTL) on NYUv2, and ablation shows that both components contribute on all five datasets. The authors argue that loss-scale and gradient-magnitude imbalances are complementary failure modes that neither pure loss balancing nor pure gradient balancing resolves alone, and that a parameter-free dual correction closes this gap at no extra computational cost relative to existing gradient balancing methods.

What to take away

  1. DB-MTL achieves ∆p = +1.15±0.16% on NYUv2 with DeepLabV3+/ResNet-50, outperforming all 21 baselines including the second-best GLS (+0.30%) and Nash-MTL (−1.01%).
  2. On the QM9 11-task molecular property prediction benchmark, DB-MTL achieves ∆p = −58.10±3.89%, improving over the previous best MTL method, Nash-MTL (−73.92%), though no MTL method surpasses single-task learning on this dataset.
  3. On Office-31 with ResNet-18, DB-MTL achieves an average test accuracy of 94.09±0.19% and ∆p = +1.05±0.20%; it is the only method besides IGBv2 (+0.56%) to outperform STL (93.03% average accuracy).
  4. The logarithm transformation is provably equivalent to IMTL-L's parametric loss rescaling when IMTL-L's learnable parameter s_t reaches its exact per-iteration minimizer (Proposition 3.1: log(x) = min_s e^s·x − s − 1), making DB-MTL's loss component parameter-free by construction.
  5. Ablation on five datasets (Table 5) shows that combining both components strictly dominates using either alone: on NYUv2, loss-only gives ∆p = +0.06%, gradient-only gives +0.76%, and both give +1.15%; on QM9 the pattern is −74.40%, −65.73%, and −58.10%, respectively.
  6. The choice of gradient scaling factor αk is critical: setting αk to the maximum task gradient norm outperforms constant scalings (0.01–10) and the minimum, mean, and median strategies by large margins on NYUv2 (Figure 6), and only the maximum-norm strategy achieves positive ∆p.
  7. Applying the logarithm transformation to six existing gradient balancing methods (PCGrad, GradVac, IMTL-G, CAGrad, Nash-MTL, Aligned-MTL) consistently improves their ∆p on NYUv2, but DB-MTL still surpasses all of these augmented baselines.
  8. A replicable methodology choice: gradient estimates use an exponential moving average (EMA) whose forgetting rate β is selected per dataset by grid search over {0.1, 0.5, 0.9, 0.1/k^0.5, 0.5/k^0.5, 0.9/k^0.5}, where k is the iteration index; ablation on Office-31 shows ∆p is robust for β ∈ {0.1/k^0.5, …, 0.9/k^0.5} but degrades at β = 0 (no EMA).
  9. DB-MTL's per-epoch runtime on NYUv2 (NVIDIA RTX 3090) is comparable to other gradient balancing methods and higher than loss-only methods, because per-task gradients must be computed at each iteration; this acknowledged limitation is shared with GradNorm, CAGrad, Nash-MTL, and others.
  10. The paper leaves two questions open: whether incorporating gradient variance (in addition to gradient magnitude) would yield further improvements, and whether convergence guarantees and optimality conditions can be derived for the dual-balancing update rule.
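
The identity quoted from Proposition 3.1 in item 4 can be checked with elementary calculus. The lines below are a short verification sketch, not the paper's own proof; the name f is introduced here purely for illustration, and x > 0 is assumed so that the logarithm is defined.

```latex
\begin{align*}
f(s) &= e^{s}x - s - 1, \qquad x > 0,\\
f'(s) &= e^{s}x - 1 = 0 \;\Longrightarrow\; s^{\ast} = -\log x,\\
f''(s) &= e^{s}x > 0 \quad\text{(so $s^{\ast}$ is the global minimizer)},\\
\min_{s} f(s) &= f(s^{\ast}) = e^{-\log x}\,x + \log x - 1 = \log x .
\end{align*}
```

Read this way, the logarithm of a task loss is what the rescaled loss reduces to when s is set to its minimizer at every iteration, which is why no scale parameter needs to be learned.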

Peer brief — for seminar discussion

DB-MTL addresses the task-balancing problem in multi-task learning by combining two lightweight, parameter-free operations: a logarithm transformation applied to each task's loss before backpropagation, and a maximum-norm gradient normalization that rescales all task-specific gradients, estimated via exponential moving average, to the largest task-gradient norm at each iteration. The method is evaluated on five benchmarks: NYUv2 (3-task scene understanding, 795 training images), Cityscapes (2-task, 2975 training images), QM9 (11-task molecular property prediction, 110k training samples), Office-31 (3-task image classification), and Office-Home (4-task image classification). Depending on the domain, the shared encoder is DeepLabV3+/ResNet-50, SegNet, ResNet-18, or a graph neural network. Against 21 baselines spanning loss-balancing (UW, DWA, IMTL-L, IGBv2), gradient-balancing (GradNorm, PCGrad, Nash-MTL, Aligned-MTL, MoCo, among others), and hybrid methods (IMTL), DB-MTL achieves the best average relative improvement ∆p on all five datasets, including +1.15% on NYUv2 versus GLS's +0.30%, +8.91% on NYUv2/SegNet versus Aligned-MTL's +8.16%, −58.10% on QM9 versus Nash-MTL's −73.92%, and +1.05% on Office-31. The load-bearing finding is that on NYUv2 most competing methods improve the easier tasks (segmentation, depth) while degrading surface normal prediction, a canonical symptom of task imbalance, whereas DB-MTL is the only method to match single-task learning on surface normal prediction while remaining competitive on the other tasks. The paper also shows that the logarithm transformation, used as a standalone plug-in, improves six existing gradient balancing methods, but the full dual approach remains superior.
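
The two operations described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `dual_balanced_direction`, the stabilizer `eps`, and the EMA form `beta * old + (1 - beta) * new` (chosen so that β = 0 disables smoothing) are all assumptions, and the per-task gradients are supplied by hand rather than computed by backpropagation.

```python
import numpy as np

def dual_balanced_direction(task_losses, task_grads, ema_grads, beta, eps=1e-8):
    """One dual-balancing step on the shared parameters (illustrative sketch).

    task_losses: shape (T,), positive raw task losses
    task_grads:  shape (T, D), gradients of the raw losses w.r.t. shared params
    ema_grads:   shape (T, D), running EMA of the log-loss gradients
    Returns (update_direction, new_ema_grads).
    """
    losses = np.asarray(task_losses, dtype=float)
    grads = np.asarray(task_grads, dtype=float)

    # Loss-scale balancing: grad of log(loss) equals grad(loss) / loss.
    log_grads = grads / (losses[:, None] + eps)

    # EMA smoothing of each task gradient (assumed form; beta = 0 means no EMA).
    new_ema = beta * np.asarray(ema_grads, dtype=float) + (1.0 - beta) * log_grads

    # Gradient-magnitude balancing: rescale every task gradient to the
    # largest task-gradient norm, then sum.
    norms = np.linalg.norm(new_ema, axis=1)
    alpha = norms.max()
    unit = new_ema / (norms[:, None] + eps)
    return alpha * unit.sum(axis=0), new_ema

# Hypothetical two-task example with mismatched loss and gradient scales.
ema0 = np.zeros((2, 2))
direction, ema1 = dual_balanced_direction(
    task_losses=[4.0, 0.5],
    task_grads=[[8.0, 0.0], [0.0, 0.5]],
    ema_grads=ema0,
    beta=0.0,
)
print(direction)
```

In this toy case the raw gradients differ in norm by a factor of 16, yet after both steps each task contributes a component of equal magnitude to the update. In a real training loop the task-specific heads would be updated separately, and the per-task gradients would come from one backward pass per task, which is the runtime cost noted in the takeaways.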
An alternative the authors could have used for loss balancing is IMTL-L's learned rescaling. They prove that IMTL-L recovers the logarithm transformation only at its exact per-iteration minimum, which makes IMTL-L strictly more expensive, and they find it empirically worse. The paper raises the possibility that gradient variance, currently unaddressed, would offer further improvement if incorporated into the normalization strategy. A critical reader would push back on the reliance on ∆p, a single scalar that averages relative improvements across tasks with very different metrics and scales, as the primary performance criterion: this aggregation can mask large per-task regressions and makes it difficult to assess whether DB-MTL genuinely resolves task conflict or simply rebalances which tasks bear the cost of shared-parameter compromise. The scope is also limited to hard-parameter sharing architectures, leaving open whether dual balancing transfers to soft-sharing or mixture-of-experts designs, where gradient interference manifests differently.
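
The concern about ∆p can be made concrete with a small calculation. The sketch below assumes the form of ∆p commonly used in multi-task learning papers (the mean over tasks of the mean signed relative change versus single-task learning, with the sign flipped for lower-is-better metrics); the summary above does not spell the formula out, so treat that definition as an assumption. The function name `delta_p` and all numbers are hypothetical and are not results from the paper.

```python
import numpy as np

def delta_p(mtl_metrics, stl_metrics, lower_is_better):
    """Average relative improvement over STL, in percent (assumed definition).

    Each argument holds one entry per task; each entry is a list with one
    value per metric of that task.
    Returns (overall_delta_p, per_task_delta_p).
    """
    per_task = []
    for mtl, stl, lower in zip(mtl_metrics, stl_metrics, lower_is_better):
        mtl = np.asarray(mtl, dtype=float)
        stl = np.asarray(stl, dtype=float)
        # Flip the sign for metrics where a smaller value is better.
        sign = np.where(np.asarray(lower, dtype=bool), -1.0, 1.0)
        per_task.append(float(np.mean(sign * (mtl - stl) / stl)) * 100.0)
    return float(np.mean(per_task)), per_task

# Hypothetical three-task example: two tasks improve by 10%, one regresses by 15%.
overall, per_task = delta_p(
    mtl_metrics=[[55.0], [0.45], [23.0]],
    stl_metrics=[[50.0], [0.50], [20.0]],
    lower_is_better=[[False], [True], [True]],
)
print(overall, per_task)
```

Here the aggregate comes out positive even though the third task is clearly worse than its single-task baseline, which is why per-task numbers are worth reading alongside any reported ∆p.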

  • task balancing problem

    Core challenge where disparity in loss and gradient scales among tasks leads to performance compromises

Original abstract

Multi-task learning aims to learn multiple related tasks simultaneously and has achieved great success in various fields. However, the disparity in loss and gradient scales among tasks often leads to performance compromises, and the balancing of tasks remains a significant challenge. In this paper, we propose Dual-Balancing Multi-Task Learning (DB-MTL) to achieve task balancing from both the loss and gradient perspectives. Specifically, DB-MTL achieves loss-scale balancing by performing logarithm transformation on each task loss, and rescales gradient magnitudes by normalizing all task gradients to comparable magnitudes using the maximum gradient norm. Extensive experiments on a number of benchmark datasets demonstrate that DB-MTL consistently performs better than the current state-of-the-art.
