Setting αk as the maximum gradient norm among tasks performs best.

Recommended strategy for gradient normalization.
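
As an illustration of what this strategy might look like in practice, here is a minimal sketch in plain Python. It assumes that each task gradient is rescaled to the largest gradient norm among tasks (αk) and that the rescaled gradients are then summed. The function names `l2_norm` and `max_norm_aggregate`, the `eps` guard, and the toy gradients are hypothetical and are not taken from the paper. Any smoothing of gradients across iterations that the full method may use is left out.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def max_norm_aggregate(task_grads, eps=1e-8):
    """Rescale every task gradient to the maximum norm, then sum them.

    task_grads: non-empty list of equal-length gradient vectors, one per task.
    eps: hypothetical guard against division by zero for a vanishing gradient.
    Returns (aggregated_gradient, alpha), where alpha is the maximum norm.
    """
    norms = [l2_norm(g) for g in task_grads]
    alpha = max(norms)  # assumed choice: alpha_k = max gradient norm among tasks
    aggregated = [0.0] * len(task_grads[0])
    for grad, norm in zip(task_grads, norms):
        scale = alpha / (norm + eps)  # after scaling, each gradient has norm ~alpha
        for i, x in enumerate(grad):
            aggregated[i] += scale * x
    return aggregated, alpha


# Toy example: one task with a large gradient, one with a small gradient.
grads = [[3.0, 4.0], [0.0, 0.1]]
update, alpha = max_norm_aggregate(grads)
```

In this toy case the small-norm task is scaled up to the same magnitude as the large-norm task, so both contribute equally to the update direction while the overall step size follows the largest gradient.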

Source paper

extracted_from

Dual-Balancing for Multi-Task Learning

(2023) · Baijiong Lin · Weisen Jiang · Feiyang Ye · Yu Zhang +5

Neighborhood — ranked by edge-count

Findings (1)

finding

Setting αk to the maximum gradient norm performs best among tested strategies on NYUv2 (Figure 6).
supports
Sensitivity analysis for gradient normalization scaling factor.

Communities (3)

community

Dual-balancing multi-task learning
members_of
DB-MTL jointly balances loss scale and gradient magnitude, benchmarked on NYUv2 and Office-31.
Dual balancing multi-task learning
members_of
DB-MTL combines loss-scale and gradient-magnitude balancing, benchmarked across NYUv2, Cityscapes, QM9, and Office datasets.
Gradient norm scaling for multitask learning
members_of
Investigates optimal gradient balancing strategies across tasks, finding maximum gradient norm normalization outperforms alternatives in multitask optimization.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Setting aggregated gradient scaling factor to maximum gradient norm performs best for task balancing · claim · 0.832
Empirical finding on choice of αk in gradient normalization strategy
The magnitude of the normalized gradients (choice of αk) plays an important role in performance. · claim · 0.827
Insight about gradient normalization scaling.
Maximum gradient norm scaling · concept · 0.800
Scaling aggregated gradient by the maximum gradient norm among tasks.
maximum-norm gradient normalization · method · 0.787
Training-free technique normalizing all task gradients to the maximum gradient norm magnitude
When task gradient norms differ greatly, large-norm tasks have not converged while small-norm tasks have nearly converged · hypothesis · 0.760
Motivates setting αk = max norm to enable further learning on under-converged tasks
Software implementations for all of the models/behaviours presented are common for n = 2, and can be made very efficient for α_i that map many objects onto a much smaller set of object families. · claim · 0.749
Claim about current practical feasibility and efficiency of 2-way associative implementations.
The proposed gradient-magnitude balancing method consistently outperforms GradNorm, as it guarantees equal gradient magnitudes and considers update magnitude. · claim · 0.746
Advantage over GradNorm.
We hypothesize that degraded generalization on benchmarks like MMLU may reflect the computational demands of the tasks. · hypothesis · 0.744
Connecting the paper's task-difficulty findings to prior observations of weak generalization on complex QA benchmarks.