finding

active

finding:setting-k-to-the-maximum-gradient-norm-performs-best-among-tested-strategies-on-nyuv2-figure-6

Setting α_k to the maximum gradient norm among tasks performs best among the tested strategies on NYUv2 (Figure 6).

Sensitivity analysis for gradient normalization scaling factor.
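
To make the role of the scaling factor concrete, here is a minimal sketch, assuming the aggregation works by rescaling each task gradient to unit norm, summing the results, and multiplying by α_k. The function name `aggregate_normalized_gradients`, the `eps` guard, and the "mean" and "min" strategies are illustrative assumptions, not taken from the source paper, which may differ in detail (for example in how gradients are smoothed before normalization).

```python
import numpy as np


def aggregate_normalized_gradients(task_grads, strategy="max", eps=1e-8):
    """Rescale each task gradient to unit norm, sum them, and multiply by alpha.

    strategy picks alpha from the per-task gradient norms. "max" mirrors the
    finding above; "mean" and "min" are hypothetical alternatives included
    only for comparison.
    """
    grads = [np.asarray(g, dtype=float) for g in task_grads]
    norms = np.array([np.linalg.norm(g) for g in grads])

    # After this step every task contributes a gradient of (nearly) unit norm.
    unit_grads = [g / (n + eps) for g, n in zip(grads, norms)]

    if strategy == "max":
        alpha = norms.max()
    elif strategy == "mean":
        alpha = norms.mean()
    elif strategy == "min":
        alpha = norms.min()
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    return alpha * np.sum(unit_grads, axis=0), float(alpha)


# Two toy task gradients with norms 5 and 1.
update, alpha = aggregate_normalized_gradients([[3.0, 4.0], [0.0, 1.0]])
```

With the "max" strategy the combined update keeps the scale of the largest task gradient, whereas "min" would shrink it to the scale of the smallest one. That difference in update magnitude is what the sensitivity analysis varies.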

Source paper

extracted_from

Dual-Balancing for Multi-Task Learning

(2023) · Baijiong Lin · Weisen Jiang · Feiyang Ye · Yu Zhang +5

Neighborhood — ranked by edge-count

Claims (2)

claim

Setting α_k as the maximum gradient norm among tasks performs best.
supports
Recommended strategy for gradient normalization.
The magnitude of the normalized gradients (choice of α_k) plays an important role in performance.
supports
Insight about gradient normalization scaling.

Communities (2)

community

Dual-balancing multi-task learning
members_of
DB-MTL jointly balances loss scale and gradient magnitude, benchmarked on NYUv2 and Office-31.
Gradient norm scaling for multitask learning
members_of
Investigates optimal gradient balancing strategies across tasks, finding maximum gradient norm normalization outperforms alternatives in multitask optimization.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

DB-MTL training losses decrease smoothly and gradient norms are lower than EW on NYUv2, indicating training stability. · finding · 0.790
Training stability analysis.
Setting aggregated gradient scaling factor to maximum gradient norm performs best for task balancing · claim · 0.770
Empirical finding on choice of α_k in gradient normalization strategy
The gradient-magnitude balancing method outperforms GradNorm on NYUv2, Cityscapes, Office-31, Office-Home. · finding · 0.759
Comparison of gradient-magnitude balancing with GradNorm.
The proposed gradient-magnitude balancing method consistently outperforms GradNorm, as it guarantees equal gradient magnitudes and considers update magnitude. · claim · 0.755
Advantage over GradNorm.
Combining loss-scale and gradient-magnitude balancing achieves Δp = +1.15±0.16 on NYUv2. · finding · 0.752
Full DB-MTL ablation result.
If EI maximization is used as a regularization in representation learning, then OOD generalization will improve beyond current invariant risk minimization methods. · hypothesis · 0.751
Proposed conjecture in §4.3.1.
Software implementations for all of the models/behaviours presented are common for n = 2, and can be made very efficient for α_i that map many objects onto a much smaller set of object families. · claim · 0.749
Claim about current practical feasibility and efficiency of 2-way associative implementations.
DB-MTL has similar per-epoch running time to gradient balancing methods on NYUv2, slower than loss balancing methods. · finding · 0.745
Computational efficiency comparison.