claim
active
claim:setting-k-as-the-maximum-gradient-norm-among-tasks-performs-bestSetting αk as the maximum gradient norm among tasks performs best.
Recommended strategy for gradient normalization.
Source paper
extracted_from(2023) · Baijiong Lin · Weisen Jiang · Feiyang Ye · Yu Zhang +5
Neighborhood — ranked by edge-count
Findings (1)
finding
- Setting αk to the maximum gradient norm performs best among tested strategies on NYUv2 (Figure 6).supportsSensitivity analysis for gradient normalization scaling factor.
Communities (3)
community
- Dual-balancing multi-task learningmembers_ofDB-MTL jointly balances loss scale and gradient magnitude, benchmarked on NYUv2 and Office-31.
- Dual balancing multi-task learningmembers_ofDB-MTL combines loss-scale and gradient-magnitude balancing, benchmarked across NYUv2, Cityscapes, QM9, and Office datasets.
- Investigates optimal gradient balancing strategies across tasks, finding maximum gradient norm normalization outperforms alternatives in multitask optimization.
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Setting aggregated gradient scaling factor to maximum gradient norm performs best for task balancingclaim0.832Empirical finding on choice of αk in gradient normalization strategy
- The magnitude of the normalized gradients (choice of αk) plays an important role in performance.claim0.827Insight about gradient normalization scaling.
- Scaling aggregated gradient by the maximum gradient norm among tasks.
- Training-free technique normalizing all task gradients to the maximum gradient norm magnitude
- When task gradient norms differ greatly, large-norm tasks have not converged while small-norm tasks have nearly convergedhypothesis0.760Motivates setting αk = max norm to enable further learning on under-converged tasks
- Claim about current practical feasibility and efficiency of 2-way associative implementations.
- Advantage over GradNorm.
- We hypothesize that degraded generalization on benchmarks like MMLU may reflect the computational demands of the tasks.hypothesis0.744Connecting the paper's task-difficulty findings to prior observations of weak generalization on complex QA benchmarks.