Maximum gradient norm scaling

Scaling the aggregated gradient by the maximum gradient norm among tasks.
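The description above can be illustrated with a minimal sketch in pure Python. It assumes each task gradient is a flat list of floats, normalizes every task gradient to unit L2 norm, sums them, and rescales the sum by the largest task norm. The names `l2_norm`, `max_norm_aggregate`, and `eps` are hypothetical and chosen for illustration only. Any further steps the original method applies before normalization are omitted.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def max_norm_aggregate(task_grads, eps=1e-8):
    """Aggregate per-task gradients with maximum-norm scaling (illustrative).

    task_grads: list of equal-length lists, one gradient per task.
    eps: small constant (an assumption) guarding against division by zero.
    Returns (aggregated_gradient, alpha), where alpha is the max task norm.
    """
    norms = [l2_norm(g) for g in task_grads]
    alpha = max(norms)  # scaling factor: largest gradient norm among tasks

    dim = len(task_grads[0])
    summed = [0.0] * dim
    for grad, norm in zip(task_grads, norms):
        for i, x in enumerate(grad):
            summed[i] += x / (norm + eps)  # each task contributes a unit vector

    return [alpha * s for s in summed], alpha


# Two tasks whose gradient norms differ by a factor of 50.
grads = [[3.0, 4.0], [0.0, 0.1]]
aggregated, alpha = max_norm_aggregate(grads)
```

In this example both tasks contribute with equal magnitude after normalization, so the small-norm task is no longer dominated by the large-norm one, and the result keeps the scale of the largest task gradient.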

Neighborhood — ranked by edge-count

method

DB-MTL (Dual-Balancing Multi-Task Learning)
implements
The proposed method combining loss-scale balancing via logarithm transformation and gradient-magnitude balancing via maximum-norm normalization.

concept

gradient-magnitude balancing
associated_with
Addressing disparity in gradient magnitudes across tasks at the gradient level

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

maximum-norm gradient normalization · method · 0.898
Training-free technique normalizing all task gradients to the maximum gradient norm magnitude
Setting aggregated gradient scaling factor to maximum gradient norm performs best for task balancing · claim · 0.827
Empirical finding on choice of αk in gradient normalization strategy
Setting αk as the maximum gradient norm among tasks performs best. · claim · 0.800
Recommended strategy for gradient normalization.
When task gradient norms differ greatly, large-norm tasks have not converged while small-norm tasks have nearly converged · hypothesis · 0.753
Motivates setting αk = max norm to enable further learning on under-converged tasks
Grid Scaling Generalization Test · method · 0.753
Evaluation of learned circuits on grids 4x larger with 4x more steps than training conditions
Setting αk to the maximum gradient norm performs best among tested strategies on NYUv2 (Figure 6). · finding · 0.744
Sensitivity analysis for gradient normalization scaling factor.
Inverse Scaling Law · concept · 0.743
Hypothesis cited in paper suggesting deceptive capabilities may scale with model size
Power law scaling · concept · 0.739
Observation that SAE loss decreases as a power law with compute budget.