concept
active
concept:maximum-gradient-norm-scaling
Maximum gradient norm scaling
Scaling the aggregated gradient by the maximum gradient norm among tasks.
Neighborhood — ranked by edge-count
Methods (1)
method
- The proposed method combining loss-scale balancing via logarithm transformation and gradient-magnitude balancing via maximum-norm normalization.
Concepts (1)
concept
- gradient-magnitude balancing · associated_with · Addressing disparity in gradient magnitudes across tasks at the gradient level
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Training-free technique normalizing all task gradients to the maximum gradient norm magnitude
- Setting the aggregated gradient scaling factor to the maximum gradient norm performs best for task balancing · claim · 0.827 · Empirical finding on the choice of αk in the gradient normalization strategy
- Recommended strategy for gradient normalization.
- When task gradient norms differ greatly, large-norm tasks have not converged while small-norm tasks have nearly converged · hypothesis · 0.753 · Motivates setting αk = max norm to enable further learning on under-converged tasks
- Evaluation of learned circuits on grids 4x larger with 4x more steps than training conditions
- Setting αk to the maximum gradient norm performs best among tested strategies on NYUv2 (Figure 6). · finding · 0.744 · Sensitivity analysis for the gradient normalization scaling factor.
- Hypothesis cited in paper suggesting deceptive capabilities may scale with model size
- Observation that SAE loss decreases as a power law with compute budget.
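
Illustration: the scaling described by this concept can be sketched as follows. This is a minimal sketch under stated assumptions, not the source method's implementation: each task gradient is treated as a flat vector, the scaling factor αk is taken as the largest L2 norm among the task gradients, and the rescaled gradients are aggregated by a plain sum. The function names, the `eps` guard, and the choice of summation are hypothetical and do not come from the entries above.

```python
import math
from typing import List, Sequence


def l2_norm(vec: Sequence[float]) -> float:
    """Euclidean norm of a flat gradient vector."""
    return math.sqrt(sum(x * x for x in vec))


def max_norm_scaled_gradient(
    task_grads: Sequence[Sequence[float]],
    eps: float = 1e-12,  # assumption: small guard against division by zero
) -> List[float]:
    """Rescale every task gradient to the largest task-gradient norm, then sum.

    Assumption: aggregation is a plain sum; the source entries do not
    specify whether a sum or a mean is used.
    """
    norms = [l2_norm(g) for g in task_grads]
    alpha = max(norms)  # scaling factor: maximum gradient norm among tasks

    combined = [0.0] * len(task_grads[0])
    for grad, norm in zip(task_grads, norms):
        scale = alpha / (norm + eps)  # bring this task's gradient up to alpha
        for i, x in enumerate(grad):
            combined[i] += scale * x
    return combined


# Two hypothetical tasks whose gradient norms are 5.0 and 0.5.
grads = [[3.0, 4.0], [0.0, 0.5]]
print(max_norm_scaled_gradient(grads))  # approximately [3.0, 9.0]
```

In this toy case the small-norm task is scaled up tenfold, so both tasks contribute gradients of equal magnitude to the update.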