Hyperparameter
Computer science
Generalization
Iterated function
Stochastic gradient descent
Artificial intelligence
Exponential function
Robustness (evolution)
Deep learning
Regularization (linguistics)
Generalization error
Machine learning
Applied mathematics
Mathematics
Artificial neural network
Mathematical analysis
Gene
Biochemistry
Chemistry
Authors
Daniel Morales Brotons, Thijs Vogels, Hadrien Hendrikx
Source
Journal: Cornell University - arXiv
Date: 2024-11-27
Citations: 8
Identifier
DOI: 10.48550/arxiv.2411.18704
Abstract
Weight averaging of Stochastic Gradient Descent (SGD) iterates is a popular method for training deep learning models. While it is often used as part of complex training pipelines to improve generalization or serve as a 'teacher' model, weight averaging lacks proper evaluation on its own. In this work, we present a systematic study of the Exponential Moving Average (EMA) of weights. We first explore the training dynamics of EMA, give guidelines for hyperparameter tuning, and highlight its good early performance, partly explaining its success as a teacher. We also observe that EMA requires less learning rate decay compared to SGD since averaging naturally reduces noise, introducing a form of implicit regularization. Through extensive experiments, we show that EMA solutions differ from last-iterate solutions. EMA models not only generalize better but also exhibit improved i) robustness to noisy labels, ii) prediction consistency, iii) calibration and iv) transfer learning. Therefore, we suggest that an EMA of weights is a simple yet effective plug-in to improve the performance of deep learning models.
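To make the idea concrete, the following is a minimal sketch of maintaining an EMA of weights alongside plain SGD, using the standard update ema <- decay * ema + (1 - decay) * w. It is not the authors' implementation; the function names, the toy loss, and the chosen decay/learning-rate values are illustrative assumptions only.

```python
# Hypothetical sketch: SGD training loop that also keeps an EMA copy of the weights.
import numpy as np

def sgd_with_ema(w0, grad_fn, lr=0.05, decay=0.99, steps=500):
    """Run plain SGD from w0 while maintaining an EMA of the iterates.

    grad_fn(w) returns a (stochastic) gradient at w.
    EMA update: ema <- decay * ema + (1 - decay) * w.
    """
    w = np.array(w0, dtype=float)   # "last iterate" trained by SGD
    ema = w.copy()                  # averaged weights (the would-be "teacher")
    for _ in range(steps):
        w -= lr * grad_fn(w)                    # standard SGD step
        ema = decay * ema + (1.0 - decay) * w   # EMA step, no extra gradients
    return w, ema

# Toy usage: noisy quadratic loss 0.5 * ||w||^2 with gradient noise.
rng = np.random.default_rng(0)
noisy_grad = lambda w: w + rng.normal(scale=0.5, size=w.shape)
last, averaged = sgd_with_ema(np.ones(3), noisy_grad)
print("last iterate:", last)
print("EMA iterate: ", averaged)
```

In this toy setting the EMA iterate is typically closer to the optimum than the last SGD iterate, which loosely mirrors the abstract's point that averaging reduces noise and acts as implicit regularization.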