Regularization is typically part of the solution of an inverse problem. In Machine Learning, however, regularization is part of the problem itself, since the desired solution does not typically minimize the data-dependent loss function. I will discuss both implicit and explicit regularization in deep learning, and present experiments that illustrate their mechanisms of action. Implicit regularization includes a variety of inductive biases in the optimization and in the choice of class of functions (architecture) selected. I will discuss the connection between these forms of regularization and generalization, through the notion of 'Information in the Weights'. Explicit regularization includes weight decay and data augmentation. While typically these are understood as altering the landscape of local extrema to which the model eventually converges, I will show experiments that challenge this view: removing regularization after an initial transient period has little effect on generalization, even though the final loss landscape is the same as if there had been no regularization. In some cases, generalization even improves after interrupting regularization. Conversely, if regularization is applied only after the initial transient, it has no effect on the final solution, whose generalization gap is as bad as if regularization had never happened. This suggests that what matters for training deep networks is not just whether or how, but when to regularize. These phenomena collectively suggest that there is a "critical period" for regularizing deep networks that is decisive for the final performance.
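The on/off scheduling of explicit regularization described above can be made concrete with a minimal sketch (not the authors' experiments): gradient descent on a 1-D quadratic loss with L2 weight decay that is switched off after an initial number of steps. Note that in this convex toy, unlike in deep networks, the final solution does not depend on when the decay is removed; the sketch only illustrates the schedule itself.

```python
# Toy sketch of timed regularization: minimize L(w) = (w - 3)^2 by
# gradient descent, with an L2 penalty (weight decay) that can be
# turned off after an initial "transient" of steps.

def train(steps, lr=0.1, weight_decay=0.1, decay_off_after=None):
    """Gradient descent; disable weight decay after `decay_off_after` steps."""
    w = 0.0
    for t in range(steps):
        grad = 2.0 * (w - 3.0)  # gradient of the data-dependent loss
        # Apply the L2 penalty only during the initial period (if one is set).
        if decay_off_after is None or t < decay_off_after:
            grad += 2.0 * weight_decay * w  # gradient of weight_decay * w^2
        w -= lr * grad
    return w

# Regularized throughout: converges to the penalized minimum 3 / (1 + wd).
w_always = train(steps=500)
# Regularized only during an initial transient, then decay switched off.
w_early = train(steps=500, decay_off_after=50)
# Never regularized: converges to the unregularized minimum w = 3.
w_never = train(steps=500, weight_decay=0.0)
```

In this convex setting `w_early` and `w_never` coincide, which is exactly the landscape-based intuition the experiments above challenge for deep networks.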
JOINT WORK WITH: Aditya Golatkar and Alessandro Achille