Stochastic gradient descent (SGD) is widely believed to perform implicit regularization when used to train deep neural networks. The precise manner in which this occurs has thus far been elusive. I will show that SGD solves a variational optimization problem: it minimizes an average potential over the posterior distribution of weights along with an entropic regularization term.
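To make the variational statement concrete, one way to write it as a formula is sketched below. The notation is mine, not fixed by the abstract: $\rho$ is a distribution over weights $x$, $\Phi$ is the potential being averaged, and $\beta^{-1}$ is the weight of the entropic term (an effective temperature, which in this setting scales roughly with the ratio of learning rate to batch size).

```latex
% Sketch of the variational problem described above (my notation; assumes
% the amsmath/amssymb packages). SGD's stationary distribution rho* trades
% off the average potential against an entropy bonus with weight 1/beta.
\begin{equation*}
  \rho^{*} \;=\; \arg\min_{\rho}\;
  \mathbb{E}_{x \sim \rho}\bigl[\Phi(x)\bigr] \;-\; \beta^{-1} H(\rho),
  \qquad
  H(\rho) \;=\; -\int \rho(x)\,\log \rho(x)\,dx .
\end{equation*}
```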
I will show that, because the mini-batch gradient noise of deep networks is highly non-isotropic, the above potential is not the original loss function: SGD minimizes a different loss than the one it computes its gradients on. More surprisingly, SGD does not even converge in the classical sense: the most likely trajectories of SGD for deep networks do not behave like Brownian motion around critical points. Instead, they are limit cycles, with deterministic dynamics in the weight space, far away from the critical points of the original loss. I will also discuss connections between this “out-of-equilibrium” behavior of SGD and the generalization performance of deep networks.
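The out-of-equilibrium picture can be illustrated on a toy problem. The sketch below is my own illustration, not the construction from the paper: a continuous-time caricature of noisy gradient descent on a quadratic loss, with either isotropic or non-isotropic noise. The SDE, the 2x2 matrices, and the “signed area” diagnostic are arbitrary illustrative choices; they only show the qualitative effect the abstract describes, namely that non-isotropic noise breaks detailed balance and produces persistent rotation around the minimum instead of Brownian-motion-like jitter.

```python
# Toy illustration (a generic sketch, not the paper's construction): simulate
#     dx = -H x dt + sqrt(2/beta) C dW,   noise covariance D = C C^T,
# for the quadratic loss f(x) = 0.5 x^T H x. With isotropic noise the
# stationary state satisfies detailed balance and the trajectory just jitters
# around the minimum; with non-isotropic noise (D not commuting with H) there
# is a persistent probability current, visible as a systematic rotation.
# Rotation is measured as the average rate of signed area swept around the
# minimum. All constants are arbitrary choices for the demo.
import numpy as np

H = np.array([[3.0, 0.0],
              [0.0, 1.0]])      # Hessian of the quadratic loss
beta = 10.0                     # inverse temperature
dt, n_steps = 2e-3, 500_000

def mean_rotation(C):
    """Euler-Maruyama for dx = -H x dt + sqrt(2 dt / beta) * C N(0, I)."""
    rng = np.random.default_rng(0)        # same noise stream for both runs
    x = np.array([1.0, 0.0])
    swept = 0.0                           # accumulates x1*dx2 - x2*dx1
    scale = np.sqrt(2.0 * dt / beta)
    for _ in range(n_steps):
        dx = -H @ x * dt + scale * (C @ rng.standard_normal(2))
        swept += x[0] * dx[1] - x[1] * dx[0]
        x = x + dx
    return swept / (n_steps * dt)         # signed area swept per unit time

isotropic = np.eye(2)
anisotropic = np.array([[1.0, 0.00],
                        [0.9, 0.45]])     # D = C C^T is non-isotropic

print("isotropic noise:   rotation rate ~ %+.3f" % mean_rotation(isotropic))
print("anisotropic noise: rotation rate ~ %+.3f" % mean_rotation(anisotropic))
```

In this toy setup the isotropic run prints a rotation rate near zero, while the anisotropic run prints a clearly non-zero value: the trajectory circulates around the minimum rather than diffusing reversibly, which is the kind of non-Brownian, cycling behavior the abstract refers to.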
arXiv: https://arxiv.org/abs/1710.11029