The success of deep learning is due, to a great extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. I will discuss some general mathematical principles that allow for efficient optimization in over-parameterized systems of non-linear equations, a setting that includes deep neural networks. In particular, the optimization problems corresponding to such systems are not convex, even locally, but instead satisfy the Polyak-Łojasiewicz (PL) condition, which allows for efficient optimization by gradient descent or SGD. We connect the PL condition of these systems to the condition number associated with the tangent kernel and develop a non-linear theory parallel to classical analyses of over-parameterized linear equations. We discuss how these ideas apply to training shallow and deep neural networks.
Finally, I will discuss how our analysis gives a different perspective on the recent ideas around Neural Tangent Kernels.
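For orientation, one standard way to state the PL condition and its link to the tangent kernel is sketched below. The notation (F, w, y, the square loss, and the constants μ, β, η) is assumed here for illustration and is not taken from the abstract; the talk's precise formulation may differ.

```latex
% Assumed setting: solve F(w) = y with F : R^m -> R^n and m >= n
% (over-parameterized), by minimizing the square loss.
\begin{align*}
  \mathcal{L}(w) &= \tfrac{1}{2}\,\|F(w) - y\|^2 .
\end{align*}

% PL condition with constant mu > 0, for a loss whose infimum is zero:
\begin{align*}
  \tfrac{1}{2}\,\|\nabla \mathcal{L}(w)\|^2 &\ge \mu\,\mathcal{L}(w).
\end{align*}

% Tangent kernel K(w) = DF(w) DF(w)^T (an n x n matrix).
% Since grad L(w) = DF(w)^T (F(w) - y):
\begin{align*}
  \|\nabla \mathcal{L}(w)\|^2
    &= (F(w) - y)^\top K(w)\,(F(w) - y) \\
    &\ge \lambda_{\min}\bigl(K(w)\bigr)\,\|F(w) - y\|^2
     = 2\,\lambda_{\min}\bigl(K(w)\bigr)\,\mathcal{L}(w),
\end{align*}
% so the PL condition holds with constant mu wherever lambda_min(K(w)) >= mu.

% If L is also beta-smooth, gradient descent with step size eta = 1/beta
% contracts the loss at each step, as long as the iterates stay in the
% region where both conditions hold:
\begin{align*}
  \mathcal{L}(w_{t+1})
    &\le \mathcal{L}(w_t) - \tfrac{1}{2\beta}\,\|\nabla \mathcal{L}(w_t)\|^2
     \le \Bigl(1 - \tfrac{\mu}{\beta}\Bigr)\,\mathcal{L}(w_t).
\end{align*}
```

In this sketch the ratio β/μ plays the role of a condition number: the smaller it is, the faster the loss decays, and no convexity is needed for the argument.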
Joint work with Chaoyue Liu and Libin Zhu.