Modern neural network architectures often comprise thousands of hidden units and
millions of weights, which are trained via gradient descent (GD) or stochastic gradient descent (SGD).
In physics, systems with a large number of degrees of freedom often admit a simplified (macroscopic)
description. Is there an analogous macroscopic description of the dynamics of multi-layer neural networks?
I will focus on the case of two-layer (one-hidden-layer) fully connected networks, and will discuss two
specific ways to take the large system limit. These mathematical constructions capture two regimes
of the learning process:
1) The lazy regime, in which the network essentially behaves as a linear random features model;
2) The mean field regime, in which the network follows a genuinely non-linear dynamics and learns good
representations of the data.
I will compare the two regimes, and discuss for which learning tasks we expect to see a separation between them.
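To fix ideas, here is a minimal sketch of how the two regimes arise from different scalings of a two-layer network with N hidden units (the precise normalization used in the talk may differ):

    f_N(x) = \frac{\alpha_N}{N} \sum_{i=1}^{N} a_i \, \sigma( \langle w_i, x \rangle ),

where \sigma is the activation and (a_i, w_i) are the parameters trained by GD/SGD. Taking \alpha_N = 1 and N large leads to the mean field description, in which the dynamics is tracked through the empirical distribution of the weights; taking \alpha_N = \sqrt{N} (with standard order-one initialization) leads to the lazy regime, in which the weights move very little and the model is effectively a linear (kernel / random features) method.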
[Based on joint work with Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz]