A new study by researchers at MIT and Brown University characterizes several properties that emerge during the training of deep classifiers, a type of artificial neural network commonly used for classification tasks such as image classification, speech recognition, and natural language processing.
The paper, “Dynamics in deep classifiers trained with quadratic loss: Normalization, low rank, neural collapse, and the limits of generalization,” published today in the journal, is the first of its kind to theoretically explore the dynamics of training deep classifiers with the quadratic loss and how properties such as rank minimization, neural collapse, and duality between the activations of neurons and the weights of the layers are related to one another.
In the study, the authors focused on two types of deep classifiers: fully connected deep networks and convolutional neural networks (CNNs).
A previous study investigated the structural properties that develop in large neural networks in the final stages of training. That study focused on the last layer of the network and found that deep networks trained to fit a training dataset will eventually reach a state known as “neural collapse.” When neural collapse occurs, the network maps multiple examples of a particular class (such as images of cats) onto a single template for that class. Ideally, the templates for each class should be as far apart from one another as possible, allowing the network to accurately classify new examples.
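To make that idea concrete, here is a minimal sketch of one way neural collapse can be quantified. This is a toy construction of our own, not code or data from the study: if each class has collapsed onto a single template, the scatter of last-layer features within a class becomes negligible compared to the scatter between the class templates.

```python
# Toy sketch (not from the paper): quantifying neural collapse on last-layer features.
import numpy as np

rng = np.random.default_rng(0)
num_classes, per_class, dim = 3, 100, 16

# Fabricated last-layer features: each class clustered tightly around its own template.
templates = rng.normal(size=(num_classes, dim))
features = np.concatenate(
    [templates[c] + 0.01 * rng.normal(size=(per_class, dim)) for c in range(num_classes)]
)
labels = np.repeat(np.arange(num_classes), per_class)

# Collapse means within-class scatter shrinks toward zero relative to the
# scatter between class templates.
global_mean = features.mean(axis=0)
class_means = np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])
within = np.mean([(features[labels == c] - class_means[c]).var() for c in range(num_classes)])
between = ((class_means - global_mean) ** 2).mean()
print(f"within/between scatter ratio: {within / between:.6f}")  # near zero under collapse
```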
The MIT group, based at the MIT Center for Brains, Minds and Machines, studied the conditions under which networks reach neural collapse. Deep networks trained with three common ingredients, stochastic gradient descent (SGD), weight decay regularization (WD), and weight normalization (WN), will exhibit neural collapse if they are trained to fit their training data. The MIT group took a theoretical approach (as compared to the empirical approach of the earlier study), proving that neural collapse emerges from minimizing the quadratic loss using SGD, WD, and WN.
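As a rough illustration of that training recipe, the sketch below wires the three ingredients together in PyTorch: SGD with weight decay, weight normalization applied to every layer, and a quadratic (mean squared error) loss on one-hot targets instead of the usual cross-entropy. It is a hypothetical minimal setup, not the authors' code, and the architecture and hyperparameters are placeholders.

```python
# Minimal sketch (not the authors' code): a deep classifier trained with
# SGD, weight decay (WD), and weight normalization (WN) under a quadratic loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, dim = 10, 32
model = nn.Sequential(
    nn.utils.weight_norm(nn.Linear(dim, 64)), nn.ReLU(),    # WN on each layer
    nn.utils.weight_norm(nn.Linear(64, 64)), nn.ReLU(),
    nn.utils.weight_norm(nn.Linear(64, num_classes)),
)
opt = torch.optim.SGD(model.parameters(), lr=0.05, weight_decay=5e-4)  # SGD + WD

x = torch.randn(256, dim)                                   # stand-in training data
y = F.one_hot(torch.randint(0, num_classes, (256,)), num_classes).float()

for step in range(1000):
    opt.zero_grad()
    loss = F.mse_loss(model(x), y)   # quadratic loss rather than cross-entropy
    loss.backward()
    opt.step()
```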
Co-author and MIT McGovern Institute postdoc Akshay Rangamani states: “Our analysis shows that neural collapse emerges from the minimization of the quadratic loss with highly expressive deep neural networks. It also highlights the key roles played by weight decay regularization and stochastic gradient descent in driving solutions toward neural collapse.”
Weight decay is a regularization technique that prevents the network from overfitting the training data by reducing the magnitude of the weights. Weight normalization scales the weight matrices of a network so that they are of similar scale. Low rank refers to a property of a matrix in which it has a small number of non-zero singular values. Generalization bounds offer guarantees on the ability of a network to accurately predict new examples that it has not seen during training.
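The low-rank property in particular is easy to inspect numerically: a weight matrix is approximately low rank when only a few of its singular values carry nearly all of its energy. The snippet below is purely illustrative and uses a fabricated matrix rather than weights from a trained network.

```python
# Illustrative only: what "low rank" means for a weight matrix.
import numpy as np

rng = np.random.default_rng(1)
# Fabricated stand-in for a 256x256 weight matrix that is approximately rank 5.
W = rng.normal(size=(256, 5)) @ rng.normal(size=(5, 256)) + 1e-3 * rng.normal(size=(256, 256))

s = np.linalg.svd(W, compute_uv=False)          # singular values, largest first
energy = np.cumsum(s**2) / np.sum(s**2)
effective_rank = int(np.searchsorted(energy, 0.99)) + 1
print(f"matrix shape {W.shape}, effective rank (99% of energy): {effective_rank}")
```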
The authors found that the same theoretical observation that predicts a low-rank bias also predicts the existence of an intrinsic SGD noise in the weight matrices and in the output of the network. This noise is not generated by the randomness of the SGD algorithm, but by an interesting dynamic trade-off between rank minimization and fitting of the data, which provides an intrinsic source of noise similar to what happens in dynamical systems in the chaotic regime. Such a random-like search may be beneficial for generalization because it may prevent overfitting.
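In equation form, the trade-off described above comes from an objective of this general shape, written here as a standard quadratic loss with weight decay; the notation is ours rather than the paper's.

```latex
\mathcal{L}(W_1,\dots,W_L)
  \;=\; \frac{1}{N}\sum_{n=1}^{N}\bigl\| f_{W_1,\dots,W_L}(x_n) - y_n \bigr\|^2
  \;+\; \lambda \sum_{k=1}^{L} \lVert W_k \rVert_F^2
```

The first term pushes the network to fit the data, while the weight-decay term pulls the weight matrices toward small norm; the competition between the two is the dynamic trade-off that, according to the authors, acts like an internal source of noise during training.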
“Interestingly, this result validates the classical theory of generalization, showing that traditional bounds are meaningful. It also provides a theoretical explanation for the superior performance in many tasks of sparse networks, such as CNNs, with respect to dense networks,” comments co-author and MIT McGovern Institute postdoc Tomer Galanti. In fact, the authors prove new norm-based generalization bounds for CNNs with localized kernels, that is, networks with sparse connectivity in their weight matrices.
In this case, generalization can be orders of magnitude better than that of densely connected networks. This result validates the classical theory of generalization, showing that its bounds are meaningful, and runs contrary to a number of recent papers expressing doubts about past approaches to generalization. It also provides a theoretical explanation for the superior performance of sparse networks, such as CNNs, with respect to dense networks. Thus far, the fact that CNNs, and not dense networks, represent the success story of deep networks has been almost completely ignored by machine learning theory. Instead, the theory presented here suggests that this is an important insight into why deep networks work as well as they do.
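One way to see the kind of sparsity involved is to count the weights of a convolutional layer with localized kernels against a fully connected layer acting on the same small image. This is an illustrative comparison of our own, not an experiment from the paper.

```python
# Rough illustration (our construction): localized convolutional kernels use far
# fewer weights than a dense layer mapping the same input to the same output size.
import torch.nn as nn

h, w, c = 32, 32, 3                               # a small image-like input
dense = nn.Linear(h * w * c, h * w * c)           # fully connected: every pixel to every pixel
conv = nn.Conv2d(c, c, kernel_size=3, padding=1)  # localized 3x3 kernels, same output size

def count(m):
    return sum(p.numel() for p in m.parameters())

print(f"dense layer parameters: {count(dense):,}")   # roughly 9.4 million
print(f"conv layer parameters:  {count(conv):,}")    # under one hundred
```

The convolutional layer connects each output only to a small local patch of the input (and shares those weights across positions), which is the kind of sparse connectivity in the weight matrices that the norm-based bounds above address.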
“This study provides one of the first theoretical analyses covering optimization, generalization, and approximation in deep networks and offers new insights into the properties that emerge during training,” says co-author Tomaso Poggio, the Eugene McDermott Professor in the Department of Brain and Cognitive Sciences at MIT and co-director of the Center for Brains, Minds and Machines. “Our results may help us better understand why deep learning works as well as it does.”