CMU-CS-24-147
Computer Science Department
School of Computer Science, Carnegie Mellon University

Efficient Deep Learning
Anders Øland
Ph.D. Thesis
August 2024
It is well known that training deep neural networks is a computationally intensive process. Given the proven utility of deep learning, efficiency is thus an important concern. We study various aspects of deep networks with the goal of making the training process more efficient with respect to memory consumption and execution time. In doing so, we also discuss and contribute to various theoretical facets of the field.

An unavoidable component in dealing with very large models is the need to distribute the training over many, sometimes hundreds or thousands of, computational devices. This usually induces a considerable communication overhead that increases the risk of under-utilization of the system resources. To address this, we show that the entropy of the weights decreases during training, so that the weights become highly compressible, allowing for a considerable reduction in said overhead.

It is common practice to use squashing functions, like the softmax, at the output layer of neural nets. We study the effect these functions have on the gradient signal and argue that they may contribute to the well-known vanishing gradient problem. To address this, we introduce non-squashing alternatives and provide evidence suggesting that they improve the convergence rate.

Our main contribution is in layer-wise training of deep networks. First, we make various useful observations on the properties of hidden layers and representations. Motivated by those, we show that layer-wise training can match the results of full-model backprop, while considerably reducing the memory footprint of the training process. We discuss the effect of implicit interlayer regularization (also known as the implicit bias of depth) and introduce new conjectures on its theoretical origin. Based on these, we show that interlayer regularization can be simulated in a few simple steps.
Additionally, we introduce partition-wise training, which may speed up the optimization process by allowing for larger batch sizes and improved model parallelism.

Finally, we take a look beyond gradient descent. Drawing on the understanding gained of how and what neural nets learn, we introduce a novel solution to fitting multilayer perceptrons to training data. While it can outperform backpropagation with stochastic gradient descent on various toy problems, it tends to overfit and be capacity-hungry on more complex real data. We discuss why, and point to future ways of addressing this. This solution can be expressed in closed form, although we expect that it will evolve into a hybrid iterative approach. We also suspect that our method might be a substantially better candidate than backprop for training deep nets on quantum computers.

159 pages
Thesis Committee:
Srinivasan Seshan, Head, Computer Science Department