SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs (via Zoom)

Abstract: Multi-epoch, small-batch Stochastic Gradient Descent (SGD) has been the method of choice for training large overparameterized deep learning models. A popular theory for explaining why SGD solutions generalize well is that the SGD algorithm has an implicit regularization that biases its output toward good solutions. Indeed, for certain simple models, prior work has derived the exact implicit regularizer that corresponds to running SGD. However, we prove in this paper that, in general, no such implicit regularization can explain the generalization of SGD. In fact, by constructing specific instances of stochastic convex optimization problems and of restricted deep learning networks, we demonstrate that there are learning problems where SGD learns but no regularized empirical risk minimizer can match its performance. We also discuss the role of small batch size and multiple epochs in explaining the empirical success of SGD for deep learning.
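
For readers less familiar with the two objects contrasted in the abstract, the minimal sketch below (not from the talk or the paper; the toy linear-regression setup, function names, and hyperparameters are illustrative assumptions) shows what multi-epoch, small-batch SGD looks like next to a regularized empirical risk minimizer, here a ridge solution in closed form. The talk's question is whether some explicit regularizer of this kind can always account for what SGD finds.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: linear regression with squared loss (illustrative only).
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad(w, Xb, yb):
    # Gradient of the average squared loss on a mini-batch.
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

def sgd(epochs=20, batch_size=8, lr=0.05):
    # Multi-epoch, small-batch SGD: repeated passes over the data
    # in shuffled mini-batches, one gradient step per batch.
    w = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            w -= lr * grad(w, X[idx], y[idx])
    return w

def regularized_erm(lam=1.0):
    # Regularized ERM: minimize empirical risk plus an explicit penalty
    # (ridge, solved in closed form).
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_sgd = sgd()
w_erm = regularized_erm()
print("train MSE (SGD):", np.mean((X @ w_sgd - y) ** 2))
print("train MSE (ridge ERM):", np.mean((X @ w_erm - y) ** 2))
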

Bio: Ayush Sekhari is a 4th-year PhD student in the Computer Science department at Cornell University, advised by Professor Karthik Sridharan and Professor Robert D. Kleinberg. His research interests span online learning, reinforcement learning and control, optimization, and the interplay between them. Before coming to Cornell, he spent a year at Google as part of the Brain Residency program. Before Google, he completed his undergraduate studies in computer science at IIT Kanpur in India.