A Survey of Deep Learning Optimizers - First and Second Order Methods
By Rohan V. Kashyap et al.
Published on Aug. 10, 2023
Table of Contents
1. Introduction
1.1 Mathematical Preliminaries and Notations
1.2 Literature Overview
1.3 Flat Minima
1.4 Linear Subspace
1.5 SGD Trajectory
1.6 SGD Analysis
1.7 Curse of Dimensionality
1.8 Critical Points
1.9 Difficulties in Neural Network Optimization
2. First Order Methods
2.1 Momentum
Summary
This paper provides a comprehensive review of 14 standard optimization methods used in deep learning research, addressing the challenge of minimizing high-dimensional, non-convex loss functions. It gives a theoretical assessment of the difficulties in numerical optimization, including saddle points, local minima, ill-conditioning of the Hessian, and the impact of regularization terms on the empirical risk, as well as the broader problem of navigating error surfaces in high-dimensional spaces, where local and global minima are hard to distinguish.

The survey examines gradient-based techniques for training neural networks, including gradient descent, quasi-Newton methods, BFGS, and conjugate gradient. It also discusses the generalization capabilities of deep networks, the role of non-linear activation functions in avoiding gradient problems such as vanishing and exploding gradients, and the significance of optimization hyper-parameters. Finally, the paper explores how momentum accelerates first-order gradient-based methods in regions of high curvature.
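To make the momentum idea concrete, here is a minimal NumPy sketch of the classical (heavy-ball) momentum update applied to a toy ill-conditioned quadratic. The function name, learning rate, and momentum coefficient are illustrative choices for this sketch, not values taken from the paper.

```python
import numpy as np

def sgd_momentum(grad_fn, theta0, lr=0.01, mu=0.9, num_steps=200):
    """Classical (heavy-ball) momentum:
        v_{t+1}     = mu * v_t - lr * grad(theta_t)
        theta_{t+1} = theta_t + v_{t+1}
    """
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(num_steps):
        g = grad_fn(theta)      # (stochastic) gradient at the current iterate
        v = mu * v - lr * g     # velocity: decaying accumulation of past gradients
        theta = theta + v       # step along the velocity, not the raw gradient
    return theta

# Illustrative example: an ill-conditioned quadratic f(x) = 0.5 * x^T A x,
# where curvature differs sharply across directions (condition number 100).
A = np.diag([100.0, 1.0])
grad_fn = lambda x: A @ x
print(sgd_momentum(grad_fn, theta0=[1.0, 1.0], lr=0.009))
```

In this setting the accumulated velocity damps oscillations along the high-curvature direction while building up speed along the shallow one, which is the acceleration effect the paper attributes to momentum.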