When Do Flat Minima Optimizers Work?

By Jean Kaddour et al.

Table of Contents

1. Introduction
2. Background and Related Work
2.1 Stochastic Gradient Descent (SGD)
2.2 Stochastic Weight Averaging (SWA)
2.3 Sharpness-Aware Minimization (SAM)
2.4 Other Flat-Minima Optimizers
3. How do minima found by SWA and SAM differ?
3.1 What is between non-flat and flat solutions?
3.2 What happens if we average SAM iterates?
3.3 How 'flat' are the found minima?

Summary

Flat-minima optimizers have recently shown improvements in neural network generalization performance. Two popular methods, Stochastic Weight Averaging (SWA) and Sharpness-Aware Minimization (SAM), have received particular attention. This paper compares the properties of SWA and SAM and benchmarks them across a range of domains, revealing several surprising findings intended to help researchers and practitioners choose the right optimizer for their task. The authors examine the mechanics behind SWA and SAM, their performance across tasks, and the implications of finding flat minima. Comparing SWA and SAM solutions on deep learning tasks yields insights into their geometric properties and generalization behavior. Averaging SAM iterates may further improve generalization, leading to Weight-Averaged Sharpness-Aware Minimization (WASAM). The study also quantifies the flatness of the minima found by SWA and SAM, showing that SAM leads to flatter minima than SWA.
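
As a rough illustration of the ideas summarized above, the sketch below combines a SAM-style two-step update with an SWA-style running average of the iterates, in the spirit of WASAM. It assumes a standard PyTorch model, loss function, and base optimizer; the function names, the perturbation radius `rho`, and the averaging rule are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def sam_step(model, loss_fn, x, y, base_opt, rho=0.05):
    """One SAM-style step: perturb the weights toward the locally sharpest
    direction, then descend using the gradient computed at the perturbed point."""
    # 1) Gradient at the current weights w.
    base_opt.zero_grad()
    loss_fn(model(x), y).backward()

    # 2) Ascend to w + eps, where eps = rho * g / ||g||.
    grads = [p.grad.detach().clone() for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    eps = [rho * g / (grad_norm + 1e-12) for g in grads]
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.add_(e)

    # 3) Gradient at the perturbed weights, then undo the perturbation.
    base_opt.zero_grad()
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)

    # 4) Update the unperturbed weights with the perturbed-point gradient.
    base_opt.step()

def update_average(avg_params, model, n_averaged):
    """SWA-style running average of the iterates: avg <- avg + (w - avg) / (n + 1)."""
    with torch.no_grad():
        for a, p in zip(avg_params, model.parameters()):
            a.add_((p - a) / (n_averaged + 1))
```

In a training loop, one would call `sam_step` on each batch and `update_average` on a chosen schedule (e.g. once per epoch after a warm-up phase), then evaluate the averaged weights; the schedule here is a placeholder rather than the paper's prescription.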