Tensor Programs IVb: Adaptive Optimization in the ∞-Width Limit
By Greg Yang et al.
Published on Aug. 7, 2023
Table of Contents
1 Introduction
1.1 Related Work
1.2 Notations
1.2.1 The Tensor Program Ansatz: Representing Vectors via Random Variables
1.2.2 IID Copies
1.2.3 Big-O Notation
2 Exposition of Main Results
2.1 Optimizers
2.2 abcd-Parametrizations
2.3 Setup
2.4 Neural Tangent
2.4.1 Continuous Time Intuition
2.4.2 Infinite-Width Limit
2.4.3 Lack of Feature Learning
2.5 Maximal Update
2.5.1 Shallow Infinite-Width Limit
2.6 NE⊗OR⊤: Tensor Program with Nonlinear Outer Products
2.6.1 The NE⊗OR⊤ Language
2.6.2 Setups
2.6.3 Limit Objects
2.6.4 The Master Theorem
2.7 Maximal Update for Deep MLP
2.8 Dynamical Dichotomy: The Classification of abcd-Parametrizations
2.8.1 Technical Assumptions
2.8.2 Size of Feature Learning
2.8.3 Stability and Faithfulness
2.8.4 Nontriviality
2.8.5 Feature Learning and Operator Regimes
2.8.6 Classification of abcd-Parametrizations
2.9 Infinite-Width Limits for Any Architecture
2.9.1 Representing Architectures via Tensor Programs
2.9.2 abcd-Parametrization for Any Architecture
2.9.3 Interlude: Backpropagation and Total Programs
2.9.4 Training Setup
2.9.5 Neural Tangent Limit
2.9.6 Maximal Update Limit
2.10 Extensions
2.10.1 Weight Decay
2.10.2 Update Clipping and Normalization
2.10.3 Second Moment Factoring à la Adafactor
2.10.4 Future Optimizers
2.11 Proof Sketch
3 Proofs of Infinite-Width Limits
3.1 Proof of Classification of abcd-Parametrizations
3.1.1 Program Setup
3.1.2 Program Construction
3.1.3 The Infinite-Width Limit
3.1.4 r = 0 Implies Feature Learning
3.2 Proof of Maximal Update Limit
3.3 Proof of Neural Tangent Limit
4 Proof of Master Theorem
4.1 Basic Tools
4.1.1 Review of Moore-Penrose Pseudo-Inverse
4.1.2 Baranyai’s Theorem
4.2 Basic Objects and Operations
4.2.1 Fixed-Dimensional Random Variables and Vectors
4.2.2 Space of Random Sequences
4.2.3 Multi-Tensors
4.2.4 Constant Tensors
4.2.5 IID Tensors
4.2.6 Averaging over n
4.2.7 Implicit Broadcasting of Nonlinearities on Multi-Tensors
4.2.8 Nonlinear Outer Products
4.3 Vanishing and Bounded Moments
4.3.1 Moment-Bounded Multi-Tensors
4.3.2 Vanishing Multi-Tensors
4.3.3 Equivalence Modulo Vanishing Multi-Tensors
4.3.4 Distributional Equivalence (aka Dequivalence)
4.4 Getting Equivalence From Distributional Equivalence
4.4.1 Proof of Lemma 4.4.5
4.5 Matrix Pseudo-Inverse
4.6 Uniformized Tensor Programs
4.6.1 Uniformized NETSOR⊤
4.6.2 Uniformized NE⊗OR⊤
4.6.3 Setup and Constructions
4.7 NETSOR⊤ Master Theorem, Vectorwise Convergence
4.7.1 Proof Setup
4.7.2 Proof Plan
4.7.3 Exact Formulas
4.7.4 Showing 3
4.7.5 Showing 2
4.7.6 Showing 1
4.8 NE⊗OR⊤ Master Theorem, Vectorwise Convergence
4.9 Proof of NE⊗OR⊤ Master Theorem (Theorem 2.6.10)
5 Experiments
5.1 Numerical Verification
Summary
Going beyond stochastic gradient descent (SGD), this paper explores the new phenomena that emerge in wide neural networks trained by adaptive optimizers such as Adam. It shows that the dichotomy between feature learning and kernel behaviors, previously observed for SGD, also applies to general optimizers like Adam, albeit with a nonlinear notion of 'kernel'. The paper derives the 'neural tangent' and 'maximal update' limits for any architecture, building on a new Tensor Program language, NE⊗OR⊤, and on bra-ket notation borrowed from quantum physics to simplify expressions and calculations. The work summarizes and extends previous results in the Tensor Programs series, providing insight into the behavior of infinite-width neural networks trained by adaptive optimizers.
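To make the shift from SGD concrete, here is a minimal sketch (not taken from the paper) of the standard Adam update written explicitly as an entrywise function of the gradient history. Each coordinate of the update depends only on the same coordinate of past gradients, but nonlinearly, through the ratio of first to square-rooted second moments; this entrywise nonlinearity is the informal reason the resulting infinite-width 'kernel' is nonlinear and why the NE⊗OR⊤ language needs nonlinear outer products. The hyperparameter names are the usual Adam ones and are used purely for illustration.

```python
import numpy as np

def adam_update(grad_history, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step, expressed as an entrywise function of the full
    gradient history g_1, ..., g_t (stacked along the first axis).

    Unlike SGD, whose update is a linear combination of past gradients,
    the returned update is a nonlinear (entrywise) function of them.
    """
    g = np.asarray(grad_history)                # shape (t, ...)
    t = g.shape[0]
    # Unrolled exponential moving averages with zero initialization:
    # m_t = (1 - beta1) * sum_s beta1^(t - s) * g_s, and similarly for v_t.
    w1 = beta1 ** np.arange(t - 1, -1, -1)
    w2 = beta2 ** np.arange(t - 1, -1, -1)
    m = (1 - beta1) * np.tensordot(w1, g, axes=1)
    v = (1 - beta2) * np.tensordot(w2, g ** 2, axes=1)
    # Bias correction, as in the original Adam paper.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return -lr * m_hat / (np.sqrt(v_hat) + eps)
```

For example, `adam_update([g1, g2, g3])` with three gradient arrays of the same shape returns the parameter increment after the third step; setting `beta1 = beta2 = 0` and removing the square root recovers a plain (sign-free) SGD-like rule, which makes the contrast with the linear SGD update easy to see.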