Summary
The document discusses the architectural universality of Neural Tangent Kernel (NTK) behavior during training: in the infinite-width limit, a neural network trained by SGD evolves as kernel gradient descent in function space with respect to its NTK. The Tensor Programs technique is applied to analyze the SGD dynamics, and the paper introduces a new graphical notation for Tensor Programs to demonstrate that the NTK theory holds across a wide range of architectures. The analysis is illustrated on 1-hidden-layer and 2-hidden-layer MLPs. The results are obtained by unrolling the SGD updates and tracking how the weight matrices interact as they undergo training.
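
For context, the kernel gradient descent claim can be written out as standard NTK-limit dynamics (the notation below is a generic choice, not necessarily the paper's):

$$\partial_t f_t(x) \;=\; -\,\eta \sum_{i=1}^{n} K(x, x_i)\, \frac{\partial \ell\big(f_t(x_i), y_i\big)}{\partial f_t(x_i)}, \qquad K(x, x') = \big\langle \nabla_\theta f_0(x),\, \nabla_\theta f_0(x') \big\rangle,$$

where $(x_i, y_i)$ are the training examples, $\ell$ is the per-example loss, and $K$ is the NTK, which in the infinite-width limit stays fixed at its value at initialization.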
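
The kernel $K$ above can be estimated at finite width as the Gram matrix of parameter gradients. Below is a minimal sketch of this computation for a 1-hidden-layer MLP in NTK parameterization; the width, tanh activation, and random data are illustrative assumptions, not details taken from the paper.

```python
# Empirical NTK of a scalar-output 1-hidden-layer MLP (illustrative sketch).
import jax
import jax.numpy as jnp

def init_params(key, d_in=3, width=512):
    k1, k2 = jax.random.split(key)
    # NTK parameterization: i.i.d. standard Gaussian weights,
    # with 1/sqrt(fan_in) scaling applied in the forward pass.
    W = jax.random.normal(k1, (width, d_in))
    v = jax.random.normal(k2, (width,))
    return {"W": W, "v": v}

def mlp(params, x):
    # 1-hidden-layer MLP with scalar output.
    h = jnp.tanh(params["W"] @ x / jnp.sqrt(x.shape[0]))
    return params["v"] @ h / jnp.sqrt(h.shape[0])

def empirical_ntk(params, xs):
    # Flattened gradient of the output w.r.t. all parameters, per example.
    def flat_grad(x):
        g = jax.grad(mlp)(params, x)
        return jnp.concatenate([leaf.ravel() for leaf in jax.tree_util.tree_leaves(g)])
    J = jax.vmap(flat_grad)(xs)   # shape: (n_examples, n_params)
    return J @ J.T                # NTK Gram matrix, shape: (n_examples, n_examples)

params = init_params(jax.random.PRNGKey(0))
xs = jax.random.normal(jax.random.PRNGKey(1), (4, 3))
print(empirical_ntk(params, xs))
```

As the width grows, this Gram matrix concentrates around its deterministic infinite-width limit, which is the regime the paper's universality result addresses.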