Summary
Layer Normalization (LayerNorm) plays a crucial role in the expressivity of the multi-head attention layer in Transformers. This paper decomposes LayerNorm geometrically into two components: projection of the input onto the hyperplane orthogonal to the all-ones vector, and scaling of the result to a fixed norm. Projection lets the attention layer construct queries that attend to all keys equally, while scaling ensures that no key becomes 'unselectable', i.e., unable to receive the highest attention score for any query. The experiments support both roles: the projection component helps on tasks that require uniform attention, such as computing the 'majority' token, while the scaling component eliminates the unselectable keys that otherwise arise without LayerNorm. The authors emphasize the importance of LayerNorm in Transformers' attention mechanism and provide code for further exploration.
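
To make the decomposition concrete, below is a minimal NumPy sketch (not taken from the authors' released code; function names such as `project_and_scale` are illustrative). It checks that LayerNorm without its learned gain and bias equals projection onto the hyperplane orthogonal to the all-ones vector followed by rescaling to norm √d, and that a query proportional to the all-ones vector then attends to all normalized keys equally.

```python
import numpy as np

def layernorm(x, eps=1e-6):
    # LayerNorm without the learned gain/bias, using the population std.
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def project_and_scale(x, eps=1e-6):
    d = x.shape[-1]
    ones = np.ones(d)
    # Projection: drop the component of x along the all-ones vector,
    # i.e. project onto the hyperplane orthogonal to [1, ..., 1].
    proj = x - (x @ ones / d)[..., None] * ones
    # Scaling: rescale the projected vector to norm sqrt(d).
    norm = np.linalg.norm(proj, axis=-1, keepdims=True)
    return np.sqrt(d) * proj / (norm + eps)

rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 16))   # 8 keys of dimension 16
normed = layernorm(keys)

# The two views coincide (up to where epsilon is added).
print(np.allclose(normed, project_and_scale(keys), atol=1e-4))

# Normalized keys are orthogonal to the all-ones vector, so a query
# proportional to [1, ..., 1] scores every key identically, yielding
# the uniform attention pattern the paper attributes to projection.
query = np.ones(16)
scores = normed @ query
weights = np.exp(scores) / np.exp(scores).sum()
print(np.allclose(weights, 1 / 8))
```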