On the Expressivity Role of Layernorm in Transformers’ Attention

By Shaked Brody et al.

Table of Contents

1 Introduction
2 Decomposing LayerNorm
3 Expressivity Role in Attention
3.1 Projection
3.2 Scaling
4 Experimental Results
4.1 Computing Majority
4.2 Unselectable Keys
5 Conclusion
6 Limitations
Acknowledgements
References

Summary

Layer Normalization (LayerNorm) plays a crucial role in the expressivity of the multi-head attention layer in Transformers. The paper decomposes LayerNorm into two components: projection, which maps the input onto the hyperplane orthogonal to the all-ones vector, and scaling, which rescales the result to a fixed norm. Projection lets attention construct queries that attend to all keys equally, while scaling prevents keys from becoming 'unselectable', i.e., unable to ever receive the highest attention score. The experimental results show that LayerNorm aids in tasks such as computing the 'majority' token and eliminates the issue of unselectable keys in practice. The authors emphasize the importance of LayerNorm in Transformers' attention mechanism and provide code for further exploration.
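A minimal numerical sketch of this decomposition (not the authors' released code), assuming standard LayerNorm without the learnable gain and bias: subtracting the mean projects the input onto the hyperplane orthogonal to the all-ones vector, and dividing by the standard deviation rescales the projected vector to norm √d.

```python
import numpy as np

def layernorm(x, eps=1e-5):
    # Standard LayerNorm over the last dimension, without gain/bias.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def project(x):
    # Projection onto the hyperplane orthogonal to [1, 1, ..., 1]:
    # equivalent to subtracting the mean from every coordinate.
    d = x.shape[-1]
    ones = np.ones(d)
    return x - (x @ ones / d) * ones

def scale(x):
    # Rescale the projected vector to L2 norm sqrt(d),
    # which is what dividing by the standard deviation achieves.
    d = x.shape[-1]
    return np.sqrt(d) * x / np.linalg.norm(x)

x = np.random.randn(16)
# LayerNorm(x) == scale(project(x)) up to the small epsilon term.
np.testing.assert_allclose(layernorm(x), scale(project(x)), rtol=1e-3)
```

Viewed this way, the two components can be analyzed separately, which is how the paper attributes the 'attend to all keys equally' ability to the projection step and the removal of unselectable keys to the scaling step.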