LambdaNetworks: Modeling Long-Range Interactions Without Attention
By Irwan Bello
Published on Feb. 17, 2021
Table of Contents
1 Introduction
2 Modeling Long-Range Interactions
3 Lambda Layers
3.1 Lambda layer: transforming contexts into linear functions
3.2 A multi-query formulation to reduce complexity
4 Related Work
5 Experiments
5.1 Lambda layers outperform convolutions and attention layers
5.2 Computational benefits of lambda layers over self-attention
5.3 Hybrids improve the speed-accuracy tradeoff of image classification
5.4 Object detection and instance segmentation results
6 Discussion
A Practical Modeling Recommendations
B Additional Variants
B.1 Complete code with lambda convolution
B.2 Generating lambdas from masked contexts
B.3 Multi-head vs multi-query lambda layers
B.4 Adding expressivity with an extra dimension
C Additional Related Work
C.1 Softmax attention
C.2 Sparse attention
C.3 Linear attention: connections and differences
C.4 Casting channel and spatial attention as lambda layers
C.5 Self-Attention in the visual domain
C.6 Connections to HyperNetworks and expert models
D Additional Experiments
D.1 Ablation study
D.2 Hybrid models study
D.3 Computational efficiency results
E Experimental Details
E.1 Architectural details
E.2 Training details
Summary
The paper introduces lambda layers, an alternative framework to self-attention for capturing long-range interactions between an input and structured contextual information (e.g., a pixel and the pixels surrounding it). Lambda layers transform available contexts into linear functions, termed lambdas, and apply these linear functions to each input separately, bypassing the expensive attention maps of self-attention. The resulting LambdaNetworks outperform their convolutional and attentional counterparts on ImageNet classification and on COCO object detection and instance segmentation, while being more computationally and memory efficient. The paper also studies hybrid designs that combine convolutions with lambda layers; the resulting LambdaResNets improve the speed-accuracy tradeoff of image classification, reaching EfficientNet-level accuracies while being 3.2-4.4x faster on ImageNet.
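To make the mechanism concrete, below is a minimal NumPy sketch of a multi-query lambda layer in the spirit of the paper's description: keys are normalized over the context and combined with the values into a content lambda (plus position lambdas built from relative position embeddings), and the resulting linear function is applied to each query. The tensor names, shapes, and the softmax helper are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lambda_layer(queries, keys, values, embeddings):
    """Multi-query lambda layer sketch (content + position lambdas).

    Illustrative shapes:
      queries:    (b, h, n, k)  -- h query heads share the same lambdas
      keys:       (b, m, k)     -- normalized over the m context positions
      values:     (b, m, v)
      embeddings: (n, m, k)     -- relative position embeddings
    Returns:
      output:     (b, n, h * v)
    """
    b, h, n, k = queries.shape
    v = values.shape[-1]
    keys = softmax(keys, axis=1)  # normalize each key over context positions
    # Summarize the context into linear functions (lambdas).
    content_lambda = np.einsum('bmk,bmv->bkv', keys, values)
    position_lambdas = np.einsum('nmk,bmv->bnkv', embeddings, values)
    # Apply the lambdas to each query separately (no attention map is formed).
    content_out = np.einsum('bhnk,bkv->bnhv', queries, content_lambda)
    position_out = np.einsum('bhnk,bnkv->bnhv', queries, position_lambdas)
    return (content_out + position_out).reshape(b, n, h * v)

# Tiny usage example on random tensors (context length m equals input length n here).
b, n, m, k, v, h = 2, 8, 8, 4, 6, 3
out = lambda_layer(
    np.random.randn(b, h, n, k),
    np.random.randn(b, m, k),
    np.random.randn(b, m, v),
    np.random.randn(n, m, k),
)
print(out.shape)  # (2, 8, 18)
```

In the multi-query formulation sketched here, the lambdas are shared across the h query heads, which is what reduces time and memory complexity relative to multi-head attention (see Section 3.2).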