Multimodal Contrastive Learning with LIMoE: The Language-Image Mixture of Experts
By Basil Mustafa et al.
Published on June 6, 2022
Table of Contents
1 Introduction
2 Multimodal Mixture of Experts
2.1 Multimodal contrastive learning
2.2 The LIMoE Architecture
2.2.1 Challenges for multimodal MoEs
2.2.2 Auxiliary losses
2.2.3 Priority routing
3 Experiments
3.1 Controlled study across scales
3.2 Scaling up LIMoE
Summary
Large sparsely activated models have achieved excellent performance in multiple domains. We propose the Language-Image MoE (LIMoE), the first large-scale multimodal mixture of experts model: a sparse model that accepts both images and text simultaneously and is trained with a contrastive loss. We demonstrate in detail how prior approaches to regularising mixture of experts models fall short for multimodal learning, and propose a new entropy-based regularisation scheme to stabilise training. LIMoE offers strong improvements over comparable dense models at every scale, from S/32 up to L/16; it achieves 84.1% zero-shot ImageNet classification accuracy, competitive with state-of-the-art contrastive models that use per-modality backbones and pre-training.
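The summary refers to two ingredients: training with an image-text contrastive loss, and stabilising the experts' router with entropy-based auxiliary losses. Below is a minimal JAX sketch of both ideas, under simplifying assumptions: the function names, shapes, temperature value, and the exact form of the entropy terms are illustrative and are not taken from the LIMoE codebase.

import jax
import jax.numpy as jnp


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Symmetric image-text contrastive loss over a batch of aligned pairs.
    # image_emb, text_emb: [batch, dim] embeddings.
    # L2-normalise so the dot products below are cosine similarities.
    image_emb = image_emb / jnp.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / jnp.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # [batch, batch]
    diag = jnp.arange(logits.shape[0])
    # Each image should match its own text (rows) and vice versa (columns).
    loss_i2t = -jnp.mean(jax.nn.log_softmax(logits, axis=1)[diag, diag])
    loss_t2i = -jnp.mean(jax.nn.log_softmax(logits, axis=0)[diag, diag])
    return 0.5 * (loss_i2t + loss_t2i)


def entropy(p, axis=-1, eps=1e-9):
    return -jnp.sum(p * jnp.log(p + eps), axis=axis)


def routing_entropy_losses(router_probs):
    # Simplified per-modality entropy auxiliary losses for an MoE router.
    # router_probs: [num_tokens, num_experts] softmax routing distribution
    # for the tokens of a single modality.
    # Local term: push each token towards a confident (low-entropy) choice.
    local_loss = jnp.mean(entropy(router_probs, axis=-1))
    # Global term: push the modality as a whole to spread load over many
    # experts, i.e. maximise the entropy of the averaged distribution.
    global_loss = -entropy(jnp.mean(router_probs, axis=0))
    return local_loss, global_loss

In a training step, a sketch like this would add small-weighted local and global entropy terms, computed separately for image tokens and text tokens, to the contrastive loss; the paper's full auxiliary-loss scheme includes further details omitted here.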