VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
By Hangbo Bao et al.
Published on May 27, 2022
Table of Contents
1. Abstract
2. Introduction
3. Related Work
4. Methods
5. Pre-Training Tasks
6. Stagewise Pre-Training
7. Fine-Tuning VLMo on Downstream Tasks
8. Experiments
Summary
The document introduces VLMo, a unified vision-language pretrained model that jointly learns a dual encoder and a fusion encoder within a single modular Transformer network. At its core is the Mixture-of-Modality-Experts (MOME) Transformer, whose blocks combine a shared self-attention layer with modality-specific feed-forward experts, allowing one network to encode images, text, or image-text pairs. The model is pre-trained with image-text contrastive learning, masked language modeling, and image-text matching objectives, and a stagewise pre-training strategy additionally leverages large-scale image-only and text-only data. The pre-trained model can then be fine-tuned as a dual encoder for efficient image-text retrieval or as a fusion encoder for vision-language classification tasks.
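To make the MOME idea concrete, below is a minimal sketch of such a block, assuming PyTorch. It is not the authors' implementation; the expert names ("vision", "language", "vl"), dimensions, and the route-by-modality interface are illustrative assumptions that only reflect the summary above: self-attention is shared across modalities, while each modality is routed to its own feed-forward expert.

```python
# Minimal sketch of a Mixture-of-Modality-Experts (MOME) Transformer block.
# Assumptions: pre-norm residual layout, three hypothetical experts
# ("vision", "language", "vl"), and whole-sequence routing by modality.
import torch
import torch.nn as nn


class MoMEBlock(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12, ffn_dim: int = 3072):
        super().__init__()
        # Self-attention is shared across all modalities.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # One feed-forward expert per modality.
        self.experts = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(dim, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, dim),
            )
            for name in ("vision", "language", "vl")
        })

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # Shared self-attention with a residual connection.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Route the sequence to the expert matching its modality.
        x = x + self.experts[modality](self.norm2(x))
        return x


# The same block can encode an image sequence, a text sequence, or a fused
# image-text sequence simply by switching the expert.
block = MoMEBlock()
image_tokens = torch.randn(2, 197, 768)  # e.g. ViT-style patch embeddings
text_tokens = torch.randn(2, 40, 768)
print(block(image_tokens, "vision").shape)   # torch.Size([2, 197, 768])
print(block(text_tokens, "language").shape)  # torch.Size([2, 40, 768])
```

Because the attention parameters are shared while only the feed-forward experts differ, the same stack of blocks can serve as an image encoder, a text encoder (dual-encoder use), or a fusion encoder over concatenated image-text tokens, which is what enables the retrieval and classification fine-tuning described above.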