Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

By Wenhui Wang et al.
Published on Aug. 31, 2022

Table of Contents

Abstract
1 Introduction: The Big Convergence
2 BEiT-3: A General-Purpose Multimodal Foundation Model
2.1 Backbone Network: Multiway Transformers
2.2 Pretraining Task: Masked Data Modeling
2.3 Scaling Up: BEiT-3 Pretraining
3 Experiments on Vision and Vision-Language Tasks
3.1 Vision-Language Downstream Tasks

Summary

A big convergence of language, vision, and multimodal pretraining is emerging. This work introduces BEiT-3, a general-purpose multimodal foundation model that achieves state-of-the-art transfer performance on both vision and vision-language tasks. The model is pretrained with a single objective, masked data modeling, applied uniformly to images, texts, and image-text pairs: a portion of each input is masked, and the model learns to recover the masked tokens. Experimental results demonstrate the model's superior performance across a range of vision and vision-language benchmarks.
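The masked data modeling objective described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the token ids, the `mask_tokens` helper, and the masking seeds are all hypothetical, and the masking ratios (15% for text tokens, 40% for image patches) are assumptions standing in for whatever the paper actually uses. The sketch only shows the corruption step; the model's job during pretraining is to predict the original tokens at the masked positions.

```python
import random

def mask_tokens(tokens, mask_ratio, mask_id=0, seed=None):
    """Replace a random fraction of tokens with a [MASK] id.

    Returns the corrupted sequence and the list of masked positions;
    the pretraining loss is computed only at those positions.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for p in positions:
        corrupted[p] = mask_id
    return corrupted, positions

# Hypothetical discrete token ids: in BEiT-3, text is tokenized and
# images are turned into discrete visual tokens, so both modalities
# can be corrupted with the same procedure.
text_tokens = [101, 2009, 2003, 1037, 4937, 102]
image_tokens = list(range(200, 216))

# Assumed ratios for illustration: lighter masking for text,
# heavier masking for image patches.
masked_text, text_pos = mask_tokens(text_tokens, 0.15, seed=1)
masked_image, image_pos = mask_tokens(image_tokens, 0.40, seed=1)
```

Because both modalities are reduced to sequences of discrete tokens, one masking-and-prediction loop covers image-only, text-only, and paired image-text data, which is what lets a single pretraining task serve all three input types.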