Large-Scale Multi-Modal Pre-Trained Models: A Comprehensive Survey
By Xiao Wang et al.
Published on April 10, 2024
Table of Contents
1 Introduction
2 Background
3 Multi-Modal Pre-training
- Task Definition and Key Challenges
- Advantages of MM-PTMs
- Pre-training
Summary
This paper provides a comprehensive survey of large-scale multi-modal pre-trained models (MM-PTMs). It discusses the urgent demand for generalized deep models and the success of single-modality models such as BERT, ViT, and GPT in their respective domains. The survey covers the background of multi-modal pre-training; the key challenges of acquiring and cleaning multi-modal data, designing network architectures, and choosing pre-training objectives; and the importance of large-scale computing power. It also highlights the advantages of MM-PTMs in addressing practical application scenarios and in extracting features shared across modalities. Overall, the paper aims to provide new insights and to help researchers track cutting-edge developments in multi-modal pre-training.
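To give a concrete sense of what a multi-modal pre-training objective can look like, the sketch below implements an image-text contrastive (InfoNCE) loss of the kind popularized by CLIP-style models. This example is illustrative rather than taken from the survey itself: the `contrastive_loss` function name, the temperature value, and the random tensors standing in for encoder outputs are assumptions chosen for brevity.

```python
# Minimal sketch of an image-text contrastive (InfoNCE) pre-training objective.
# The random tensors below stand in for image/text encoder outputs; they are
# not the architectures discussed in the survey.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # Normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity matrix; the diagonal holds the matching pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    # Average of image-to-text and text-to-image cross-entropy.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

if __name__ == "__main__":
    # Toy usage with random "embeddings" in place of real encoder outputs.
    batch, dim = 8, 512
    img = torch.randn(batch, dim)
    txt = torch.randn(batch, dim)
    print(contrastive_loss(img, txt).item())
```

The symmetric loss pulls matched image-text pairs together in a shared embedding space while pushing mismatched pairs apart, which is one common way of extracting features shared across modalities as described above.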