Generative Pretraining in Multimodality

By Quan Sun et al.
Published on July 11, 2023

Table of Contents

Abstract
1 Introduction
2 Emu: Predict the Next in Multimodality
2.1 Architecture
2.2 Training Objective
2.3 Generalist Interface
3 Emu Training
3.1 Data
3.2 Pretraining
3.3 Visual Decoding

Summary

Generative Pretraining in Multimodality presents Emu, a Transformer-based multimodal foundation model that can seamlessly generate images and text in a multimodal context. Emu is trained with a unified objective: classify the next text token or regress the next visual embedding in the multimodal sequence. This single objective makes it possible to explore diverse pretraining data sources at scale, and Emu demonstrates strong performance on tasks including image captioning, visual question answering, and text-to-image generation.
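
To make the unified objective concrete, the sketch below shows one way such a loss could be computed in PyTorch: cross-entropy at positions whose next element is a text token, and L2 regression at positions whose next element is a visual embedding. The head designs, dimensions, and equal weighting of the two terms are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedObjective(nn.Module):
    """Sketch of a unified next-element loss over a multimodal sequence:
    classify the next text token or regress the next visual embedding.
    Hidden size, vocab size, head designs, and the equal weighting of
    the two loss terms are assumptions for illustration."""

    def __init__(self, dim: int = 1024, vocab_size: int = 32000):
        super().__init__()
        self.text_head = nn.Linear(dim, vocab_size)  # next-token classifier
        self.visual_head = nn.Linear(dim, dim)       # next-embedding regressor

    def forward(self, hidden, text_targets, visual_targets, is_visual):
        # hidden:         (B, T, dim)  Transformer outputs per position
        # text_targets:   (B, T)       next-token ids (used at text positions)
        # visual_targets: (B, T, dim)  next visual embeddings (used at visual positions)
        # is_visual:      (B, T) bool  True where the next element is visual
        text_mask = ~is_visual

        # Cross-entropy over positions whose next element is a text token.
        ce = F.cross_entropy(self.text_head(hidden)[text_mask],
                             text_targets[text_mask])

        # L2 regression over positions whose next element is a visual embedding.
        l2 = F.mse_loss(self.visual_head(hidden)[is_visual],
                        visual_targets[is_visual])

        return ce + l2  # assumed equal weighting of the two terms


# Toy usage with random inputs: batch of 2, sequence length 8,
# positions 3-5 treated as visual.
objective = UnifiedObjective()
hidden = torch.randn(2, 8, 1024)
text_targets = torch.randint(0, 32000, (2, 8))
visual_targets = torch.randn(2, 8, 1024)
is_visual = torch.zeros(2, 8, dtype=torch.bool)
is_visual[:, 3:6] = True
loss = objective(hidden, text_targets, visual_targets, is_visual)
```

In this framing, a single autoregressive Transformer can consume interleaved image-text data, since every position in the sequence contributes to the loss through one of the two branches.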