Table of Contents
Abstract
1 Introduction
2 Emu: Predict the Next in Multimodality
2.1 Architecture
2.2 Training Objective
2.3 Generalist Interface
3 Emu Training
3.1 Data
3.2 Pretraining
3.3 Visual Decoding
Summary
Generative Pretraining in Multimodality presents Emu, a Transformer-based multimodal foundation model that can seamlessly generate images and text in a multimodal context. Emu is trained with a unified objective: classifying the next text token or regressing the next visual embedding in the multimodal sequence. This unified formulation enables pretraining on diverse data sources at scale, and Emu demonstrates strong performance on a range of tasks, including image captioning, visual question answering, and text-to-image generation.
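The unified objective described above can be sketched as the sum of two per-position losses: cross-entropy for next-text-token classification and L2 regression for next-visual-embedding prediction. The following is a minimal NumPy sketch under assumed shapes and an assumed additive weighting `alpha`; the actual Emu loss weighting and implementation details may differ.

```python
import numpy as np

def unified_loss(text_logits, text_targets, vis_pred, vis_target, alpha=1.0):
    """Sketch of a unified next-element objective (hypothetical weighting):
    cross-entropy on text-token positions + L2 regression on
    visual-embedding positions, combined with weight `alpha`.

    text_logits : (T, V) logits over the vocabulary at T text positions
    text_targets: (T,)   ground-truth next-token ids
    vis_pred    : (M, D) predicted next visual embeddings at M positions
    vis_target  : (M, D) ground-truth visual embeddings
    """
    # Numerically stable log-softmax over the vocabulary axis
    shifted = text_logits - text_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Mean negative log-likelihood of the correct next tokens
    ce = -log_probs[np.arange(len(text_targets)), text_targets].mean()
    # Mean squared error between predicted and target visual embeddings
    l2 = ((vis_pred - vis_target) ** 2).mean()
    return ce + alpha * l2

# Example usage with random data
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))          # 4 text positions, vocab of 10
targets = np.array([1, 2, 3, 4])
vp = rng.normal(size=(2, 8))               # 2 visual positions, dim 8
loss = unified_loss(logits, targets, vp, vp.copy())
```

Because both terms operate position-by-position on one interleaved sequence, text and image content can be mixed freely in the training data without task-specific heads.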