OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
By Peng Wang et al.
Published on June 1, 2022
Table of Contents
ABSTRACT
1. Introduction
2. Related Work
3. OFA
3.1 I/O & Architecture
3.2 Tasks & Modalities
3.3 Pretraining Datasets
3.4 Training & Inference
3.5 Scaling Models
Summary
In this work, the authors propose OFA, a task-agnostic and modality-agnostic framework designed for task comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks within a simple sequence-to-sequence learning framework. The model architecture is based on the Transformer and handles tasks such as visual grounding, image captioning, and language modeling by expressing each task's inputs and outputs as sequences. Pretrained on only 20M publicly available image-text pairs, OFA achieves state-of-the-art performance across a variety of tasks. The authors emphasize that unifying architectures, tasks, and modalities leads to better generalization and stronger performance on downstream tasks.
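The core idea of the unification can be illustrated with a small sketch: every task, cross-modal or unimodal, is mapped to an (instruction, target) pair of sequences that a single encoder-decoder model consumes and produces. The function name, instruction phrasings, and location-token format below are illustrative assumptions, not the authors' actual API; OFA's real pipeline additionally encodes image patches rather than a literal `[image]` placeholder.

```python
# Hypothetical sketch of OFA-style task unification: each task becomes an
# (instruction, target) pair of token sequences for one seq2seq model.
# Names and formats here are assumptions for illustration only.

def build_io_pair(task: str, **inputs) -> tuple[str, str]:
    """Map a task to a (source instruction, target sequence) pair."""
    if task == "caption":
        # Image captioning: instruction plus image -> caption text.
        return ("what does the image describe? [image]", inputs["caption"])
    if task == "language_modeling":
        # Unimodal text task: no image input at all.
        return (inputs["prefix"], inputs["continuation"])
    if task == "grounding":
        # Visual grounding: the target region is emitted as discretized
        # location tokens, so boxes become ordinary vocabulary items.
        x0, y0, x1, y1 = inputs["box"]  # normalized coordinates in [0, 1]
        target = " ".join(f"<bin_{int(v * 1000)}>" for v in (x0, y0, x1, y1))
        src = f'which region does the text "{inputs["text"]}" describe? [image]'
        return (src, target)
    raise ValueError(f"unknown task: {task}")

src, tgt = build_io_pair("caption", caption="a dog runs on the grass")
```

Because every target, including bounding boxes, is a plain token sequence, a single cross-entropy objective and a single decoder suffice for all tasks; this is what makes the framework task- and modality-agnostic.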