OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

By Peng Wang et al.
Published on June 1, 2022

Table of Contents

ABSTRACT
1. Introduction
2. Related Work
3. OFA
3.1 I/O & Architecture
3.2 Tasks & Modalities
3.3 Pretraining Datasets
3.4 Training & Inference
3.5 Scaling Models

Summary

In this work, the authors propose OFA, a task-agnostic and modality-agnostic framework that supports task comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks in a simple sequence-to-sequence learning framework. The architecture is Transformer-based and designed to handle tasks such as visual grounding, image captioning, and language modeling. OFA is pretrained on 20M publicly available image-text pairs and achieves state-of-the-art performance on a range of tasks. The authors emphasize that unifying architectures, tasks, and modalities yields better generalization and performance on downstream tasks.
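To make the unified sequence-to-sequence interface concrete, here is a minimal sketch of how heterogeneous tasks can be cast as instruction sequences fed to one shared encoder-decoder model. The instruction strings, template names, and helper function below are illustrative assumptions, not OFA's exact prompts or code:

```python
# Hypothetical sketch: every task becomes an instruction string (plus
# task-specific inputs) consumed by a single seq2seq model, so no
# task-specific heads are needed. Templates here are assumptions.
TASK_TEMPLATES = {
    "image_captioning": "what does the image describe?",
    "visual_grounding": 'which region does the text "{text}" describe?',
    "language_modeling": "{text}",
}

def build_instruction(task: str, **kwargs) -> str:
    """Render the instruction sequence for the shared encoder."""
    return TASK_TEMPLATES[task].format(**kwargs)

caption_prompt = build_instruction("image_captioning")
grounding_prompt = build_instruction("visual_grounding", text="a red car")
```

Because every task shares the same input/output format, a single decoder can produce captions, region coordinates (serialized as tokens), or plain text without any task-specific output layers.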