Meta-Transformer: A Unified Framework for Multimodal Learning
By Y. Zhang et al.
Published on July 20, 2023
Table of Contents
1. Introduction
2. Related Work
3. Meta-Transformer
3.1 Preliminary
3.2 Data-to-Sequence Tokenization
3.3 Unified Encoder
Summary
Meta-Transformer proposes a unified framework for multimodal learning that uses a frozen encoder to process 12 modalities without any paired multimodal training data. Its key idea is to map data from different modalities into a shared token space so that a single set of encoder parameters can serve all of them. The framework has three components: a data-to-sequence tokenizer that converts each modality's input into a token sequence, a modality-shared encoder whose parameters are frozen, and lightweight task-specific heads for downstream tasks. Extensive experiments show that Meta-Transformer performs strongly across datasets spanning these modalities.
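To make the three-stage pipeline concrete, here is a minimal PyTorch sketch of the overall flow: per-modality tokenizers feed a frozen shared encoder, which feeds a task head. All module names, dimensions, and the toy tokenizers are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MetaTransformerSketch(nn.Module):
    """Hypothetical sketch of the Meta-Transformer pipeline:
    per-modality tokenizers -> frozen modality-shared encoder -> task head.
    Sizes and tokenizer designs are placeholders, not the paper's."""

    def __init__(self, embed_dim=768, num_classes=1000):
        super().__init__()
        # Data-to-sequence tokenizers: one lightweight projector per modality.
        # A real image tokenizer would use patch embedding; these are toys.
        self.tokenizers = nn.ModuleDict({
            "image": nn.Linear(3 * 16 * 16, embed_dim),  # flattened 16x16 RGB patches
            "text": nn.Embedding(30000, embed_dim),      # toy vocabulary of token ids
        })
        # Modality-shared encoder: a plain Transformer, frozen after pretraining.
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=12, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)
        for p in self.encoder.parameters():
            p.requires_grad = False  # only tokenizers and heads are trained
        # Task-specific head (here: classification over pooled tokens).
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x, modality):
        tokens = self.tokenizers[modality](x)    # (batch, seq_len, embed_dim)
        features = self.encoder(tokens)          # same frozen weights for all modalities
        return self.head(features.mean(dim=1))   # pool over the sequence, then classify

# Usage: run a batch of "images" through the shared pipeline.
model = MetaTransformerSketch()
patches = torch.randn(2, 196, 3 * 16 * 16)  # 2 images as 196 flattened patches each
logits = model(patches, modality="image")
print(logits.shape)  # torch.Size([2, 1000])
```

The point of the sketch is the parameter sharing: swapping `modality` changes only the cheap tokenizer at the front, while the frozen encoder and its weights are reused unchanged, which is what lets one set of parameters serve many modalities without paired training data.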