Multimodal Learning with Transformers: A Survey

By Peng Xu et al.
Published on May 10, 2023

Table of Contents

1. Introduction
2. Background
3. Transformers
3.1 Vanilla Transformer
3.1.1 Input Tokenization
3.1.2 Multi-Head Self-Attention
3.1.3 Position-Wise Feed-Forward Network
4. Scope
5. Related Surveys
6. Features
7. Contributions

Summary

This paper presents a comprehensive survey of Transformer techniques oriented toward multimodal data. It discusses the landscape of multimodal learning, the Transformer ecosystem, and the multimodal big-data era. The survey examines the designs of the Vanilla Transformer, the Vision Transformer, and multimodal Transformers, reviewing their key components and mathematical formulations in a multimodal context. It also provides a taxonomy for Transformer-based multimodal machine learning, covering applications, challenges, and current research directions. Overall, the survey offers a structured overview of the field that enables researchers to grasp its advances and open challenges.
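For orientation, the formulations the survey revisits for the Vanilla Transformer (Sections 3.1.2 and 3.1.3) are the standard ones from Vaswani et al. (2017); the survey adapts the notation to the multimodal setting, so the following is only a minimal restatement of the originals. Scaled dot-product attention over queries $Q$, keys $K$, and values $V$ is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V,$$

multi-head self-attention concatenates $h$ such attention heads, each with its own learned projections,

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V}),$$

and the position-wise feed-forward network applies two linear layers with a ReLU in between, independently at each token position:

$$\mathrm{FFN}(x) = \max(0,\, x W_1 + b_1)\, W_2 + b_2.$$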