Summary
This paper presents a comprehensive survey of Transformer techniques oriented toward multimodal data. It discusses the landscape of multimodal learning, the Transformer ecosystem, and the multimodal big-data era. The survey examines the designs of the Vanilla Transformer, the Vision Transformer (ViT), and multimodal Transformers, reviewing their key components and mathematical formulations in a multimodal context. It also provides a taxonomy of Transformer-based multimodal machine learning, covering applications, challenges, and current research directions. Overall, the paper offers a structured overview of the field that helps researchers grasp both the advances and the open challenges in multimodal learning with Transformers.
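As a point of reference for the formulations the survey reviews, the building block shared by the Vanilla Transformer, ViT, and multimodal Transformers is scaled dot-product attention. A minimal sketch in standard notation (symbols follow common usage, not necessarily the survey's exact conventions):

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]

In a multimodal cross-attention setting, the queries Q may be drawn from one modality (e.g., text tokens) while the keys K and values V come from another (e.g., image patches), which is one of the fusion patterns the survey's taxonomy covers.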