Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
By Paul Pu Liang et al.
Published on Oct. 10, 2022
Table of Contents
1 INTRODUCTION
2 FOUNDATIONAL PRINCIPLES IN MULTIMODAL RESEARCH
2.1 Principle 1: Modalities are Heterogeneous
2.2 Principle 2: Modalities are Connected
2.3 Principle 3: Modalities Interact
Summary
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning by integrating multiple communicative modalities, including linguistic, acoustic, visual, tactile, and physiological messages. This paper provides an overview of the computational and theoretical foundations of the field. It defines three key principles (the heterogeneity of modalities, the connections between them, and the interactions among them) and discusses the challenges that heterogeneous data sources and the interconnections between modalities pose for multimodal research. It then proposes a taxonomy of six core technical challenges: representation, alignment, reasoning, generation, transference, and quantification. The taxonomy covers both historical and recent trends in multimodal learning.