Table of Contents
1 Introduction
2 Related Work
3 Approach
4 Experiments
Summary
In this paper, the authors present a self-supervised representation learning method that draws on two different modalities. The method aims to exploit cross-modal information to train powerful feature representations. Using a two-stream architecture with trainable CNNs, together with cross-modal and diversity loss terms, the authors demonstrate state-of-the-art performance on action recognition datasets. Extensive experiments and ablation studies support the effectiveness of the proposed method.
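The summary does not spell out the exact loss formulations, so the following is only a minimal sketch of how a two-stream setup with a cross-modal term and a diversity term could look. All names (SmallCNN, cross_modal_loss, diversity_loss) and the cosine-similarity-based losses are illustrative assumptions, not the authors' exact method.

```python
# Hypothetical sketch of a two-stream cross-modal training step.
# Assumption: the cross-modal term pulls together features of the two
# modalities for the same clip, while the diversity term pushes apart
# features of different clips within a batch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmallCNN(nn.Module):
    """Tiny trainable CNN encoder standing in for one modality stream."""
    def __init__(self, in_ch=3, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))


def cross_modal_loss(f_a, f_b):
    # Features of the two modalities for the same input should agree.
    return (1.0 - F.cosine_similarity(f_a, f_b, dim=1)).mean()


def diversity_loss(f_a, f_b):
    # Features of different inputs should stay dissimilar: penalize
    # similarity between each sample and a shifted copy of the batch.
    sim_a = F.cosine_similarity(f_a, torch.roll(f_a, 1, dims=0), dim=1)
    sim_b = F.cosine_similarity(f_b, torch.roll(f_b, 1, dims=0), dim=1)
    return (sim_a.mean() + sim_b.mean()) / 2.0


# Usage sketch: x_a and x_b are batches from the two modalities
# (e.g., RGB frames and a second modality such as stacked optical flow).
stream_a, stream_b = SmallCNN(), SmallCNN()
x_a, x_b = torch.randn(8, 3, 112, 112), torch.randn(8, 3, 112, 112)
f_a, f_b = stream_a(x_a), stream_b(x_b)
loss = cross_modal_loss(f_a, f_b) + 0.5 * diversity_loss(f_a, f_b)
loss.backward()
```

The 0.5 weighting between the two loss contributions is an arbitrary placeholder; the relative weighting would be the kind of choice the paper's ablation studies examine.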