Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
By Mahmoud Assran et al.
Table of Contents
1. Introduction
2. Background
3. Method
4. Related Work
Summary
This paper presents the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a method for self-supervised learning from images that learns highly semantic representations without hand-crafted data augmentations. The core idea is non-generative: rather than reconstructing pixels, I-JEPA predicts the representations of several target blocks in an image from the representation of a single context block in the same image. A carefully designed masking strategy is essential for steering the model toward semantic rather than low-level features. Empirically, I-JEPA combined with Vision Transformers achieves strong performance on a range of downstream tasks without relying on view augmentations during pretraining. The paper details the architecture, the choice of targets and context, the prediction mechanism, and the loss function, and compares I-JEPA with related work, demonstrating its efficiency and scalability.
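To make the predict-in-representation-space idea concrete, here is a minimal PyTorch sketch of one I-JEPA-style training step. It is an illustration under stated assumptions, not the authors' released code: the `context_encoder`, `target_encoder`, and `predictor` modules, the mask tensors, and all shapes are hypothetical stand-ins.

```python
# Illustrative sketch of an I-JEPA-style training step (hypothetical modules).
import torch
import torch.nn.functional as F

def ijepa_step(context_encoder, target_encoder, predictor,
               images, context_mask, target_masks, optimizer, ema=0.996):
    """One training step: predict target-block representations from a context block.

    context_mask: indices of patches visible to the context encoder
    target_masks: list of index tensors, one per target block
    """
    # Target representations come from a separate target encoder;
    # gradients never flow through it (stop-gradient).
    with torch.no_grad():
        target_repr = target_encoder(images)            # (B, N_patches, D)

    # Encode only the context-block patches; target patches are masked out.
    context_repr = context_encoder(images, context_mask)

    # Predict each target block's patch-level representations and regress
    # onto the frozen targets with an L2 loss, averaged over target blocks.
    loss = 0.0
    for tmask in target_masks:
        pred = predictor(context_repr, tmask)           # (B, len(tmask), D)
        loss = loss + F.mse_loss(pred, target_repr[:, tmask])
    loss = loss / len(target_masks)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Keep the target encoder as an exponential moving average
    # of the context encoder's weights.
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(),
                            context_encoder.parameters()):
            p_t.mul_(ema).add_(p_c, alpha=1.0 - ema)
    return loss.item()
```

In this sketch the target encoder would be initialized as a copy of the context encoder and updated only via the moving average, so the loss is computed entirely in representation space, which is what distinguishes I-JEPA from pixel-reconstruction methods.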