Table of Contents
1. Introduction
2. Preliminaries: Probing Multimodal Feature Space
3. Proposed Method: Retrieval-then-Optimization
Summary
This paper presents a method for pre-training text-to-image generation models on image-only datasets. The method uses a retrieval-then-optimization procedure to synthesize pseudo text features that align more closely with the images. The proposed method, LAFITE 2, transfers well across few-shot, semi-supervised, and fully-supervised text-to-image generation. The experiments report state-of-the-art results on fully-supervised text-to-image generation tasks.
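The retrieval-then-optimization idea can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that image and text features live in a shared embedding space, that retrieval means picking the top-k most similar text features from a bank, and that optimization means gradient ascent on cosine similarity with the image feature. The summary does not specify the objective, so the cosine objective is a stand-in, and the function name, `k`, `steps`, and `lr` are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def retrieve_then_optimize(image_feat, text_bank, k=4, steps=50, lr=0.1):
    """Return (retrieved, optimized) pseudo text features for one image.

    image_feat: (d,) image feature; text_bank: (n, d) candidate text features.
    All hyperparameters are illustrative assumptions, not values from the paper.
    """
    img = l2_normalize(image_feat)
    bank = l2_normalize(text_bank)

    # Retrieval: average the k text features most similar to the image.
    sims = bank @ img
    top_k = np.argsort(-sims)[:k]
    retrieved = l2_normalize(bank[top_k].mean(axis=0))

    # Optimization: gradient ascent on cosine similarity, kept on the unit
    # sphere (assumed objective; the source does not state the actual loss).
    pseudo = retrieved.copy()
    for _ in range(steps):
        # Component of the gradient tangent to the sphere at `pseudo`.
        grad = img - (pseudo @ img) * pseudo
        pseudo = l2_normalize(pseudo + lr * grad)
    return retrieved, pseudo


# Usage with random stand-in features (no real encoder involved).
rng = np.random.default_rng(0)
image_feat = rng.normal(size=16)
text_bank = rng.normal(size=(100, 16))
retrieved, pseudo = retrieve_then_optimize(image_feat, text_bank)
```

In this sketch the optimized feature is at least as similar to the image feature as the retrieved one, which is the sense in which the pseudo text feature becomes "better aligned" here.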