DataComp: In Search of the Next Generation of Multimodal Datasets

By Samir Yitzhak Gadre et al.
Published on Oct. 20, 2023

Table of Contents

Abstract
1 Introduction
2 Related Work
3 The DataComp Benchmark
3.1 Competition design overview
3.2 CommonPool generation, for the filtering track
3.3 The Bring Your Own Data (BYOD) track

Summary

Multimodal datasets are a critical component of recent breakthroughs such as CLIP, Stable Diffusion, and GPT-4, yet dataset design has received far less research attention than model design. To address this gap, the authors introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants design new filtering techniques or curate new data sources and evaluate their datasets by training CLIP models with standardized code. The benchmark spans multiple compute scales, enabling the study of scaling trends. The results show that the DataComp workflow leads to better training sets: DataComp-1B enables training a CLIP ViT-L/14 model to 79.2% zero-shot accuracy on ImageNet. The paper also covers related work, CommonPool generation for the filtering track, competition design, and the bring your own data (BYOD) track.
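
To make the filtering-track workflow concrete, below is a minimal sketch of one common curation baseline: scoring candidate image-text pairs with a pretrained CLIP model and keeping only pairs whose image-caption similarity exceeds a threshold. This is illustrative only; the model checkpoint, threshold value, and data format are assumptions, not the authors' exact setup.

```python
# Illustrative CLIP-score filtering sketch (model, threshold, and file paths
# are assumptions for demonstration, not the paper's exact configuration).
import torch
import open_clip
from PIL import Image

# Load a pretrained CLIP model and its preprocessing transforms via OpenCLIP.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between an image and its caption under CLIP."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
        img_emb /= img_emb.norm(dim=-1, keepdim=True)
        txt_emb /= txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()

# Keep only pairs whose image-text similarity clears a chosen threshold.
candidate_pool = [
    ("cat.jpg", "a photo of a cat sitting on a couch"),   # hypothetical files
    ("banner.jpg", "click here for the best deals"),
]
filtered = [(img, cap) for img, cap in candidate_pool if clip_score(img, cap) > 0.28]
print(filtered)
```

In the DataComp setup, a subset selected this way would then be fed to the standardized CLIP training code and scored on the downstream evaluation suite; the filtering rule itself is the participant's contribution.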