Robust Speech Recognition via Large-Scale Weak Supervision
By Alec Radford et al.
Published on Dec. 6, 2022
Table of Contents
Abstract
1. Introduction
2. Approach
2.1. Data Processing
2.2. Model
2.3. Multitask Format
2.4. Training Details
Summary
This paper studies speech processing systems trained with large-scale weak supervision on audio paired with transcripts from the internet. The proposed approach, called Whisper, scales weakly supervised speech recognition to 680,000 hours of multilingual and multitask labeled audio. Models trained at this scale generalize well to standard benchmarks in a zero-shot transfer setting, often approaching prior fully supervised results without any dataset-specific fine-tuning, which the authors argue is key to speech recognition that works reliably across diverse datasets. The models and inference code are released to serve as a foundation for further research on robust speech processing.
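The multitask supervision mentioned above works by conditioning a single decoder on a short prefix of special tokens that specifies the task. A minimal sketch of that prefix construction, using the token names from the paper's Multitask Format section (the helper function and its signature are illustrative, not part of the released code):

```python
# Hedged sketch: Whisper tells one decoder which task to perform via a
# sequence of special tokens, per the paper's Multitask Format section.
# Token spellings follow the paper; vocabulary ids are omitted here.

def build_task_prefix(language: str, task: str, timestamps: bool = True) -> list[str]:
    """Build the special-token prefix that conditions the decoder.

    language:   language tag, e.g. "en" or "fr"
    task:       "transcribe" (same-language text) or "translate" (to English)
    timestamps: when False, append <|notimestamps|> to disable
                timestamp prediction
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens

# Example: English transcription without timestamp prediction
print(build_task_prefix("en", "transcribe", timestamps=False))
# → ['<|startoftranscript|>', '<|en|>', '<|transcribe|>', '<|notimestamps|>']
```

Because every task is expressed through this prefix rather than through separate heads or fine-tuned checkpoints, one model can handle multilingual transcription, speech translation, and timestamped output.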