Robust Speech Recognition via Large-Scale Weak Supervision

By Alec Radford et al.
Published on Dec. 6, 2022

Table of Contents

Abstract
1. Introduction
2. Approach
2.1. Data Processing
2.2. Model
2.3. Multitask Format
2.4. Training Details

Summary

This paper studies speech processing systems trained with large-scale weak supervision on audio paired with transcripts. Models trained on 680,000 hours of multilingual and multitask supervision generalize well to standard benchmarks, often performing competitively without any fine-tuning. The work emphasizes the goal of building speech recognition systems that operate reliably across diverse datasets without dataset-specific fine-tuning. The resulting approach, called Whisper, scales weakly supervised speech recognition to this volume of labeled audio, and the models and inference code are released to encourage further research in robust speech processing.