Table of Contents
1. Introduction
2. Audio Spectrogram Transformer
2.1. Model Architecture
2.2. ImageNet Pretraining
3. Experiments
3.1. AudioSet Experiments
3.1.1. Dataset and Training Details
3.1.2. AudioSet Results
3.1.3. Ablation Study
Summary
In this paper, the Audio Spectrogram Transformer (AST) is introduced as the first convolution-free, purely attention-based model for audio classification. AST converts an audio waveform into a log Mel filterbank spectrogram, splits the spectrogram into a sequence of overlapping 16x16 patches, linearly projects each patch into an embedding, adds positional embeddings, and feeds the sequence to a Transformer encoder; the output of a prepended [CLS] token is used for classification. The paper also shows that initializing AST from an ImageNet-pretrained Vision Transformer, with the positional embeddings cut and bilinearly interpolated to fit the audio patch grid, substantially improves performance. AST achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2, outperforming previous CNN-attention hybrid systems.
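To make the pipeline concrete, below is a minimal PyTorch sketch of the AST forward pass as described above. It is an illustrative reconstruction, not the authors' reference implementation: the class name ASTSketch and the use of a plain nn.TransformerEncoder are assumptions, while the hyperparameters (16x16 patches with stride 10 in time and frequency, 768-dimensional embeddings, 12 layers, 12 heads, 527 AudioSet classes, and a 1024-frame x 128-Mel-bin input for a 10 s clip) follow the paper.

```python
# Minimal sketch of the AST architecture; hyperparameters from the paper,
# everything else (names, plain nn.TransformerEncoder) is an assumption.
import torch
import torch.nn as nn

class ASTSketch(nn.Module):
    def __init__(self, n_mels=128, n_frames=1024, patch=16, stride=10,
                 dim=768, depth=12, heads=12, n_classes=527):
        super().__init__()
        # Overlapping 16x16 patchify + linear projection in one step:
        # a conv whose stride (10) is smaller than its kernel (16).
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=stride)
        n_patches = (((n_frames - patch) // stride + 1) *
                     ((n_mels - patch) // stride + 1))   # 1212 for a 10 s clip
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        # Pre-norm GELU blocks, as in ViT-style encoders.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           activation="gelu",
                                           norm_first=True, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, spec):                      # spec: (batch, frames, mels)
        x = self.patch_embed(spec.unsqueeze(1))   # (batch, dim, t', f')
        x = x.flatten(2).transpose(1, 2)          # (batch, n_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                 # classify from [CLS] token

model = ASTSketch()
logits = model(torch.randn(2, 1024, 128))         # -> (2, 527) class logits
```

In the paper's actual setup, the patch projection and Transformer weights are not trained from scratch as above but initialized from an ImageNet-pretrained DeiT checkpoint, which is where the positional embedding interpolation comes in.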