Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models
By Sheng Shen et al.
Published on July 5, 2023
Table of Contents
Abstract
1 Introduction
2 Method
2.1 Model Architecture
2.2 Instruction Fine-tuning Recipe
3 Experiment
3.1 Settings
3.2 Controlled study across scales
3.3 Learning efficiency comparison
3.4 Routing Strategy
[Figure: average held-out score on MMLU-Direct (0-shot) and BBH-Direct (0-shot) versus GFLOPs per token prediction, broken down by architecture type]
Summary
The document discusses how Sparse Mixture-of-Experts (MoE) architectures benefit Large Language Models (LLMs) when combined with instruction tuning. It highlights that MoE models gain more from instruction tuning than dense models across a variety of experimental setups, and that FLAN-MOE surpasses the performance of FLAN-PALM on benchmark tasks at lower FLOPs. The paper emphasizes the role of instruction tuning in improving the performance of MoE models on downstream and held-out tasks, and presents a comprehensive series of experiments showing the superiority of FLAN-MOE over its dense counterparts in different settings.

The model architecture uses sparsely activated MoE layers, which increase model capacity while keeping per-token computation roughly constant. The instruction fine-tuning recipe covers sequence length adaptation, dropout rates, and learning rate settings.

The experiment section evaluates FLAN-MOE models in instruction-tuning contexts and demonstrates improved performance across various benchmarks. Controlled studies across scales show that FLAN-MOE models dominate dense models on a cost-performance basis, particularly in zero-shot and few-shot scenarios. Performance scales with the number of experts but tends to saturate beyond a certain threshold. The document also examines the impact of routing strategies on instruction fine-tuning performance, showing the effectiveness of activating more experts per token. Overall, the paper establishes the efficacy of instruction tuning in enhancing MoE models for language processing tasks.
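To make the sparsely activated MoE layer concrete, the following is a minimal sketch of top-k token-choice routing in plain Python/NumPy. It is not the paper's implementation: the names (`moe_layer`, `gate_w`, `expert_ws`, `top_k`) are illustrative assumptions, and each expert is reduced to a single linear map standing in for a full feed-forward block.

```python
import numpy as np

def moe_layer(tokens, gate_w, expert_ws, top_k=2):
    """Minimal sparse MoE layer sketch: route each token to its top-k experts.

    tokens:    (n_tokens, d_model) activations entering the MoE layer
    gate_w:    (d_model, n_experts) router weights
    expert_ws: list of (d_model, d_model) per-expert weight matrices
               (a single linear map stands in for each expert FFN)
    """
    # Router: softmax over expert logits for every token.
    logits = tokens @ gate_w                          # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    # Keep only the top-k experts per token; the rest are never evaluated,
    # which is what keeps per-token FLOPs nearly constant as experts grow.
    top_idx = np.argsort(-probs, axis=-1)[:, :top_k]  # (n_tokens, top_k)

    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        for e in top_idx[t]:
            # Combine expert outputs, weighted by the renormalized gate.
            weight = probs[t, e] / probs[t, top_idx[t]].sum()
            out[t] += weight * (tokens[t] @ expert_ws[e])
    return out

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
d_model, n_experts, n_tokens = 8, 4, 3
x = rng.normal(size=(n_tokens, d_model))
gate = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
print(moe_layer(x, gate, experts, top_k=2).shape)  # (3, 8)
```

The sketch illustrates why routing strategy matters for instruction fine-tuning: raising `top_k` activates more experts per token, trading extra compute for additional capacity, which mirrors the routing comparison discussed in the summary above.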