Table of Contents
Abstract
1 Introduction
2 Methods
2.1 Scope
2.2 Base Models
2.3 Generator
2.4 Data Collection
2.5 Outcome-supervised Reward Models (ORMs)
2.6 Process-supervised Reward Models (PRMs)
3 Large-scale Supervision
4 Small-scale Synthetic Supervision
4.1 Process vs Outcome Supervision
Summary
Large language models have improved markedly at complex multi-step reasoning, but they still make logical mistakes. More reliable models can be trained with either outcome supervision, which gives feedback only on the final result, or process supervision, which gives feedback on each intermediate reasoning step. Process supervision significantly outperforms outcome supervision: a large-scale process-supervised reward model (PRM) surpasses both an outcome-supervised reward model (ORM) and majority voting, and small-scale synthetic experiments that directly compare the two supervision signals confirm this advantage. Active learning further improves the data efficiency of process supervision. The PRM800K dataset of step-level human feedback is released to support related research.
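To make the distinction concrete, the sketch below contrasts how the two reward models might be used to rerank candidate solutions at test time: an ORM assigns one correctness score to a full solution, while a PRM scores each step and the solution score is taken as the product of per-step correctness probabilities, following the aggregation described in the paper. The scorer functions here are hypothetical stand-ins for trained reward models, not part of any released API; this is a minimal illustration, not the authors' implementation.

```python
from typing import Callable, List
import math

# Hypothetical stand-ins for trained reward models; a real ORM/PRM would be a
# fine-tuned language model returning a probability of correctness.
OrmScorer = Callable[[str, str], float]               # (problem, full_solution) -> P(correct)
PrmScorer = Callable[[str, List[str]], List[float]]   # (problem, steps) -> P(correct) per step


def orm_rank(problem: str, candidates: List[str], orm_score: OrmScorer) -> str:
    """Best-of-N with an outcome-supervised reward model:
    score each full solution once and keep the highest-scoring one."""
    return max(candidates, key=lambda solution: orm_score(problem, solution))


def prm_solution_score(problem: str, steps: List[str], prm_scores: PrmScorer) -> float:
    """Aggregate step-level scores into a single solution score as the product
    of per-step correctness probabilities (computed in log space for stability)."""
    step_probs = prm_scores(problem, steps)
    return math.exp(sum(math.log(max(p, 1e-9)) for p in step_probs))


def prm_rank(problem: str, candidates: List[List[str]], prm_scores: PrmScorer) -> List[str]:
    """Best-of-N with a process-supervised reward model:
    keep the solution whose steps are jointly most likely to be correct."""
    return max(candidates, key=lambda steps: prm_solution_score(problem, steps, prm_scores))
```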