Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
By Stephen Casper et al.
Published on Sept. 11, 2023
Table of Contents
1 Introduction
2 Background and Notation
3 Open Problems and Limitations of RLHF
3.1 Challenges with Obtaining Human Feedback
3.1.1 Misaligned Humans: Evaluators may Pursue the Wrong Goals
3.1.2 Good Oversight is Difficult
Summary
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. This paper surveys open problems and fundamental limitations of RLHF and related methods. It discusses challenges with obtaining human feedback, including misaligned evaluators and the difficulty of providing good oversight, and it emphasizes the importance of a multi-layered approach to developing safer AI systems.
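For orientation, the standard RLHF pipeline (as commonly formulated; the symbols below follow convention and are not necessarily the paper's exact notation) first fits a reward model $r_\phi$ to pairwise human preferences with a Bradley-Terry style loss, then fine-tunes the policy $\pi_\theta$ against that learned reward with a KL penalty toward a reference model $\pi_{\mathrm{ref}}$, weighted by a coefficient $\beta$:

$$\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$

$$\max_{\theta}\;\; \mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big]$$

Here $y_w$ is the response the human evaluator preferred over $y_l$. Because both stages are driven by human comparisons, the feedback problems surveyed in Section 3.1 (misaligned evaluators, imperfect oversight) propagate through the entire pipeline.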