MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
By S. Ge et al.
Published on Nov. 13, 2023
Table of Contents
1. Introduction
2. Approach
2.1 Initialization Model and Instruction Tuning Seed
2.2 Jailbreaking with Adversarial LLM
2.3 Feedback Guided Safety Finetuning
2.4 Iteratively Training M_adv and M_tgt
Summary
The paper proposes Multi-round Automatic Red-Teaming (MART), a method that improves the safety of Large Language Models (LLMs) by combining automatic adversarial prompt writing with safe response generation. In this framework, an adversarial LLM and a target LLM interact iteratively: the adversarial model probes the target for vulnerabilities, and the target is finetuned to respond safely, making safety alignment more scalable. The paper covers the initialization model and instruction-tuning seed, training the adversarial LLM, feedback-guided safety finetuning, and the iterative training process that adapts to model updates and newly exposed vulnerabilities.
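To make the iterative loop concrete, below is a minimal Python sketch of a multi-round adversarial/target training scheme of this kind. All names here (`generate_prompts`, `respond`, `evaluate_safety`, `finetune`, the threshold values) are hypothetical stand-ins rather than the paper's actual implementation; they only illustrate how, in each round, prompts that elicit unsafe responses can become training data for the adversarial model while safe responses become training data for the target model.

```python
# Hypothetical sketch of a multi-round red-teaming loop (not the paper's code).
from typing import List, Tuple


def generate_prompts(adv_model, seed_prompts: List[str]) -> List[str]:
    """Adversarial LLM (M_adv) writes new red-teaming prompts (stub)."""
    return [f"adversarial rewrite of: {p}" for p in seed_prompts]


def respond(tgt_model, prompts: List[str]) -> List[str]:
    """Target LLM (M_tgt) answers the adversarial prompts (stub)."""
    return [f"response to: {p}" for p in prompts]


def evaluate_safety(prompt: str, response: str) -> float:
    """Safety score in [0, 1]; in practice this would come from reward models (stub)."""
    return 0.5


def finetune(model, examples: List[Tuple[str, str]]):
    """Supervised finetuning on selected (input, output) pairs (stub)."""
    return model


def mart_round(adv_model, tgt_model, seed_prompts,
               unsafe_thresh: float = 0.3, safe_thresh: float = 0.8):
    """One round: attack, respond, score, then update both models."""
    prompts = generate_prompts(adv_model, seed_prompts)
    responses = respond(tgt_model, prompts)
    scores = [evaluate_safety(p, r) for p, r in zip(prompts, responses)]

    # Prompts that elicited unsafe responses feed the adversarial model.
    adv_examples = [(seed, p) for seed, p, s in zip(seed_prompts, prompts, scores)
                    if s < unsafe_thresh]
    # Safe, high-quality responses feed the target model.
    tgt_examples = [(p, r) for p, r, s in zip(prompts, responses, scores)
                    if s > safe_thresh]

    return finetune(adv_model, adv_examples), finetune(tgt_model, tgt_examples)


if __name__ == "__main__":
    adv, tgt = object(), object()  # placeholders for the two LLMs
    seeds = ["example seed prompt"]
    for _ in range(4):  # several red-teaming rounds
        adv, tgt = mart_round(adv, tgt, seeds)
```

The key design point illustrated here is the feedback split: the same scored batch of (prompt, response) pairs is partitioned so that each model is trained only on the examples that strengthen its own role in the next round.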