MART: Improving LLM Safety with Multi-round Automatic Red-Teaming

By S. Ge et al.
Published on Nov. 13, 2023

Table of Contents

1. Introduction
2. Approach
2.1 Initialization Model and Instruction Tuning Seed
2.2 Jailbreaking with Adversarial LLM
2.3 Feedback Guided Safety Finetuning
2.4 Iteratively Training M_adv and M_tgt

Summary

The paper proposes Multi-round Automatic Red-Teaming (MART), a method that improves the safety of Large Language Models (LLMs) by combining automatic adversarial prompt writing with safe response generation. In the proposed framework, an adversarial LLM and a target LLM interact over multiple rounds: the adversarial model writes prompts intended to elicit unsafe responses, and the target model is then finetuned on safety data derived from those interactions, making red-teaming scalable. The paper covers the initialization model and instruction-tuning seed, training the adversarial LLM to jailbreak the target, feedback-guided safety finetuning, and the iterative training process that adapts to model updates and newly exposed vulnerabilities.
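
As a rough illustration of the iterative loop described above, here is a minimal Python sketch of one possible multi-round schedule. All names (adv_generate, tgt_generate, safety_score, finetune_adv, finetune_tgt) and the threshold values are hypothetical placeholders chosen for the example, not APIs or numbers from the paper; the intent is only to show how attack generation, safety scoring, and the two finetuning steps alternate each round.

```python
from typing import Callable, List, Sequence, Tuple


def mart_loop(
    adv_generate: Callable[[str], str],         # adversarial LLM: prompt -> candidate attack prompt
    tgt_generate: Callable[[str], str],         # target LLM: prompt -> response
    safety_score: Callable[[str, str], float],  # safety scorer: (prompt, response) -> score in [0, 1]
    finetune_adv: Callable[[List[str]], None],          # update adversarial LLM on successful attacks
    finetune_tgt: Callable[[List[Tuple[str, str]]], None],  # update target LLM on safe (prompt, response) pairs
    seed_prompts: Sequence[str],
    num_rounds: int = 4,
    attack_success_threshold: float = 0.3,  # below this, the attack elicited an unsafe response
    safe_response_threshold: float = 0.8,   # above this, the response is safe enough to imitate
) -> None:
    """Alternate red-teaming and safety finetuning for a fixed number of rounds."""
    prompts = list(seed_prompts)
    for _ in range(num_rounds):
        # The adversarial model rewrites each prompt into a candidate jailbreak.
        attacks = [adv_generate(p) for p in prompts]

        # The current target model answers every candidate attack.
        responses = [tgt_generate(a) for a in attacks]
        scores = [safety_score(a, r) for a, r in zip(attacks, responses)]

        # Attacks that elicited unsafe (low-scoring) responses become training
        # signal for the adversarial model.
        successful_attacks = [a for a, s in zip(attacks, scores) if s < attack_success_threshold]
        finetune_adv(successful_attacks)

        # Safe, high-scoring responses become supervised targets for the target model.
        safe_pairs = [
            (a, r)
            for a, r, s in zip(attacks, responses, scores)
            if s >= safe_response_threshold
        ]
        finetune_tgt(safe_pairs)

        # The next round starts from the latest attack prompts, so each model
        # keeps adapting to the other's most recent behavior.
        prompts = attacks
```

In this sketch the two models are updated once per round from opposite slices of the same interaction data, which is the core idea of the iterative setup summarized above; the concrete data selection criteria and finetuning details are those of the paper, not this example.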