Open Sesame! Universal Black Box Jailbreaking of Large Language Models

By R. Lapid et al.

Table of Contents

1 Introduction
2 Previous Work
3 Threat Model
4 Our Method
5 Experiments and Results

Summary

This paper introduces a genetic-algorithm-based approach for manipulating large language models. The goal is to disrupt a model's alignment with user intent and elicit otherwise restricted, potentially harmful outputs. The technique evolves adversarial prompts that exploit model biases without requiring access to the model's internals, corresponding to a black-box jailbreak attack in which the model's architecture and parameters are inaccessible to the attacker. The approach is evaluated on a dataset of harmful behaviors, and the experiments demonstrate its efficacy. The findings contribute to the ongoing discussion on responsible AI development and highlight the challenges of safeguarding language models against adversarial attacks.
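To make the genetic-algorithm framing concrete, below is a minimal, hypothetical sketch of how a black-box search over adversarial prompt suffixes could be structured. It is not the authors' implementation: the vocabulary, the fitness function `score` (which in the paper's setting would query the target model and rate how close its response is to a disallowed answer), and all hyperparameters are placeholder assumptions chosen only so the loop runs end to end.

```python
import random

# Hypothetical token vocabulary for the suffix; the real attack would draw
# candidate tokens from the target model's tokenizer (assumption).
VOCAB = ["!", "sure", "describe", "please", "ignore", "step", "detail", "now", "##", "ok"]

SUFFIX_LEN = 8      # tokens per adversarial suffix (assumed value)
POP_SIZE = 20       # individuals per generation (assumed value)
GENERATIONS = 50
ELITE = 2           # individuals copied unchanged into the next generation
MUTATION_RATE = 0.1

def score(suffix):
    """Black-box fitness placeholder.

    In a real attack this would send (harmful prompt + suffix) to the target
    LLM and score how close the response is to an affirmative, restricted
    answer. Here it returns a dummy value so the example is runnable.
    """
    return sum(len(tok) for tok in suffix) + random.random()

def random_individual():
    return [random.choice(VOCAB) for _ in range(SUFFIX_LEN)]

def crossover(a, b):
    # Single-point crossover between two parent suffixes.
    cut = random.randint(1, SUFFIX_LEN - 1)
    return a[:cut] + b[cut:]

def mutate(ind):
    # Replace each token with a random one at a small probability.
    return [random.choice(VOCAB) if random.random() < MUTATION_RATE else tok
            for tok in ind]

def evolve():
    population = [random_individual() for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        ranked = sorted(population, key=score, reverse=True)
        next_pop = ranked[:ELITE]  # elitism: keep the best suffixes
        while len(next_pop) < POP_SIZE:
            # Truncation selection from the top half, then crossover + mutation.
            p1, p2 = random.sample(ranked[:POP_SIZE // 2], 2)
            next_pop.append(mutate(crossover(p1, p2)))
        population = next_pop
    return max(population, key=score)

if __name__ == "__main__":
    best = evolve()
    print("best suffix:", " ".join(best))
```

The key property this sketch illustrates is that every step uses only the model's observable responses (via the fitness score), never its weights or gradients, which is what makes the attack black box.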