Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems

By Y. Wan et al.
Read the original document by opening this link in a new tab.

Table of Contents

1 Introduction
2 Background
2.1 Biases in Dialogue Models
2.2 Persona Biases in Dialogue Systems
3 UNIVERSAL PERSONA Collection
4 Method
4.1 Re-Defining Persona Biases
4.2 Evaluation Methods
4.2.1 Biases in Harmful Expression
4.2.2 Biases in Harmful Agreement
5 Conclusion

Summary

Recent advancements in Large Language Models empower them to follow freeform instructions, including imitating generic or specific demographic personas in conversations. We define generic personas to represent demographic groups, such as 'an Asian person', whereas specific personas may take the form of specific popular Asian names like 'Yumi'. While the adoption of personas enriches user experiences by making dialogue systems more engaging and approachable, it also casts a shadow of potential risk by exacerbating social biases within model responses, thereby causing societal harm through interactions with users. In this paper, we systematically study 'persona biases' - the sensitivity of dialogue models' harmful behaviors contingent upon the personas they adopt. We categorize persona biases into biases in harmful expression and harmful agreement, and establish a comprehensive evaluation framework to measure persona biases in five aspects: Offensiveness, Toxic Continuation, Regard, Stereotype Agreement, and Toxic Agreement. Through benchmarking on different models, our study uncovers significant persona biases in dialogue systems. Our findings underscore the pressing need to revisit the use of personas in dialogue agents to ensure safe application.
×
This is where the content will go.