Auditing Large Language Models for Enhanced Text-Based Stereotype Detection and Probing-Based Bias Evaluation

This post belongs to the LLM series and provides a detailed walkthrough and translation of the paper "Auditing Large Language Models for Enhanced Text-Based Stereotype Detection and Probing-Based Bias Evaluation".

The paper audits large language models with the goal of strengthening text-based stereotype detection and examining the presence of bias through probing.

  • Abstract
  • 1 Introduction
  • 2 Related Work
  • 3 Method
  • 4 Results and Discussion
  • 5 Conclusion and Future Work

Abstract

Recent advances in large language models (LLMs) have significantly expanded their presence in human-centric artificial intelligence applications. However, LLMs may inadvertently replicate or exacerbate biases inherent in their training data. This study introduces the Multi-Grain Stereotype (MGS) dataset, which contains 51,867 instances encompassing gender, race, occupation, religion, and stereotypical text, compiled by integrating multiple publicly available stereotype detection datasets. We explore various machine learning approaches to establish baselines for stereotype detection and fine-tune several architectures and configurations of language models, producing a series of stereotype detection classifiers trained on MGS that analyze English text for stereotypical content. To ensure the detectors align with human common sense, we apply explainable AI tools such as SHAP, LIME, and BertViz and analyze a range of example cases to assess consistency. Additionally, we craft stereotype-eliciting prompts and use our best-performing detector to evaluate the text generated by popular LLMs for stereotypical content. The experiments yield several key findings: first, training classifiers across multiple stereotype dimensions achieves better results than one-dimensional training; second, the combined MGS dataset improves both within-dataset and cross-dataset generalization compared with single-dataset training; third, newer versions of the GPT family produce stereotypical content less often.
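
As a rough illustration of the detection-and-explanation workflow described in the abstract, the sketch below runs a text-classification model over one sentence and uses LIME to attribute the prediction to individual tokens, in the spirit of the paper's SHAP/LIME analysis. This is not the authors' released code: the checkpoint is a generic public sentiment model standing in for a detector fine-tuned on MGS, and the example sentence and sampling budget are arbitrary.

```python
# Minimal, illustrative sketch (not the authors' released code):
# classify one sentence and explain the prediction with LIME.
import numpy as np
from transformers import pipeline
from lime.lime_text import LimeTextExplainer

clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # stand-in for an MGS-fine-tuned detector
    top_k=None,  # return scores for every class, not just the top one
)
class_names = list(clf.model.config.id2label.values())

def predict_proba(texts):
    """Return an (n_samples, n_classes) probability matrix, as LIME expects."""
    outputs = clf(list(texts))
    matrix = []
    for scores in outputs:
        by_label = {s["label"]: s["score"] for s in scores}
        matrix.append([by_label[name] for name in class_names])
    return np.array(matrix)

explainer = LimeTextExplainer(class_names=class_names)
explanation = explainer.explain_instance(
    "Example sentence to audit for stereotypical content.",  # placeholder input
    predict_proba,
    num_features=6,
    num_samples=500,  # small perturbation budget to keep the demo fast
)
print(explanation.as_list())  # tokens most responsible for the prediction
```

With an MGS-fine-tuned checkpoint swapped in, the class names would correspond to stereotype dimensions such as gender, race, occupation, and religion rather than sentiment labels.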

1 Introduction

2 Related Work

3 Method

4 Results and Discussion

5 Conclusion and Future Work

In summary, we have laid the groundwork for a framework that audits bias in LLMs through text-based stereotype classification. Using the MGS dataset and fine-tuned PLMs, our approach surpasses the proposed baselines and shows both that classifiers trained across multiple stereotype dimensions outperform single-dimension classifiers and that training on the combined MGS dataset outperforms training on any single stereotype dataset. To validate the decisions made by our models, we applied XAI techniques such as SHAP, LIME, and BertViz. The benchmarking results further confirm the reduction of bias in newer versions of the GPT series.
For future work, we first aim to explore multi-label datasets and model development to detect overlapping stereotypes and to assess their synergistic effect on detection performance, moving beyond the current multi-class approach. Second, we plan to expand the stereotype categories, for example to LGBTQ+ (WinoQueer) and regional stereotypes. Finally, inspired by token-level hallucination detection, we will explore token-level stereotype detection to improve the granularity of the analysis.
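
To make the probing-based evaluation more concrete, here is a hedged sketch of the loop it implies: stereotype-eliciting prompts are sent to an LLM under audit, and each completion is scored by the text-based detector. The prompts, the audited model name, and the stand-in detector checkpoint are illustrative assumptions, not the paper's benchmark setup.

```python
# Hedged sketch of the probing-based bias evaluation (not the paper's benchmark):
# send stereotype-eliciting prompts to an LLM under audit and score each
# completion with the text-based detector.
from openai import OpenAI
from transformers import pipeline

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Stand-in checkpoint; in practice this would be the MGS-fine-tuned detector.
detector = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# Open-ended prompts intended to elicit stereotypical continuations (illustrative).
prompts = [
    "Complete the sentence: nurses are usually",
    "Complete the sentence: engineers are usually",
]

for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model under audit
        messages=[{"role": "user", "content": prompt}],
    )
    completion = response.choices[0].message.content
    prediction = detector(completion)[0]  # top label and confidence for the completion
    print(f"{prompt!r} -> {prediction['label']} ({prediction['score']:.2f})")
```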
