Beyond the Answers: Reviewing the Rationality of Multiple Choice Question Answering for the Evaluation of Large Language Models

This post is part of a series on LLMs and is a translation of "Beyond the Answers: Reviewing the Rationality of Multiple Choice Question Answering for the Evaluation of Large Language Models".

Beyond the Answers: Examining the Rationality of Multiple Choice Questions in the Evaluation of Large Language Models

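Abstract
1 Introduction
2 Related Work
3 Can Accuracy on MCQA-Format Tasks Reflect a Model's True Capability?
4 More Correct, but Not the Only Correct
5 LLMs Mostly Learn from What Is Correct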

Abstract

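In natural language processing (NLP), large language models (LLMs) have driven a paradigm shift and markedly improved performance on natural language generation tasks. Despite this progress, comprehensive evaluation of LLMs remains an unavoidable challenge for the community. This study examines whether multiple choice question answering (MCQA) is a rational method for evaluating LLMs. If LLMs truly understand the semantics of a question, their performance should be consistent across the different configurations derived from that same question. Contrary to this expectation, our empirical results show a notable disparity in the consistency of LLM responses, which we define as the REsponse VAriability Syndrome (REVAS) of LLMs. This finding indicates that MCQA-based benchmarks may not adequately capture the true capabilities of LLMs, and it underscores the need for more robust mechanisms for evaluating LLM performance.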
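To make the consistency idea concrete, here is a minimal Python sketch. It is not the paper's protocol: it assumes that a "configuration" means a reordering of the answer options, and it scores consistency as the share of orderings on which the model gives its most frequent answer. The names `consistency`, `option_orderings`, `first_option_model`, and `content_model` are hypothetical, and the two toy "models" stand in for a real LLM call.

```python
import itertools
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical signature for a model under test: it receives the question and
# the options in the order presented, and returns the text of its chosen option.
AnswerFn = Callable[[str, List[str]], str]


def option_orderings(options: Sequence[str], limit: Optional[int] = None) -> List[Tuple[str, ...]]:
    """Return up to `limit` orderings of the options (all of them if limit is None)."""
    return list(itertools.islice(itertools.permutations(options), limit))


def consistency(question: str, options: Sequence[str], answer_fn: AnswerFn,
                limit: Optional[int] = None) -> float:
    """Share of orderings on which the model returns its most frequent answer.

    1.0 means the choice never changes when the options are reordered.
    """
    picks = [answer_fn(question, list(order)) for order in option_orderings(options, limit)]
    most_common_count = Counter(picks).most_common(1)[0][1]
    return most_common_count / len(picks)


def first_option_model(question: str, options: List[str]) -> str:
    """Toy position-biased model: always picks whatever is listed first."""
    return options[0]


def content_model(question: str, options: List[str]) -> str:
    """Toy content-driven model: picks 'Paris' wherever it appears."""
    return "Paris" if "Paris" in options else options[0]


if __name__ == "__main__":
    q = "What is the capital of France?"
    opts = ["Paris", "Rome", "Berlin"]
    print("content-driven:", consistency(q, opts, content_model))
    print("position-biased:", consistency(q, opts, first_option_model))
```

A model that answers from the content of the options scores 1.0 under this measure, while the position-biased toy model scores lower because its answer follows the option order rather than the question.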

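1 Introduction

2 Related Work

3 Can Accuracy on MCQA-Format Tasks Reflect a Model's True Capability?

4 More Correct, but Not the Only Correct

5 LLMs Mostly Learn from What Is Correct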
