Experimental Design of Extractive Question-Answering Systems: Influence of Error Scores and Answer Length

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Question-answering (QA) systems are becoming increasingly important because they enable human-computer communication in natural language. In recent years, significant progress has been made with transformer-based models that combine deep learning with large amounts of text data. However, QA systems remain challenging because of their complexity, which is rooted in the ambiguity and flexibility of natural language. This makes even their evaluation a formidable task. In this study, we therefore focus on the evaluation of extractive question-answering (EQA) systems by conducting a large-scale analysis of DistilBERT using benchmark data from the Stanford Question Answering Dataset (SQuAD). The paper has four main objectives. First, we study the influence of answer length on performance and demonstrate that the two are inversely correlated. Second, we study differences among exact match (EM) measures, because several definitions are in common use in the literature. We find that, although all of these measures are named "exact match", they actually differ from one another. Third, we study the practical relevance of these different definitions: because the meaning of "exact match" in the literature is ambiguous, it is often unclear whether reported improvements are genuine or only due to a change in the exact match measure. Importantly, our results show that the differences between differently defined EM measures are of the same order of magnitude as the differences reported in the literature. This raises concerns about the robustness of reported results. Fourth, we provide guidelines to improve the experimental design of general EQA studies, aiming to enhance performance evaluation and minimize the potential for spurious results.
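
To illustrate how two measures that are both called "exact match" can disagree, the following is a minimal Python sketch, not the paper's implementation. It contrasts a strict string comparison with a normalized comparison in the style commonly used for SQuAD evaluation (lowercasing, removing punctuation and English articles, collapsing whitespace). The function names (strict_em, normalize, normalized_em, any_gold_em) and the toy prediction/answer pairs are assumptions made for illustration only; the abstract does not state which EM definitions the study compares.

```python
import re
import string


def strict_em(prediction: str, gold: str) -> bool:
    """Exact match on the raw strings, with no normalization (hypothetical variant)."""
    return prediction == gold


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def normalized_em(prediction: str, gold: str) -> bool:
    """Exact match after normalization (SQuAD-style variant, assumed here)."""
    return normalize(prediction) == normalize(gold)


def any_gold_em(prediction, golds, match=normalized_em):
    """Count a hit if the prediction matches at least one reference answer."""
    return any(match(prediction, g) for g in golds)


# Toy (prediction, gold answer) pairs, invented for illustration.
pairs = [
    ("The Eiffel Tower", "Eiffel Tower"),
    ("1889", "1889"),
    ("Paris, France", "Paris"),
]

strict_score = sum(strict_em(p, g) for p, g in pairs) / len(pairs)
normalized_score = sum(normalized_em(p, g) for p, g in pairs) / len(pairs)

print(f"strict EM:     {strict_score:.3f}")
print(f"normalized EM: {normalized_score:.3f}")
```

On these toy pairs the strict variant counts one match out of three and the normalized variant counts two out of three, so the same predictions receive different "exact match" scores depending only on the definition. Reporting which definition was used is therefore part of a reproducible experimental design.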

Original language: English
Pages (from-to): 87-125
Number of pages: 39
Journal: Journal of Artificial Intelligence Research
Volume: 80
Publication status: Published - 2024
Publication type: A1 Journal article-refereed

Publication forum classification

  • Publication forum level 3

ASJC Scopus subject areas

  • Artificial Intelligence
