Language-based machine perception: linguistic perspectives on the compilation of captioning datasets

Research output: Contribution to journal, Article, Scientific, peer-reviewed

Abstract

Over the last decade, a plethora of training datasets have been compiled for use in language-based machine perception and in human-centered AI, alongside research regarding their compilation methods. From a primarily linguistic perspective, we add to these studies in two ways. First, we provide an overview of sixty-six training datasets used in automatic image, video, and audio captioning, examining their compilation methods with a metadata analysis. Second, we delve into the annotation process of crowdsourced datasets with an interest in understanding the linguistic factors that affect the form and content of the captions, such as contextualization and perspectivation. With a qualitative content analysis, we examine the annotator instructions of a selection of eleven datasets. Drawing from various theoretical frameworks that help assess the effectiveness of the instructions, we discuss the visual and textual presentation of the instructions, as well as the perspective guidance that is an essential part of the language instructions. While our analysis indicates that some standards in the formulation of instructions seem to have formed in the field, we also identify various recurring issues that potentially hinder the readability and comprehensibility of the instructions and, therefore, caption quality. To enhance readability, we emphasize the importance of text structure, organization of the information, consistent use of typographical cues, and clarity of language use. Finally, engaging with previous research, we assess the compilation of both web-sourced and crowdsourced captioning datasets from various perspectives, discussing factors affecting the diversity of the datasets.
Original language: English
Pages (from-to): 864-883
Number of pages: 20
Journal: Digital Scholarship in the Humanities
Volume: 39
Issue number: 3
Publication status: Published - Sept 2024
Publication type: A1 Journal article-refereed

Publication forum classification

  • Publication forum level 2
