Abstract
Describing soundscapes in sentences allows better understanding of the acoustic scene than a single label indicating the acoustic scene class or a set of audio tags indicating the sound events active in the audio clip. In addition, the richness of natural language allows a range of possible descriptions for the same acoustic scene. In this work, we address the diversity obtained when collecting descriptions of soundscapes using crowdsourcing. We study how much the collection of audio captions can be guided by the instructions given in the annotation task, by analysing the possible bias introduced by auxiliary information provided in the annotation process. Our study shows that even when given hints on the audio content, different annotators describe the same soundscape using different vocabulary. In automatic captioning, hints provided as audio tags represent grounding textual information that facilitates guiding the captioning output towards specific concepts. We also release a new dataset of audio captions and audio tags produced by multiple annotators for a subset of the TAU Urban Acoustic Scenes 2018 dataset, suitable for studying guided captioning.
Original language | English |
---|---|
Title of host publication | Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021) |
Editors | Frederic Font, Annamaria Mesaros, Daniel P.W. Ellis, Eduardo Fonseca, Magdalena Fuentes, Benjamin Elizalde |
Publisher | DCASE |
Pages | 90-94 |
ISBN (Electronic) | 978-84-09-36072-7 |
Publication status | Published - 15 Nov 2021 |
Publication type | A4 Article in conference proceedings |
Event | Detection and Classification of Acoustic Scenes and Events, Spain. Duration: 15 Nov 2021 → 19 Nov 2021 |
Conference
Conference | Detection and Classification of Acoustic Scenes and Events |
---|---|
Country/Territory | Spain |
Period | 15/11/21 → 19/11/21 |
Publication forum classification
- Publication forum level 0
Fingerprint
Dive into the research topics of 'Diversity and bias in audio captioning datasets'. Together they form a unique fingerprint.

Datasets
- MACS - Multi-Annotator Captioned Soundscapes
  Morato, I. M. (Creator) & Mesaros, A. (Creator), Zenodo, 22 Jul 2021. Dataset.
- SiVi-CAFE dataset - Sighted and Visually-impaired Captions for Audio in Finnish and English
  Martin Morato, I. (Creator), Harju, M. (Creator), Hirvonen, M. (Creator) & Mesaros, A. (Creator), Zenodo, 6 Jun 2024. Dataset.