Abstract
Describing audio content is a complex task for an annotator; the resulting caption depends on the annotator’s language, culture and expertise. In addition, physiological factors like vision impairment may affect how the sound is perceived and interpreted. In this work, we explore bilingual audio captioning in Finnish and English. In connection with this study, we release the SiVi-CAFE dataset, a small-size dataset of Sighted and Visually-impaired Captions for Audio in Finnish and English, with a collection of parallel annotations for the same clips. We briefly analyze the differences between captions produced by sighted and visually-impaired annotators, and train a system to produce captions in both languages that also mimics the style of different annotator groups. The system obtains CIDEr scores of 34.75% and 28.75% on the English and Finnish datasets, respectively. Furthermore, the system is able to perform a tagging task, obtaining an F-score of 79.73%.
Original language | English |
---|---|
Title of host publication | Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024) |
Publisher | DCASE |
Pages | 76-80 |
ISBN (Electronic) | 978-952-03-3171-9 |
Publication status | Published - 2024 |
Publication type | A4 Article in conference proceedings |
Event | Workshop on Detection and Classification of Acoustic Scenes and Events - Tokyo, Japan. Duration: 23 Oct 2024 → 25 Oct 2024. https://dcase.community/workshop2024/ |
Workshop
Workshop | Workshop on Detection and Classification of Acoustic Scenes and Events |
---|---|
Abbreviated title | DCASE2024 |
Country/Territory | Japan |
City | Tokyo |
Period | 23/10/24 → 25/10/24 |
Internet address |
Publication forum classification
- Publication forum level 1