Abstract
This work introduces a guided captioning system that aims to produce captions focused on different audio content, depending on a guiding text. We show that keyword guidance results in more diverse captions, even though the usual captioning metrics do not reflect this. We design a system that can be trained using keywords automatically extracted from reference annotations and that is provided with one keyword at test time. When trained with 5 keywords, the system produces captions that contain the exact guidance keyword 70% of the time and yield over 3600 unique sentences on the Clotho dataset. In contrast, a baseline without any keywords produces 700 unique captions on the same test set.
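To make the training setup concrete, the sketch below shows one plausible way to mine guidance keywords from a reference caption and inject a keyword as a decoder prompt. This is an illustrative assumption, not the paper's implementation: the function names, the stopword list, the frequency-based selection, and the `<kw>` separator token are all hypothetical.

```python
# Minimal sketch (not the authors' code): extract candidate guidance
# keywords from a reference caption and prepend one to the target
# sequence, so a captioning decoder can condition on it.
import re
from collections import Counter

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "and", "in", "on", "while", "with"}

def extract_keywords(caption: str, k: int = 5) -> list[str]:
    """Pick up to k content words from a reference caption by frequency."""
    words = re.findall(r"[a-z]+", caption.lower())
    content = [w for w in words if w not in STOPWORDS and len(w) > 2]
    return [w for w, _ in Counter(content).most_common(k)]

def make_guided_target(caption: str, keyword: str) -> str:
    """Prepend the guidance keyword as a prompt, using an assumed
    "<kw>" separator token between keyword and caption."""
    return f"{keyword} <kw> {caption}"

caption = "a dog barks loudly while children play in the park"
for kw in extract_keywords(caption):
    print(make_guided_target(caption, kw))
```

At test time, the same prompt format would carry the single user-supplied keyword, which is consistent with the setup the abstract describes.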
Original language | English |
---|---|
Title of host publication | Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024) |
Publisher | DCASE |
Pages | 71-75 |
ISBN (Electronic) | 978-952-03-3171-9 |
Publication status | Published - 2024 |
Publication type | A4 Article in conference proceedings |
Event | Workshop on Detection and Classification of Acoustic Scenes and Events, Tokyo, Japan |
Duration | 23 Oct 2024 → 25 Oct 2024 |
Internet address | https://dcase.community/workshop2024/ |
Workshop
Workshop | Workshop on Detection and Classification of Acoustic Scenes and Events |
---|---|
Abbreviated title | DCASE2024 |
Country/Territory | Japan |
City | Tokyo |
Period | 23/10/24 → 25/10/24 |
Internet address | https://dcase.community/workshop2024/ |
Publication forum classification
- Publication forum level 1