Guided Captioning of Audio

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review


Abstract

This work introduces a guided captioning system that aims to produce captions focused on different audio content, depending on a guiding text. We show that keyword guidance results in more diverse captions, even though the usual captioning metrics do not reflect this. We design a system that can be trained using keywords automatically extracted from reference annotations, and which is provided with one keyword at test time. When trained with 5 keywords, the produced captions contain the exact guidance keyword 70% of the time, and the system yields over 3600 unique sentences on the Clotho dataset. In contrast, a baseline without any keywords produces 700 unique captions on the same test set.
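As a rough illustration of the training setup the abstract describes, the sketch below shows one plausible way to extract guidance keywords from reference captions and to condition a caption decoder on a single keyword at test time. The frequency-based extraction rule, the stop-word list, the `<keyword>`/`<caption>` prompt tokens, and the helper names (`extract_keywords`, `guided_prompt`) are all illustrative assumptions; the paper states only that keywords are extracted automatically from reference annotations, without specifying the method.

```python
# Minimal sketch, assuming a simple frequency-based keyword extraction scheme.
# The extraction rule, stop-word list, top-5 cutoff, and prompt format are
# hypothetical choices for illustration, not the authors' actual method.
from collections import Counter
import re

STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "in", "on", "with"}

def extract_keywords(reference_captions, num_keywords=5):
    """Return the most frequent content words across the reference captions."""
    counts = Counter()
    for caption in reference_captions:
        for token in re.findall(r"[a-z]+", caption.lower()):
            if token not in STOP_WORDS:
                counts[token] += 1
    return [word for word, _ in counts.most_common(num_keywords)]

def guided_prompt(keyword):
    """Prefix the decoder input with one guidance keyword (illustrative format)."""
    return f"<keyword> {keyword} <caption>"

if __name__ == "__main__":
    refs = [
        "a dog barks loudly while cars pass by",
        "a dog is barking near a busy road",
        "barking dog with traffic noise in the background",
    ]
    keywords = extract_keywords(refs)   # e.g. ['dog', 'barking', ...]
    print(keywords)
    print(guided_prompt(keywords[0]))   # one keyword supplied at test time
```

At training time such keywords could be derived per audio clip from its reference captions, while at test time a single user-chosen keyword would steer the decoder toward different aspects of the same audio, which is consistent with the diversity gains the abstract reports.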
Original language: English
Title of host publication: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024)
Publisher: DCASE
Pages: 71-75
ISBN (Electronic): 978-952-03-3171-9
Publication status: Published - 2024
Publication type: A4 Article in conference proceedings
Event: Workshop on Detection and Classification of Acoustic Scenes and Events - Tokyo, Japan
Duration: 23 Oct 2024 – 25 Oct 2024
https://dcase.community/workshop2024/

Workshop

Workshop: Workshop on Detection and Classification of Acoustic Scenes and Events
Abbreviated title: DCASE2024
Country/Territory: Japan
City: Tokyo
Period: 23/10/24 – 25/10/24
Internet address: https://dcase.community/workshop2024/

Publication forum classification

  • Publication forum level 1
