A summarization approach to evaluating audio captioning

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review



Audio captioning is currently evaluated with metrics originating from machine translation and image captioning, but their suitability for audio has recently been questioned. This work proposes content-based scoring of audio captions, an approach that considers the specific sound event content of the captions. Inspired by text summarization, the proposed measure assigns relevance scores to the sound events present in the reference captions, and scores candidates based on the relevance of the sounds they retrieve. In this work we use a simple, consensus-based definition of relevance, but different weighting schemes can easily be incorporated to change the importance of terms. Our experiments use two datasets and three different audio captioning systems and show that the proposed measure behaves consistently with the data: captions that correctly capture the most relevant sounds obtain a score of 1, while those containing less relevant sounds score lower. Although the proposed content-based score is not concerned with the fluency or semantic content of the captions, it can be incorporated into a compound metric, much as SPIDEr is a linear combination of a semantic score and a syntactic fluency score.
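The abstract does not give the scoring formula, so the following is only a minimal sketch of one possible reading, in the spirit of pyramid-style summarization evaluation. It assumes that sound events have already been extracted from each caption as sets of labels, that consensus relevance is the fraction of reference captions mentioning an event, and that a candidate is normalized against the best total reachable with the same number of retrieved events. The function names `relevance_weights` and `content_score` are hypothetical, and the paper's actual definition may differ.

```python
from collections import Counter


def relevance_weights(reference_events):
    """Assumed consensus relevance: share of reference captions mentioning each event."""
    n_refs = len(reference_events)
    # Count each event at most once per reference caption.
    counts = Counter(event for events in reference_events for event in set(events))
    return {event: count / n_refs for event, count in counts.items()}


def content_score(candidate_events, reference_events):
    """Score a candidate by the relevance of the reference events it retrieves.

    Assumption: events absent from every reference are ignored, not penalized.
    """
    weights = relevance_weights(reference_events)
    retrieved = set(candidate_events) & set(weights)
    if not retrieved:
        return 0.0
    achieved = sum(weights[event] for event in retrieved)
    # Best total any caption could reach with the same number of retrieved events.
    ideal = sum(sorted(weights.values(), reverse=True)[: len(retrieved)])
    return achieved / ideal


# Hypothetical example: three reference captions for one clip.
references = [{"dog", "rain"}, {"dog"}, {"dog", "car"}]
score_top = content_score({"dog"}, references)     # retrieves the consensus event
score_minor = content_score({"rain"}, references)  # retrieves a less relevant event
score_none = content_score({"bird"}, references)   # retrieves nothing from the references
```

Under these assumptions, a candidate that names the event all annotators agree on reaches the maximum score, while one that names only an event mentioned by a single annotator scores lower, which matches the behavior the abstract describes.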
Original language: English
Title of host publication: Proceedings of the 7th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2022)
Editors: Mathieu Lagrange, Annamaria Mesaros, Thomas Pellegrini, Gaël Richard, Romain Serizel, Dan Stowell
ISBN (Electronic): 978-952-03-2677-7
Publication status: Published - 3 Nov 2022
Publication type: A4 Article in conference proceedings
Event: Workshop on Detection and Classification of Acoustic Scenes and Events - Nancy, France
Duration: 3 Nov 2022 → 4 Nov 2022


Conference: Workshop on Detection and Classification of Acoustic Scenes and Events
Abbreviated title: DCASE

Publication forum classification

  • Publication forum level 1

