Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning

Khoa Nguyen, Konstantinos Drossos, Tuomas Virtanen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. However, the textual description is considerably shorter than the audio signal, for example 10 words versus some thousands of audio feature vectors. This clearly indicates that an output word corresponds to multiple input feature vectors. In this work we present an approach that explicitly takes advantage of this difference in length between the sequences by applying temporal sub-sampling to the audio input sequence. We employ a sequence-to-sequence method, which uses a fixed-length vector as the output of the encoder, and we apply temporal sub-sampling between the RNNs of the encoder. We evaluate the benefit of our approach on the freely available Clotho dataset and assess the impact of different temporal sub-sampling factors. Our results show an improvement in all considered metrics.
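
The idea of sub-sampling between the encoder's recurrent layers can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the moving-average stand-in for an RNN layer, the choice of keeping every `factor`-th time step, and the use of the last time step as the fixed-length summary are all assumptions made for illustration only.

```python
from typing import List, Sequence

Vector = List[float]


def subsample(sequence: Sequence[Vector], factor: int) -> List[Vector]:
    """Keep every `factor`-th time step of a feature sequence (assumed scheme)."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return list(sequence[::factor])


def toy_recurrent_layer(sequence: Sequence[Vector], decay: float = 0.5) -> List[Vector]:
    """Hypothetical stand-in for an RNN layer: an exponential moving average
    over time. The output has the same number of time steps as the input."""
    if not sequence:
        return []
    state = [0.0] * len(sequence[0])
    outputs = []
    for frame in sequence:
        # Each new state mixes the previous state with the current frame.
        state = [decay * s + (1.0 - decay) * x for s, x in zip(state, frame)]
        outputs.append(state)
    return outputs


def encode(sequence: Sequence[Vector], num_layers: int = 3, factor: int = 2) -> List[Vector]:
    """Stack recurrent layers, shortening the sequence between consecutive layers."""
    hidden = toy_recurrent_layer(sequence)
    for _ in range(num_layers - 1):
        hidden = subsample(hidden, factor)    # reduce the temporal resolution
        hidden = toy_recurrent_layer(hidden)  # process the shorter sequence
    return hidden


if __name__ == "__main__":
    frames = [[float(t), 1.0] for t in range(1000)]  # 1000 toy feature vectors
    encoded = encode(frames, num_layers=3, factor=2)
    print(len(frames), "->", len(encoded))           # sequence length before and after
    summary = encoded[-1]                            # assumed fixed-length summary vector
    print(len(summary))
```

With three layers and a factor of 2, two sub-sampling steps are applied, so the sequence is shortened twice before the final layer, bringing its length closer to that of a short caption.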
Original language: English
Title of host publication: Proceedings of the Fifth Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2020)
Editors: Nobutaka Ono, Noboru Harada, Yohei Kawaguchi, Annamaria Mesaros, Keisuke Imoto, Yuma Koizumi, Tatsuya Komatsu
Pages: 110-114
ISBN (Electronic): 978-4-600-00566-5
Publication status: Published - 2020
Publication type: A4 Article in a conference publication
Event: Workshop on Detection and Classification of Acoustic Scenes and Events - Tokyo, Japan
Duration: 2 Nov 2020 - 3 Nov 2020
http://dcase.community/workshop2020/

Workshop

Workshop: Workshop on Detection and Classification of Acoustic Scenes and Events
Abbreviated title: DCASE 2020
Country/Territory: Japan
City: Tokyo
Period: 2/11/20 - 3/11/20

Keywords

  • audio captioning
  • recurrent neural networks
  • temporal sub-sampling
  • hierarchical sub-sampling networks

Publication forum classification

  • Publication forum level 0
