Abstract
Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. However, the textual description is considerably shorter than the audio signal, for example 10 words versus some thousands of audio feature vectors. This clearly indicates that an output word corresponds to multiple input feature vectors. In this work we present an approach that explicitly takes advantage of this difference in length between the sequences, by applying temporal sub-sampling to the audio input sequence. We employ a sequence-to-sequence method whose encoder outputs a fixed-length vector, and we apply temporal sub-sampling between the RNNs of the encoder. We evaluate the benefit of our approach on the freely available Clotho dataset and assess the impact of different temporal sub-sampling factors. Our results show an improvement in all considered metrics.
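The core idea of the abstract, shrinking the time axis between encoder layers so that the encoded sequence length moves closer to the caption length, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three-layer encoder, the sub-sampling factor of 2, and the input length of 2000 frames are all assumptions chosen for the example.

```python
import numpy as np

def temporal_subsample(seq, factor=2):
    """Keep every `factor`-th time step of a (T, F) feature sequence."""
    return seq[::factor]

# Toy input: 2000 audio feature vectors of 64 dimensions
# (e.g. log-mel spectrogram frames of a ~20 s clip).
x = np.zeros((2000, 64))

# Hypothetical 3-layer RNN encoder: each layer's output sequence is
# sub-sampled before being fed to the next layer, halving its length.
for _ in range(3):
    # (a real encoder would transform the features with an RNN here;
    #  we only track how the sequence length shrinks)
    x = temporal_subsample(x, factor=2)

print(x.shape)  # (250, 64): 2000 -> 1000 -> 500 -> 250 time steps
```

With three halvings the encoder processes 8x fewer steps in its final layer, so each remaining step summarizes a longer stretch of audio, which is closer to the granularity of a single caption word.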
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the Fifth Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2020) |
| Editors | Nobutaka Ono, Noboru Harada, Yohei Kawaguchi, Annamaria Mesaros, Keisuke Imoto, Yuma Koizumi, Tatsuya Komatsu |
| Publisher | Tokyo Metropolitan University |
| Pages | 110-114 |
| ISBN (Electronic) | 978-4-600-00566-5 |
| Publication status | Published - 2020 |
| Publication type | A4 Article in conference proceedings |
| Event | Workshop on Detection and Classification of Acoustic Scenes and Events, Tokyo, Japan, 2 Nov 2020 → 3 Nov 2020, http://dcase.community/workshop2020/ |
Workshop
| Workshop | Workshop on Detection and Classification of Acoustic Scenes and Events |
|---|---|
| Abbreviated title | DCASE 2020 |
| Country/Territory | Japan |
| City | Tokyo |
| Period | 2/11/20 → 3/11/20 |
| Internet address | http://dcase.community/workshop2020/ |
Keywords
- audio captioning
- recurrent neural networks
- temporal sub-sampling
- hierarchical sub-sampling networks
Publication forum classification
- Publication forum level 0
Activities
- Sequence Temporal Sub-Sampling for Automated Audio Captioning
  Drosos, K. (Examiner)
  2020, Activity: Evaluation, examination and supervision › Supervisor of bachelor student