WaveTransformer: An Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information

An Tran, Konstantinos Drossos, Tuomas Virtanen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

5 Citations (Scopus)

Abstract

Automated audio captioning (AAC) is a novel task in which a method takes an audio sample as input and outputs a textual description (i.e., a caption) of its contents. Most AAC methods are adapted from the fields of image captioning or machine translation. In this work, we present a novel AAC method explicitly focused on exploiting the temporal and time-frequency patterns in audio. We employ three learnable processes for audio encoding: two extract the temporal and time-frequency information, respectively, and one merges the outputs of the previous two. To generate the caption, we employ the widely used Transformer decoder. We assess our method using the freely available splits of the Clotho dataset. Our results raise the highest previously reported SPIDEr score from 16.2 to 17.3 (higher is better).
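
The three-process encoder described in the abstract can be pictured with a small data-flow sketch. The code below is only an illustration under stated assumptions, not the authors' implementation: the function names, the causal 1D filter for the temporal process, the 2D local filter for the time-frequency process, and the weighted sum used for merging are all hypothetical stand-ins for the learnable processes. The Transformer decoder that would consume the merged output is not sketched.

```python
from typing import List

# Input features: T time frames, each with F frequency bins (e.g. a spectrogram).
Matrix = List[List[float]]


def temporal_branch(x: Matrix, kernel: List[float]) -> Matrix:
    """Hypothetical temporal process: causal 1D filter along the time axis.

    kernel[k] weights the frame k steps in the past; frames before the start
    are treated as zeros. Each frequency bin is filtered independently.
    """
    T, F = len(x), len(x[0])
    out = [[0.0] * F for _ in range(T)]
    for t in range(T):
        for f in range(F):
            out[t][f] = sum(
                w * x[t - k][f] for k, w in enumerate(kernel) if t - k >= 0
            )
    return out


def time_frequency_branch(x: Matrix, kernel: Matrix) -> Matrix:
    """Hypothetical time-frequency process: 2D filter over a local patch.

    Uses zero padding so the output keeps the (T, F) shape of the input.
    """
    T, F = len(x), len(x[0])
    kt, kf = len(kernel), len(kernel[0])
    pad_t, pad_f = kt // 2, kf // 2
    out = [[0.0] * F for _ in range(T)]
    for t in range(T):
        for f in range(F):
            acc = 0.0
            for i in range(kt):
                for j in range(kf):
                    tt, ff = t + i - pad_t, f + j - pad_f
                    if 0 <= tt < T and 0 <= ff < F:
                        acc += kernel[i][j] * x[tt][ff]
            out[t][f] = acc
    return out


def merge(a: Matrix, b: Matrix, w_a: float, w_b: float) -> Matrix:
    """Hypothetical merging process: element-wise weighted sum of both outputs."""
    return [
        [w_a * va + w_b * vb for va, vb in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


def encode(x: Matrix, k_time: List[float], k_tf: Matrix,
           w_a: float, w_b: float) -> Matrix:
    """Run both branches on the same input and merge their outputs."""
    return merge(temporal_branch(x, k_time), time_frequency_branch(x, k_tf),
                 w_a, w_b)
```

In a trained model the kernel values and merge weights would be learned parameters; here they are plain arguments so the data flow (two parallel branches over the same input, followed by a merge) stays visible.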
Original language: English
Title of host publication: 2021 29th European Signal Processing Conference (EUSIPCO)
Publisher: IEEE
Pages: 576-580
Number of pages: 5
ISBN (Electronic): 978-9-0827-9706-0
Publication status: Published - 2021
Publication type: A4 Article in conference proceedings
Event: European Signal Processing Conference - Dublin, Ireland
Duration: 23 Aug 2021 → 27 Aug 2021

Publication series

Name: European Signal Processing Conference
ISSN (Electronic): 2076-1465

Conference

Conference: European Signal Processing Conference
Abbreviated title: EUSIPCO 2021
Country/Territory: Ireland
City: Dublin
Period: 23/08/21 → 27/08/21

Keywords

  • Measurement
  • Time-frequency analysis
  • Neural networks
  • Europe
  • Transformers
  • Encoding
  • Decoding
  • automated audio captioning
  • wavetransformer
  • wavenet
  • transformer

Publication forum classification

  • Publication forum level 1
