A curated dataset of urban scenes for audio-visual scene analysis

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Scientific; peer-review

Abstract

This paper introduces a curated dataset of urban scenes for audio-visual scene analysis, consisting of carefully selected and recorded material. The data was recorded in multiple European cities, using the same equipment, in multiple locations for each scene, and is openly available. We also present a case study for audio-visual scene recognition and show that joint modeling of the audio and visual modalities brings a significant performance gain compared to state-of-the-art unimodal systems. Our approach obtained an accuracy of 84.8%, compared to 75.8% for the audio-only and 68.4% for the video-only equivalent systems.

Original language: English
Title of host publication: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Publisher: IEEE
Pages: 626-630
Number of pages: 5
ISBN (Electronic): 978-1-7281-7605-5
Publication status: Published - 2021
Publication type: A4 Article in conference proceedings
Event: IEEE International Conference on Acoustics, Speech and Signal Processing - Metro Toronto Convention Centre, Toronto, Canada
Duration: 6 Jun 2021 to 11 Jun 2021
https://2021.ieeeicassp.org

Publication series

Name: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing
ISSN (Print): 1520-6149

Conference

Conference: IEEE International Conference on Acoustics, Speech and Signal Processing
Country/Territory: Canada
City: Toronto
Period: 6/06/21 to 11/06/21

Keywords

  • Acoustic scene
  • Audio-visual data
  • Pattern recognition
  • Scene analysis
  • Transfer learning

Publication forum classification

  • Publication forum level 1

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering
