Audio-visual scene classification: analysis of DCASE 2021 Challenge submissions

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review


Abstract

This paper presents the details of the Audio-Visual Scene Classification task in the DCASE 2021 Challenge (Task 1 Subtask B). The task concerns scene classification using audio and video modalities, based on a dataset of synchronized recordings. The task attracted 43 submissions from 13 different teams around the world. More than half of the submitted systems outperformed the baseline. The techniques common to the top systems are the use of large pretrained models, such as ResNet or EfficientNet, which are then trained for the task-specific problem; fine-tuning, transfer learning, and data augmentation are also employed to boost performance. Most importantly, multi-modal methods using both audio and video are employed by all of the top five teams. The best-performing system achieved a log loss of 0.195 and an accuracy of 93.8%, compared to the baseline system's log loss of 0.662 and accuracy of 77.1%.
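To illustrate the kind of multi-modal approach described in the abstract, the following is a minimal late-fusion sketch in PyTorch: a pretrained visual backbone and a small audio branch over log-mel spectrograms, fused by concatenation before a linear classifier. The architecture, layer sizes, input shapes, and the 10-class assumption are illustrative only and do not correspond to the challenge baseline or any submitted system.

```python
# Illustrative audio-visual late-fusion sketch (not the baseline or any submission).
import torch
import torch.nn as nn
from torchvision.models import resnet18


class AudioVisualFusion(nn.Module):
    def __init__(self, num_classes: int = 10):  # 10 classes assumed for illustration
        super().__init__()
        # Visual branch: ImageNet-pretrained ResNet-18 with the classifier head removed.
        self.visual = resnet18(weights="IMAGENET1K_V1")
        self.visual.fc = nn.Identity()  # yields a 512-dim embedding
        # Audio branch: small CNN over log-mel spectrograms shaped (B, 1, mels, frames).
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 512), nn.ReLU(),
        )
        # Late fusion: concatenate the two embeddings, then classify.
        self.classifier = nn.Linear(512 + 512, num_classes)

    def forward(self, video_frame, log_mel):
        v = self.visual(video_frame)  # (B, 512)
        a = self.audio(log_mel)       # (B, 512)
        return self.classifier(torch.cat([v, a], dim=1))


if __name__ == "__main__":
    model = AudioVisualFusion()
    logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 1, 64, 500))
    # Cross-entropy on logits corresponds to the multiclass log loss used for ranking.
    loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 3]))
    print(logits.shape, loss.item())
```

In practice the top-ranked systems also relied on fine-tuning, transfer learning, and data augmentation on top of such pretrained backbones, as noted in the abstract.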
Original language: English
Title of host publication: Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2021)
Editors: Frederic Font, Annamaria Mesaros, Daniel P.W. Ellis, Eduardo Fonseca, Magdalena Fuentes, Benjamin Elizalde
Publisher: DCASE
Pages: 45-49
ISBN (Electronic): 978-84-09-36072-7
Publication status: Published - 15 Nov 2021
Publication type: A4 Article in conference proceedings
Event: Detection and Classification of Acoustic Scenes and Events, Spain
Duration: 15 Nov 2021 – 19 Nov 2021

Conference

Conference: Detection and Classification of Acoustic Scenes and Events
Country/Territory: Spain
Period: 15/11/21 – 19/11/21

Publication forum classification

  • Publication forum level 0

