TY - JOUR
T1 - Active Learning for Sound Event Detection
AU - Zhao, Shuyang
AU - Heittola, Toni
AU - Virtanen, Tuomas
N1 - Funding Information:
Manuscript received February 12, 2020; revised July 3, 2020 and August 6, 2020; accepted September 3, 2020. Date of publication October 8, 2020; date of current version November 5, 2020. This work was supported by the European Research Council under the European Union's H2020 Framework Programme through ERC Grant Agreement 637422 EVERYSOUND. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Isabel Barbancho. (Corresponding author: Shuyang Zhao.) The authors are with the Faculty of Information Technology and Communication Sciences, Tampere University, 33720 Tampere, Finland (e-mail: shuyang.zhao@tuni.fi; toni.heittola@tuni.fi; tuomas.virtanen@tuni.fi). Digital Object Identifier 10.1109/TASLP.2020.3029652
Publisher Copyright:
© 2020 IEEE.
PY - 2020
Y1 - 2020
AB - This article proposes an active learning system for sound event detection (SED). It aims at maximizing the accuracy of a learned SED model with limited annotation effort. The proposed system analyzes an initially unlabeled audio dataset, from which it selects sound segments for manual annotation. The candidate segments are generated based on a proposed change point detection approach, and the selection is based on the principle of mismatch-first farthest-traversal. During the training of SED models, recordings are used as training inputs, preserving the long-term context for annotated segments. The proposed system clearly outperforms reference methods on the two datasets used for evaluation (TUT Rare Sound 2017 and TAU Spatial Sound 2019). Training with recordings as context outperforms training with only annotated segments. Mismatch-first farthest-traversal outperforms reference sample selection methods based on random sampling and uncertainty sampling. Remarkably, the required annotation effort can be greatly reduced on the dataset where target sound events are rare: by annotating only 2% of the training data, the achieved SED performance is similar to that obtained by annotating all the training data.
KW - Active learning
KW - change point detection
KW - mismatch-first farthest-traversal
KW - sound event detection
KW - weakly supervised learning
U2 - 10.1109/TASLP.2020.3029652
DO - 10.1109/TASLP.2020.3029652
M3 - Article
AN - SCOPUS:85096362215
SN - 2329-9290
VL - 28
SP - 2895
EP - 2905
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
ER -