Abstract
The objective of this thesis is to develop techniques that optimize the performance of sound event detection and classification systems at minimal supervision cost. State-of-the-art sound event detection and classification systems use acoustic models developed with machine learning techniques, and training such models typically relies on a large amount of labeled audio data. Manually assigning labels to audio data is often the most time-consuming part of model development. In many practical cases, unlabeled data is abundant but the number of annotations that can be made is limited. The practical problem is therefore to optimize the accuracy of acoustic models with a limited amount of annotations.
This thesis started from the idea of clustering unlabeled audio data: clustering itself requires no labeled data, and the resulting clusters can be used to propagate a single label assignment to many samples. Based on this idea, an active learning method was proposed and evaluated for sound classification. In the experiments, the proposed active learning method based on k-medoids clustering outperformed reference methods based on random sampling and uncertainty sampling. To optimize sample selection after the k medoids have been annotated, mismatch-first farthest-traversal was proposed, which further improved active learning performance in the experiments.
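As an illustration of the selection principle, the sketch below shows label propagation from annotated cluster medoids and a mismatch-first farthest-traversal selection step. This is a minimal interpretation rather than the thesis implementation: the scoring rule and names such as `features`, `medoid_idx`, and `predicted` are illustrative assumptions, and the medoids are assumed to come from a separate k-medoids clustering step.

```python
# Minimal sketch (not the thesis code) of label propagation from annotated
# medoids and mismatch-first farthest-traversal sample selection.
import numpy as np
from sklearn.metrics import pairwise_distances

def propagate_labels(features, medoid_idx, medoid_labels):
    """Assign every sample the label of its nearest annotated medoid."""
    d = pairwise_distances(features, features[medoid_idx])
    return np.asarray(medoid_labels)[d.argmin(axis=1)]

def mismatch_first_farthest_traversal(features, labeled_idx, propagated, predicted, budget):
    """Select `budget` samples: samples whose model prediction disagrees with
    the propagated label rank first, and distance to the already-labeled set
    breaks ties (farthest traversal)."""
    selected, labeled = [], list(labeled_idx)
    mismatch = (np.asarray(propagated) != np.asarray(predicted)).astype(float)
    for _ in range(budget):
        dist = pairwise_distances(features, features[labeled]).min(axis=1)
        # Mismatched samples always score above matched ones; within each
        # group, the sample farthest from the labeled set wins.
        score = mismatch * (dist.max() + 1.0) + dist
        score[labeled] = -np.inf          # never re-select annotated samples
        pick = int(score.argmax())
        selected.append(pick)
        labeled.append(pick)
    return selected
```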
The active learning method proposed for sound classification was extended to sound event detection. Sound segments were generated by change point detection within each recording, and the segments were selected for annotation using mismatch-first farthest-traversal. During the training of acoustic models, each recording was used as an input of a recurrent convolutional neural network, and the training loss was derived only from frames corresponding to annotated segments. In experiments on a dataset where sound events are rare, the proposed active learning method achieved accuracy similar to annotating all the training data while requiring annotation of only 2% of it.
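A sketch of the frame-masked training loss described above is given below, assuming a PyTorch-style setup. The tensor names and shapes are assumptions; only the idea that frames outside annotated segments are excluded from the loss comes from the text.

```python
# Minimal sketch (assumed details, not the thesis code) of a frame-masked loss:
# the whole recording is fed to the network, but only frames inside annotated
# segments contribute to the training loss.
import torch
import torch.nn.functional as F

def masked_frame_loss(frame_probs, frame_targets, annotated_mask):
    """frame_probs, frame_targets: (batch, frames, classes) with values in [0, 1];
    annotated_mask: (batch, frames), 1 where the frame lies in an annotated segment."""
    per_frame = F.binary_cross_entropy(frame_probs, frame_targets, reduction="none")
    mask = annotated_mask.float().unsqueeze(-1)        # broadcast over classes
    # Average the loss over annotated frames only.
    return (per_frame * mask).sum() / mask.sum().clamp(min=1.0)
```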
In addition to active learning, we investigated using cluster analysis to group recordings with similar recording conditions. Features were then normalized using cluster-wise statistics to mitigate the distribution shift caused by mismatched recording conditions. This approach clearly outperformed feature normalization based on global statistics and on per-recording statistics.
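The following sketch illustrates cluster-conditional feature normalization under assumed details: recordings are grouped here by k-means on their mean feature vectors, which is an illustrative choice and not necessarily the clustering used in the thesis.

```python
# Minimal sketch (assumptions, not the thesis implementation) of normalizing
# each recording with the statistics of its recording-condition cluster,
# instead of global or per-recording statistics.
import numpy as np
from sklearn.cluster import KMeans

def cluster_normalize(recordings, n_clusters=4):
    """recordings: list of (frames, feature_dim) arrays, e.g. log-mel spectra."""
    # Summarize each recording by its mean feature vector and cluster recordings.
    summaries = np.stack([r.mean(axis=0) for r in recordings])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(summaries)
    # Compute mean/std over all frames belonging to each cluster.
    stats = {}
    for c in set(labels):
        frames = np.concatenate([r for r, l in zip(recordings, labels) if l == c])
        stats[c] = (frames.mean(axis=0), frames.std(axis=0) + 1e-8)
    # Normalize every recording with its cluster's statistics.
    return [(r - stats[c][0]) / stats[c][1] for r, c in zip(recordings, labels)]
```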
The proposed active learning methods enable efficient labeling of large-scale audio datasets, potentially saving a large amount of annotation effort in the development of acoustic models. In addition, the core ideas behind the proposed methods are generic and can be extended to other problems such as natural language processing, as investigated in [8].
| Original language | English |
|---|---|
| Place of Publication | Tampere |
| Publisher | Tampere University |
| ISBN (Electronic) | 978-952-03-2266-3 |
| ISBN (Print) | 978-952-03-2265-6 |
| Publication status | Published - 2022 |
| Publication type | G5 Doctoral dissertation (articles) |
Publication series
| Name | Tampere University Dissertations - Tampereen yliopiston väitöskirjat |
|---|---|
| Volume | 541 |
| ISSN (Print) | 2489-9860 |
| ISSN (Electronic) | 2490-0028 |