TY - JOUR
T1 - Zero-Shot Audio Classification Via Semantic Embeddings
AU - Xie, Huang
AU - Virtanen, Tuomas
N1 - Funding Information:
Manuscript received August 5, 2020; revised November 19, 2020 and February 11, 2021; accepted March 3, 2021. Date of publication March 11, 2021; date of current version March 26, 2021. This work was supported by the European Research Council under the European Union’s H2020 Framework Program through ERC Grant Agreement 637422 EVERYSOUND. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Wenwu Wang. (Corresponding author: Huang Xie.) The authors are with the Faculty of Information Technology and Communication Sciences, Tampere University, Tampere 33720, Finland (e-mail: huang.xie@tuni.fi; tuomas.virtanen@tuni.fi). Digital Object Identifier 10.1109/TASLP.2021.3065234
Publisher Copyright:
© 2014 IEEE.
PY - 2021
Y1 - 2021
N2 - In this paper, we study zero-shot learning in audio classification via semantic embeddings extracted from textual labels and sentence descriptions of sound classes. Our goal is to obtain a classifier that is capable of recognizing audio instances of sound classes that have no available training samples, but only semantic side information. We employ a bilinear compatibility framework to learn an acoustic-semantic projection between intermediate-level representations of audio instances and sound classes, i.e., acoustic embeddings and semantic embeddings. We use VGGish to extract deep acoustic embeddings from audio clips, and pre-trained language models (Word2Vec, GloVe, BERT) to generate either label embeddings from textual labels or sentence embeddings from sentence descriptions of sound classes. Audio classification is performed by a linear compatibility function that measures how compatible an acoustic embedding and a semantic embedding are. We evaluate the proposed method on a small balanced dataset ESC-50 and a large-scale unbalanced audio subset of AudioSet. The experimental results show that classification performance is significantly improved by involving sound classes that are semantically close to the test classes in training. Meanwhile, we demonstrate that both label embeddings and sentence embeddings are useful for zero-shot learning. Classification performance is improved by concatenating label/sentence embeddings generated with different language models. With their hybrid concatenations, the results are improved further.
AB - In this paper, we study zero-shot learning in audio classification via semantic embeddings extracted from textual labels and sentence descriptions of sound classes. Our goal is to obtain a classifier that is capable of recognizing audio instances of sound classes that have no available training samples, but only semantic side information. We employ a bilinear compatibility framework to learn an acoustic-semantic projection between intermediate-level representations of audio instances and sound classes, i.e., acoustic embeddings and semantic embeddings. We use VGGish to extract deep acoustic embeddings from audio clips, and pre-trained language models (Word2Vec, GloVe, BERT) to generate either label embeddings from textual labels or sentence embeddings from sentence descriptions of sound classes. Audio classification is performed by a linear compatibility function that measures how compatible an acoustic embedding and a semantic embedding are. We evaluate the proposed method on a small balanced dataset ESC-50 and a large-scale unbalanced audio subset of AudioSet. The experimental results show that classification performance is significantly improved by involving sound classes that are semantically close to the test classes in training. Meanwhile, we demonstrate that both label embeddings and sentence embeddings are useful for zero-shot learning. Classification performance is improved by concatenating label/sentence embeddings generated with different language models. With their hybrid concatenations, the results are improved further.
KW - Audio classification
KW - semantic embedding
KW - zero-shot learning
U2 - 10.1109/TASLP.2021.3065234
DO - 10.1109/TASLP.2021.3065234
M3 - Article
AN - SCOPUS:85102679921
SN - 2329-9290
VL - 29
SP - 1233
EP - 1242
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
ER -