TY - GEN
T1 - Automatic analysis of the emotional content of speech in daylong child-centered recordings from a neonatal intensive care unit
AU - Vaaras, Einari
AU - Ahlqvist-Björkroth, Sari
AU - Drossos, Konstantinos
AU - Räsänen, Okko
N1 - Publisher Copyright:
Copyright © 2021 ISCA.
jufoid=59094
PY - 2021
Y1 - 2021
N2 - Researchers have recently started to study how the emotional speech heard by young infants can affect their developmental outcomes. As a part of this research, hundreds of hours of daylong recordings from preterm infants' audio environments were collected from two hospitals in Finland and Estonia in the context of the so-called APPLE study. In order to analyze the emotional content of speech in such a massive dataset, an automatic speech emotion recognition (SER) system is required. However, there are no emotion labels or existing in-domain SER systems to be used for this purpose. In this paper, we introduce this initially unannotated large-scale real-world audio dataset and describe the development of a functional SER system for the Finnish subset of the data. We explore the effectiveness of alternative state-of-the-art techniques for deploying an SER system to a new domain, comparing cross-corpus generalization, WGAN-based domain adaptation, and active learning in the task. As a result, we show that the best-performing models are able to achieve a classification performance of 73.4% unweighted average recall (UAR) and 73.2% UAR for binary classification of valence and arousal, respectively. The results also show that active learning achieves the most consistent performance compared to the two alternatives.
AB - Researchers have recently started to study how the emotional speech heard by young infants can affect their developmental outcomes. As a part of this research, hundreds of hours of daylong recordings from preterm infants' audio environments were collected from two hospitals in Finland and Estonia in the context of the so-called APPLE study. In order to analyze the emotional content of speech in such a massive dataset, an automatic speech emotion recognition (SER) system is required. However, there are no emotion labels or existing in-domain SER systems to be used for this purpose. In this paper, we introduce this initially unannotated large-scale real-world audio dataset and describe the development of a functional SER system for the Finnish subset of the data. We explore the effectiveness of alternative state-of-the-art techniques for deploying an SER system to a new domain, comparing cross-corpus generalization, WGAN-based domain adaptation, and active learning in the task. As a result, we show that the best-performing models are able to achieve a classification performance of 73.4% unweighted average recall (UAR) and 73.2% UAR for binary classification of valence and arousal, respectively. The results also show that active learning achieves the most consistent performance compared to the two alternatives.
KW - Daylong audio
KW - LENA recorder
KW - Real-world audio
KW - Speech analysis
KW - Speech emotion recognition
U2 - 10.21437/Interspeech.2021-303
DO - 10.21437/Interspeech.2021-303
M3 - Conference contribution
AN - SCOPUS:85119258946
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 3380
EP - 3384
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
T2 - Annual Conference of the International Speech Communication Association
Y2 - 30 August 2021 through 3 September 2021
ER -