Low-Latency Single-Channel Speech Separation with Deep Neural Networks

Research output: Book/Report > Doctoral thesis > Collection of Articles

Abstract

Selectively attending to speakers of interest in complex acoustic environments comes naturally to humans but has proven difficult for machines. Speech separation, a fundamental area of research in audio signal processing, involves isolating and extracting individual speech sources from an audio mixture that may contain multiple overlapping sources. This dissertation aims to develop speech separation techniques for situations where only a single microphone is available, known as single-channel speech separation (SCSS), and for applications with stringent latency requirements, e.g., hearing aids. This setting is especially challenging because the latency constraint results in poor spectral resolution and, hence, significant overlap between the constituent sources in the time-frequency domain.

In recent years, deep neural networks (DNNs) have revolutionized the field of SCSS, showing remarkable performance. DNNs possess powerful modeling abilities and use a data-driven approach to learn the underlying patterns and dependencies within audio signals. However, their application in scenarios with very stringent algorithmic latency constraints (< 10 ms) remains largely unexplored. This gap in knowledge is the core motivation for this dissertation.

In this dissertation, we investigated different neural network architectures, i.e., feedforward, recurrent, and convolutional recurrent architectures, for low-latency SCSS, basing our evaluation on objective metrics as well as subjective listening tests with hearing-impaired (HI) listeners. Our methods yielded subjective benefits for HI listeners, which we support with a comprehensive statistical analysis of the obtained results.

Moreover, we proposed a novel loss function for training DNNs that incorporates an objective metric of speech intelligibility; training with it improved speech intelligibility as measured by that optimization target. Additionally, we proposed a low-latency modification to a popular SCSS method known as deep clustering. Finally, we proposed an asymmetric windowing-based feature extraction scheme to mitigate the poor spectral resolution inherent to low-latency SCSS, which improved on the symmetric windowing baseline.

Original language: English
Place of Publication: Tampere
Publisher: Tampere University
ISBN (Electronic): 978-952-03-3430-7
ISBN (Print): 978-952-03-3429-1
Publication status: Published - 2024
Publication type: G5 Doctoral dissertation (articles)

Publication series

Name: Tampere University Dissertations - Tampereen yliopiston väitöskirjat
Volume: 1017
ISSN (Print): 2489-9860
ISSN (Electronic): 2490-0028
