Binaural rendering of microphone array captures based on source separation

Joonas Nikunen, Aleksandr Diment, Tuomas Virtanen, Miikka Vilermo

    Research output: Contribution to journal › Article › Scientific › peer-review

    1 Citation (Scopus)


    This paper proposes a method for binaural reconstruction of a sound scene captured with a portable-sized array consisting of several microphones. The proposed processing separates the scene into a sum of a small number of sources, and the spectrogram of each source is in turn represented by a small number of latent components. The direction of arrival (DOA) of each source is estimated, followed by binaural rendering of each source at its estimated direction. To represent the sources, the proposed method uses low-rank complex-valued non-negative matrix factorization combined with a DOA-based spatial covariance matrix model. The binaural reconstruction is achieved by applying the binaural cues (head-related transfer function) associated with the estimated source DOA to the separated source signals. The binaural rendering quality of the proposed method was evaluated using a speech intelligibility test. The test results indicated that, in a three-speaker scenario, the proposed binaural rendering improved the intelligibility of speech over stereo recordings and over separation by a minimum variance distortionless response beamformer with the same binaural synthesis. An additional listening test evaluating the subjective quality of the rendered output indicated no added processing artifacts from the proposed method in comparison to an unprocessed stereo recording.
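The final rendering step described in the abstract (applying the binaural cues for each estimated DOA to the separated source signals) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the sources are already separated and their DOAs already mapped to impulse-response pairs, and the names `render_binaural` and `toy_hrir` are hypothetical. The toy responses model only an interaural delay and level difference, whereas a real system would use measured head-related impulse responses.

```python
import numpy as np

def render_binaural(sources, hrirs):
    """Filter each separated source with the left/right impulse responses
    chosen for its direction, then sum the results per ear.

    sources: list of 1-D arrays of equal length (assumed already separated)
    hrirs:   list of (left_ir, right_ir) pairs of equal length, one per source
    Returns an array of shape (2, n_samples + ir_len - 1).
    """
    n_out = len(sources[0]) + len(hrirs[0][0]) - 1
    out = np.zeros((2, n_out))
    for signal, (h_left, h_right) in zip(sources, hrirs):
        out[0] += np.convolve(signal, h_left)   # left-ear contribution
        out[1] += np.convolve(signal, h_right)  # right-ear contribution
    return out

def toy_hrir(delay_left, delay_right, gain_left, gain_right, length=32):
    """Hypothetical stand-in for a measured HRIR pair: one delayed,
    scaled impulse per ear (interaural time and level difference only)."""
    h_left = np.zeros(length)
    h_right = np.zeros(length)
    h_left[delay_left] = gain_left
    h_right[delay_right] = gain_right
    return h_left, h_right

# Two synthetic "separated sources" placed on opposite sides of the listener.
rng = np.random.default_rng(0)
source_a = rng.standard_normal(1000)
source_b = rng.standard_normal(1000)
binaural = render_binaural(
    [source_a, source_b],
    [toy_hrir(0, 8, 1.0, 0.5), toy_hrir(8, 0, 0.5, 1.0)],
)
```

Here `binaural[0]` and `binaural[1]` hold the left and right ear signals. Replacing `toy_hrir` with responses looked up from a measured HRTF set at each estimated DOA would give the kind of synthesis the abstract refers to.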
    Original language: English
    Pages (from-to): 157–169
    Journal: Speech Communication
    Publication status: Published - 2016
    Publication type: A1 Journal article-refereed


    Keywords

    • Binaural processing
    • Source separation
    • Speech intelligibility
    • Non-negative matrix factorization

    Publication forum classification

    • Publication forum level 2

