Unsupervised Interpretable Representation Learning for Singing Voice Separation

Stylianos Ioannis Mimilakis, Konstantinos Drossos, Gerald Schuller

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review


In this work, we present a method for learning interpretable music signal representations directly from waveform signals. Our method can be trained using unsupervised objectives and relies on the denoising auto-encoder model, using a simple sinusoidal model as the decoding function to reconstruct the singing voice. To demonstrate the benefits of our method, we apply the obtained representations to the task of informed singing voice separation via binary masking, and measure the resulting separation quality by means of the scale-invariant signal-to-distortion ratio. Our findings suggest that our method is capable of learning meaningful representations for singing voice separation, while preserving conveniences of the short-time Fourier transform, such as non-negativity, smoothness, and reconstruction subject to time-frequency masking, that are desired in audio and music source separation.
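The two evaluation ingredients named in the abstract, informed binary masking and the scale-invariant signal-to-distortion ratio (SI-SDR), follow standard definitions. A minimal NumPy sketch of both is given below; the function names and the use of oracle source magnitudes to build the mask are illustrative assumptions, not the authors' implementation, and the representation here is any non-negative magnitude-like array such as the one the paper learns.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    # Project the estimate onto the reference; the projection is the target
    # component and the residual counts as distortion.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(residual ** 2) + eps))

def binary_mask(voice_repr, accomp_repr):
    """Informed (oracle) binary mask: 1 where the singing-voice magnitude
    dominates the accompaniment, 0 elsewhere."""
    return (np.abs(voice_repr) >= np.abs(accomp_repr)).astype(np.float64)

# Hypothetical usage: mask the mixture representation to isolate the voice.
rng = np.random.default_rng(0)
voice, accomp = rng.random((64, 100)), rng.random((64, 100))
mixture = voice + accomp
voice_estimate = binary_mask(voice, accomp) * mixture
```

Because SI-SDR first rescales the reference by the optimal projection factor, multiplying the estimate by any non-zero constant leaves the score unchanged, which is why it is preferred over plain SDR when a learned representation does not preserve absolute gain.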
Original language: English
Title of host publication: 28th European Signal Processing Conference
Number of pages: 5
ISBN (Electronic): 978-9-0827-9705-3
Publication status: Published - 2020
Publication type: A4 Article in a conference publication
Event: European Signal Processing Conference - Beurs van Berlage, Amsterdam, Netherlands
Duration: 18 Jan 2021 - 22 Jan 2021
Conference number: 28

Publication series

Name: European Signal Processing Conference
ISSN (Print): 2219-5491


Conference: European Signal Processing Conference
Abbreviated title: EUSIPCO2020

Publication forum classification

  • Publication forum level 1


