TY - JOUR
T1 - Self-labeling sounds using optimal transport
AU - Harju, Manu
AU - Font, Frederic
AU - Mesaros, Annamaria
PY - 2026
Y1 - 2026
N2 - Self-labeling is a method to simultaneously learn representations and classes using unlabeled data. The naive approach to self-labeling leads to a degenerate solution, and the model-generated labels require regularization to serve as useful training targets. In this work, we adapt a self-labeling method using optimal transport to the audio domain using the FSD50K dataset. We analyze the structure of the learned representations and compare the emergent classes with the reference annotations. We compare the learned representations with the ones produced using Bootstrap Your Own Latent for Audio (BYOL-A) across several downstream tasks. Our findings indicate that the method learns to group perceptually similar sounds without supervision. The results show that the method is a viable approach for audio representation learning, and that the learned embeddings are as effective for downstream tasks as the ones obtained with the benchmark method. As an additional outcome, the generated classifications give valuable insight into what the model learns, promoting explainability in feature learning.
AB - Self-labeling is a method to simultaneously learn representations and classes using unlabeled data. The naive approach to self-labeling leads to a degenerate solution, and the model-generated labels require regularization to serve as useful training targets. In this work, we adapt a self-labeling method using optimal transport to the audio domain using the FSD50K dataset. We analyze the structure of the learned representations and compare the emergent classes with the reference annotations. We compare the learned representations with the ones produced using Bootstrap Your Own Latent for Audio (BYOL-A) across several downstream tasks. Our findings indicate that the method learns to group perceptually similar sounds without supervision. The results show that the method is a viable approach for audio representation learning, and that the learned embeddings are as effective for downstream tasks as the ones obtained with the benchmark method. As an additional outcome, the generated classifications give valuable insight into what the model learns, promoting explainability in feature learning.
U2 - 10.1109/OJSP.2026.3659053
DO - 10.1109/OJSP.2026.3659053
M3 - Article
SN - 2644-1322
VL - 7
JO - IEEE Open Journal of Signal Processing
JF - IEEE Open Journal of Signal Processing
ER -