TY - GEN
T1 - Inter-Speaker Relative Cues for Text-Guided Target Speech Extraction
AU - Dai, Wang
AU - Politis, Archontis
AU - Virtanen, Tuomas
N1 - Publisher Copyright:
© 2025 International Speech Communication Association. All rights reserved.
PY - 2025
Y1 - 2025
N2 - We propose a novel approach that utilizes inter-speaker relative cues for distinguishing target speakers and extracting their voices from mixtures. Continuous cues (e.g., temporal order, age, pitch level) are grouped by relative differences, while discrete cues (e.g., language, gender, emotion) retain their categories. Relative cues offer greater flexibility than fixed speech attribute classification, facilitating much easier expansion of text-guided target speech extraction datasets. Our experiments show that combining all relative cues yields better performance than random subsets, with gender and temporal order being the most robust across languages and reverberant conditions. Additional cues like pitch level, loudness, distance, speaking duration, language, and pitch range also demonstrate notable benefits in complex scenarios. Fine-tuning pre-trained WavLM Base+ CNN encoders improves overall performance over the baseline of using only a Conv1d encoder.
AB - We propose a novel approach that utilizes inter-speaker relative cues for distinguishing target speakers and extracting their voices from mixtures. Continuous cues (e.g., temporal order, age, pitch level) are grouped by relative differences, while discrete cues (e.g., language, gender, emotion) retain their categories. Relative cues offer greater flexibility than fixed speech attribute classification, facilitating much easier expansion of text-guided target speech extraction datasets. Our experiments show that combining all relative cues yields better performance than random subsets, with gender and temporal order being the most robust across languages and reverberant conditions. Additional cues like pitch level, loudness, distance, speaking duration, language, and pitch range also demonstrate notable benefits in complex scenarios. Fine-tuning pre-trained WavLM Base+ CNN encoders improves overall performance over the baseline of using only a Conv1d encoder.
KW - LLM
KW - pre-trained model
KW - target speech extraction
UR - https://www.scopus.com/pages/publications/105020091972
U2 - 10.21437/Interspeech.2025-1554
DO - 10.21437/Interspeech.2025-1554
M3 - Conference contribution
AN - SCOPUS:105020091972
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 1918
EP - 1922
BT - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
PB - ISCA
T2 - Interspeech
Y2 - 17 August 2025 through 21 August 2025
ER -