
Inter-Speaker Relative Cues for Text-Guided Target Speech Extraction

Research output: Conference article, Scientific, peer-reviewed


Abstract

We propose a novel approach that utilizes inter-speaker relative cues for distinguishing target speakers and extracting their voices from mixtures. Continuous cues (e.g., temporal order, age, pitch level) are grouped by relative differences, while discrete cues (e.g., language, gender, emotion) retain their categories. Relative cues offer greater flexibility than fixed speech attribute classification, making it much easier to expand text-guided target speech extraction datasets. Our experiments show that combining all relative cues yields better performance than random subsets, with gender and temporal order being the most robust across languages and reverberant conditions. Additional cues such as pitch level, loudness, distance, speaking duration, language, and pitch range also demonstrate notable benefits in complex scenarios. Fine-tuning pre-trained WavLM Base+ CNN encoders improves overall performance over the baseline of using only a Conv1d encoder.
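As a rough illustration of the distinction the abstract draws between continuous and discrete cues, the following Python sketch derives relative cue labels for a target speaker against one interfering speaker. It is not the authors' implementation: the `Speaker` fields, the label strings ("higher", "first", and so on), and the two-speaker setup are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Speaker:
    # All fields are hypothetical attributes chosen for illustration.
    pitch_hz: float   # continuous: average pitch
    onset_s: float    # continuous: start time within the mixture
    gender: str       # discrete
    language: str     # discrete


def compare(a: float, b: float, greater: str, smaller: str):
    """Return a relative label, or None when the values tie."""
    if a > b:
        return greater
    if a < b:
        return smaller
    return None


def relative_cues(target: Speaker, other: Speaker) -> dict:
    """Build cue labels that distinguish `target` from `other`."""
    cues = {}

    # Continuous cues: expressed relative to the other speaker.
    pitch = compare(target.pitch_hz, other.pitch_hz, "higher", "lower")
    if pitch is not None:
        cues["pitch_level"] = pitch
    order = compare(target.onset_s, other.onset_s, "second", "first")
    if order is not None:
        cues["temporal_order"] = order

    # Discrete cues: keep the category itself, but only when it
    # actually separates the two speakers.
    if target.gender != other.gender:
        cues["gender"] = target.gender
    if target.language != other.language:
        cues["language"] = target.language

    return cues


a = Speaker(pitch_hz=210.0, onset_s=0.0, gender="female", language="English")
b = Speaker(pitch_hz=120.0, onset_s=1.5, gender="male", language="English")
print(relative_cues(a, b))
```

In this sketch a cue is dropped whenever it cannot tell the two speakers apart (equal values or identical categories), so a text prompt built from the remaining cues always refers to one speaker only.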

Original language: English
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher: ISCA
Pages: 1918-1922
DOI - permanent links
Status: Published - 2025
Ministry of Education and Culture (OKM) publication type: A4 Article in conference proceedings
Event: Interspeech - Rotterdam, Netherlands
Duration: 17 Aug 2025 to 21 Aug 2025

Publication series

Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
ISSN (print): 2308-457X

Conference

Conference: Interspeech
Country/Territory: Netherlands
City: Rotterdam
Period: 17/08/25 to 21/08/25

Publication Forum level

  • Jufo level 1

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Language and Linguistics
  • Modelling and Simulation
  • Human-Computer Interaction
