Abstract
Systems that can find correspondences between multiple modalities, such as between speech and images, have great potential to solve different recognition and data analysis tasks in an unsupervised manner. This work studies multimodal learning in the context of visually grounded speech (VGS) models, and focuses on their recently demonstrated capability to extract spatiotemporal alignments between spoken words and the corresponding visual objects without ever having been explicitly trained for object localization or word recognition. As the main contributions, we formalize the alignment problem in terms of an audiovisual alignment tensor that is based on earlier VGS work, introduce systematic metrics for evaluating model performance in aligning visual objects and spoken words, and propose a new VGS model variant for the alignment task that utilizes a cross-modal attention layer. We test our model and a previously proposed model in the alignment task using SPEECH-COCO captions coupled with MSCOCO images. We compare the alignment performance measured with our proposed evaluation metrics to performance in the semantic retrieval task commonly used to evaluate VGS models. We show that the cross-modal attention layer not only helps the model to achieve higher semantic cross-modal retrieval performance, but also leads to substantial improvements in the alignment performance between image objects and spoken words.
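To make the idea of an audiovisual alignment tensor more concrete, the following is a minimal sketch and not the formulation used in the paper, which this record does not reproduce. It assumes that the tensor holds a dot-product similarity between every spatial image feature and every audio frame feature, in the spirit of earlier VGS work. The function names `alignment_tensor` and `best_location`, the feature shapes, and the toy values are all hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def alignment_tensor(
    image_feats: List[List[Vector]], audio_feats: List[Vector]
) -> List[List[List[float]]]:
    """Return an H x W x T tensor of dot products.

    image_feats: H x W grid of D-dimensional image embeddings (assumed).
    audio_feats: T frames of D-dimensional audio embeddings (assumed).
    """
    return [
        [
            [sum(v * a for v, a in zip(cell, frame)) for frame in audio_feats]
            for cell in row
        ]
        for row in image_feats
    ]


def best_location(tensor: List[List[List[float]]], t: int) -> Tuple[int, int]:
    """Return the (row, column) whose score is highest for audio frame t."""
    scores = {
        (h, w): cell[t]
        for h, row in enumerate(tensor)
        for w, cell in enumerate(row)
    }
    return max(scores, key=scores.get)


# Toy example: a 2 x 2 image grid and 3 audio frames, with D = 2.
image = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 0.0]]]
audio = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

tensor = alignment_tensor(image, audio)
location = best_location(tensor, 2)
```

Reading the tensor along the time axis for a fixed location gives a temporal profile for that image region, while reading it across locations for a fixed frame gives a spatial map for that moment of speech. An alignment evaluation could compare either view against annotated object regions and word time spans.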
| Original language | English |
|---|---|
| Title | 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 |
| Publisher | International Speech Communication Association ISCA |
| Pages | 2996-3000 |
| Number of pages | 5 |
| ISBN (electronic) | 9781713836902 |
| DOI - permanent links | |
| Status | Published - 2021 |
| OKM (Ministry of Education and Culture) publication type | A4 Article in conference proceedings |
| Event | Annual Conference of the International Speech Communication Association - Brno, Czech Republic. Duration: 30 Aug 2021 → 3 Sep 2021 |
Publication series
| Name | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
|---|---|
| Volume | 2 |
| ISSN (print) | 2308-457X |
| ISSN (electronic) | 1990-9772 |
Conference
| Conference | Annual Conference of the International Speech Communication Association |
|---|---|
| Country/Territory | Czech Republic |
| City | Brno |
| Period | 30/08/21 → 3/09/21 |
Publication Forum (JUFO) level
- JUFO level 1
Fingerprint
Dive into the research topics of 'Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models'. Together they form a unique fingerprint.