Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models

Research output: Conference article, Scientific, peer-reviewed

6 Citations (Scopus)
24 Downloads (Pure)

Abstract

Systems that can find correspondences between multiple modalities,
such as between speech and images, have great potential to solve
different recognition and data analysis tasks in an unsupervised
manner. This work studies multimodal learning in the context of
visually grounded speech (VGS) models, and focuses on their recently
demonstrated capability to extract spatiotemporal alignments between
spoken words and the corresponding visual objects without ever being
explicitly trained for object localization or word recognition. As the
main contributions, we formalize the alignment problem in terms of an
audiovisual alignment tensor that is based on earlier VGS work,
introduce systematic metrics for evaluating model performance in
aligning visual objects and spoken words, and propose a new VGS model
variant for the alignment task utilizing a cross-modal attention layer.
We test our model and a previously proposed model in the alignment task
using SPEECH-COCO captions coupled with MSCOCO images. We compare the
alignment performance using our proposed evaluation metrics to the
semantic retrieval task commonly used to evaluate VGS models. We show
that the cross-modal attention layer not only helps the model to achieve
higher semantic cross-modal retrieval performance, but also leads to
substantial improvements in the alignment performance between image
objects and spoken words.
Original language: English
Title of host publication: 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Publisher: International Speech Communication Association ISCA
Pages: 2996-3000
Number of pages: 5
ISBN (electronic): 9781713836902
DOIs
Publication status: Published - 2021
MoE publication type: A4 Article in conference proceedings
Event: Annual Conference of the International Speech Communication Association - Brno, Czech Republic
Duration: 30 Aug 2021 to 3 Sep 2021

Publication series

Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2
ISSN (print): 2308-457X
ISSN (electronic): 1990-9772

Conference

Conference: Annual Conference of the International Speech Communication Association
Country/Territory: Czech Republic
City: Brno
Period: 30/08/21 to 3/09/21

Publication Forum level

  • Jufo level 1
