Synchformer: Efficient Synchronization From Sparse Cues

Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

Our objective is audio-visual synchronization with a focus on ‘in-the-wild’ videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet, a million-scale ‘in-the-wild’ dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability. Project page: robots.ox.ac.uk/~vgg/research/synchformer
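The segment-level contrastive pre-training mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name, feature shapes, and temperature value are assumptions; it only shows the general idea of an InfoNCE-style objective that pulls together audio and visual embeddings of the same segment and pushes apart mismatched ones.

```python
# Minimal sketch (assumed, not the paper's implementation) of a
# segment-level audio-visual contrastive objective.
import torch
import torch.nn.functional as F

def segment_contrastive_loss(audio_feats, visual_feats, temperature=0.07):
    """InfoNCE-style loss over per-segment embeddings.

    audio_feats, visual_feats: (num_segments, dim) tensors produced by
    modality-specific feature extractors (shapes/names are assumptions).
    """
    a = F.normalize(audio_feats, dim=-1)
    v = F.normalize(visual_feats, dim=-1)
    logits = a @ v.t() / temperature                      # cosine similarities
    targets = torch.arange(a.size(0), device=a.device)    # matching segments on the diagonal
    # Symmetric cross-entropy: audio-to-visual and visual-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example usage with random features standing in for extractor outputs.
audio = torch.randn(16, 256)    # 16 segments, 256-dim embeddings (assumed)
visual = torch.randn(16, 256)
loss = segment_contrastive_loss(audio, visual)
```

After such pre-training, the frozen (or decoupled) feature extractors feed a separate synchronization module, which is the decoupling the abstract describes.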
Original language: English
Title of host publication: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Publisher: IEEE
Pages: 5325-5329
ISBN (Electronic): 979-8-3503-4485-1
DOIs
Publication status: Published - 2024
Publication type: A4 Article in conference proceedings
Event: IEEE International Conference on Acoustics, Speech and Signal Processing - Seoul, Korea, Republic of
Duration: 14 Apr 2024 - 19 Apr 2024

Publication series

Name
ISSN (Electronic): 2379-190X

Conference

Conference: IEEE International Conference on Acoustics, Speech and Signal Processing
Country/Territory: Korea, Republic of
City: Seoul
Period: 14/04/24 - 19/04/24

Publication forum classification

  • Publication forum level 2
