A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer

Vladimir Iashin, Esa Rahtu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Professional


Abstract

Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting visual features alone, while completely neglecting the audio track. Only a few prior works have utilized both modalities, and they either show poor results or demonstrate the benefit only on a dataset from a specific domain. In this paper, we introduce the Bi-modal Transformer, which generalizes the Transformer architecture to a bi-modal input. We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, although the module is capable of digesting any two modalities in a sequence-to-sequence task. We also show that the pre-trained bi-modal encoder, as part of the Bi-modal Transformer, can be used as a feature extractor for a simple proposal generation module. Performance is demonstrated on the challenging ActivityNet Captions dataset, where our model achieves outstanding results. The code is available at v-iashin.github.io/bmt
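
For intuition, the sketch below illustrates the bi-modal encoder idea the abstract describes: each modality runs self-attention over its own sequence and cross-attends to the other, so the audio and visual streams are fused at every layer. This is a minimal assumption-laden illustration, not the authors' released implementation; the class name BiModalEncoderLayer, the exact layer composition, and all dimensions are hypothetical (the actual code is at v-iashin.github.io/bmt).

import torch
import torch.nn as nn

class BiModalEncoderLayer(nn.Module):
    """Illustrative bi-modal encoder layer (hypothetical, not the paper's code):
    each modality attends to itself, then queries the other modality."""

    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        # Self-attention within each modality.
        self.self_attn_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention: audio queries visual keys/values, and vice versa.
        self.cross_attn_av = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_va = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio: (B, T_audio, d_model), visual: (B, T_visual, d_model)
        a, _ = self.self_attn_a(audio, audio, audio)
        v, _ = self.self_attn_v(visual, visual, visual)
        # Each modality attends to the other, fusing the two streams.
        a_fused, _ = self.cross_attn_av(a, v, v)
        v_fused, _ = self.cross_attn_va(v, a, a)
        # Residual connections and normalization, as in a standard Transformer.
        return self.norm_a(audio + a_fused), self.norm_v(visual + v_fused)

# Toy usage: 10 audio steps and 24 visual steps with a shared model width.
layer = BiModalEncoderLayer(d_model=128, n_heads=4)
a_out, v_out = layer(torch.randn(2, 10, 128), torch.randn(2, 24, 128))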
Original language: English
Title of host publication: The 31st British Machine Vision Virtual Conference
Subtitle of host publication: 7th - 10th September 2020
Publisher: BMVA Press
Number of pages: 16
Publication status: Published - 10 Sept 2020
Publication type: D3 Professional conference proceedings
Event: British Machine Vision Conference - Virtual
Duration: 7 Sept 2020 - 10 Sept 2020
Conference number: 31
https://www.bmvc2020-conference.com/

Conference

Conference: British Machine Vision Conference
Abbreviated title: BMVC 2020
Period: 7/09/20 - 10/09/20
Internet address: https://www.bmvc2020-conference.com/
