AVCaps: An audio-visual dataset with modality-specific captions

Dataset

Description

The AVCaps dataset is an audio-visual captioning resource designed to advance research in multimodal machine perception. Derived from the VidOR dataset, it features 2061 video clips spanning a total of 28.8 hours.
Date made available: 20 Dec 2024
Publisher: Zenodo

Funding

Funders: Jane and Aatos Erkko Foundation

Field of science (Statistics Finland): 113 Computer and information sciences
