TY - JOUR
T1 - The CORSMAL benchmark for the prediction of the properties of containers
AU - Xompero, Alessio
AU - Donaher, Santiago
AU - Iashin, Vladimir
AU - Palermo, Francesca
AU - Solak, Gokhan
AU - Coppola, Claudio
AU - Ishikawa, Reina
AU - Nagao, Yuichi
AU - Hachiuma, Ryo
AU - Liu, Qi
AU - Feng, Fan
AU - Lan, Chuanlin
AU - Chan, Rosa H.M.
AU - Christmann, Guilherme
AU - Song, Jyun Ting
AU - Neeharika, Gonuguntla
AU - Reddy, Chinnakotla K.T.
AU - Jain, Dinesh
AU - Rehman, Bakhtawar Ur
AU - Cavallaro, Andrea
N1 - Publisher Copyright: Author
PY - 2022
Y1 - 2022
AB - The contactless estimation of the weight of a container and the amount of its content manipulated by a person are key pre-requisites for safe human-to-robot handovers. However, opaqueness and transparencies of the container and the content, and variability of materials, shapes, and sizes, make this problem challenging. In this paper, we present a range of methods and an open framework to benchmark acoustic and visual perception for the estimation of the capacity of a container, and the type, mass, and amount of its content. The framework includes a dataset, specific tasks, and performance measures. We conduct a fair and in-depth comparative analysis of methods that used this framework and audio-only or vision-only baselines designed from related works. Based on this analysis, we can conclude that audio-only and audio-visual classifiers are suitable for the estimation of the type and amount of the content using different types of convolutional neural networks, combined with either recurrent neural networks or a majority voting strategy, whereas computer vision methods are suitable to determine the capacity of the container using regression and geometric approaches. Classifying the content type and level using only audio achieves a weighted average F1-score up to 81% and 97%, respectively. Estimating the container capacity with vision-only approaches and the filling mass with audio-visual, multi-stage algorithms reaches up to 65% weighted average capacity and mass scores. These results show that there is still room for improvement in the design of future methods that will be ranked and compared on the individual leaderboards provided by our open framework.
KW - Acoustic signal processing
KW - audio-visual classification
KW - Containers
KW - Convolutional neural networks
KW - Estimation
KW - Filling
KW - image and video signal processing
KW - object properties recognition
KW - Robots
KW - Spectrogram
KW - Task analysis
U2 - 10.1109/ACCESS.2022.3166906
DO - 10.1109/ACCESS.2022.3166906
M3 - Article
AN - SCOPUS:85128273552
SN - 2169-3536
VL - 10
SP - 41388
EP - 41402
JO - IEEE Access
JF - IEEE Access
ER -