Monocular Depth Estimation Primed by Salient Point Detection and Normalized Hessian Loss

Lam Huynh, Matteo Pedone, P. Nguyen, Jiri Matas, Esa Rahtu, Janne Heikkilä

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review



Deep neural networks have recently thrived on single-image depth estimation. However, current developments on this topic highlight an apparent trade-off between accuracy and network size. This work proposes an accurate and lightweight framework for monocular depth estimation based on a self-attention mechanism stemming from salient point detection. Specifically, we utilize a sparse set of keypoints to train a FuSaNet model that consists of two major components: Fusion-Net and Saliency-Net. In addition, we introduce a normalized Hessian loss term invariant to scaling and shear along the depth direction, which is shown to substantially improve accuracy. The proposed method achieves state-of-the-art results on NYU-Depth-v2 and KITTI while using a model 3.1-38.4 times smaller, in terms of the number of parameters, than baseline approaches. Experiments on SUN-RGBD further demonstrate the generalizability of the proposed method.
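The abstract states that the normalized Hessian loss is invariant to scaling and shear along the depth direction, but does not give its formula. As a hypothetical NumPy sketch (not the authors' implementation), a loss with those stated invariances can be built from second-order finite differences of the depth map, since an affine shear term a*x + b*y + c vanishes under second differentiation, and a per-pixel normalization of the Hessian entries cancels a global depth scale:

```python
import numpy as np

def hessian_entries(d):
    # Second-order finite differences of a depth map d (H x W).
    # Affine terms a*x + b*y + c cancel here, so shear along depth vanishes.
    dxx = d[:, 2:] - 2.0 * d[:, 1:-1] + d[:, :-2]
    dyy = d[2:, :] - 2.0 * d[1:-1, :] + d[:-2, :]
    dxy = (d[2:, 2:] - d[2:, :-2] - d[:-2, 2:] + d[:-2, :-2]) / 4.0
    # Crop all three maps to the common (H-2) x (W-2) interior so they align.
    return dxx[1:-1, :], dyy[:, 1:-1], dxy

def normalized_hessian_loss(pred, gt, eps=1e-6):
    # Stack the Hessian entries per pixel and normalize by their pointwise
    # norm; this removes any remaining global depth-scale factor.
    hp = np.stack(hessian_entries(pred))
    hg = np.stack(hessian_entries(gt))
    hp = hp / (np.linalg.norm(hp, axis=0, keepdims=True) + eps)
    hg = hg / (np.linalg.norm(hg, axis=0, keepdims=True) + eps)
    return float(np.mean(np.abs(hp - hg)))
```

With this construction, a prediction of the form 2*d + 0.3*x + 0.1*y + 5 incurs (near-)zero loss against ground truth d, matching the invariance claimed in the abstract; the function names and normalization choice here are assumptions for illustration only.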
Original language: English
Title of host publication: 2021 International Conference on 3D Vision (3DV)
Number of pages: 11
ISBN (Electronic): 978-1-6654-2688-6
Publication status: Published - 2021
Publication type: A4 Article in conference proceedings
Event: International Conference on 3D Vision - London, United Kingdom
Duration: 1 Dec 2021 - 3 Dec 2021

Publication series

ISSN (Electronic): 2475-7888


Conference: International Conference on 3D Vision
Country/Territory: United Kingdom


Keywords

  • Image sensors
  • Deep learning
  • Three-dimensional displays
  • Neural networks
  • Estimation
  • Sensors
  • saliency detection
  • self-attention
  • normalized Hessian loss

Publication forum classification

  • Publication forum level 1


