Abstract: Vision-based 3D detection has drawn considerable attention recently. Despite the best efforts of computer vision researchers, it remains a largely unsolved problem, primarily because monocular images lack the accurate depth perception that LiDAR sensors provide. Previous works struggle to effectively fuse 3D spatial information with RGB image features. In this paper, we propose a novel monocular 3D detection framework to address this problem, with two primary contributions: (i) we design an Adaptive Depth-guided Instance Normalization layer that leverages depth features to guide RGB features toward high-quality estimation of 3D properties; (ii) we introduce a Dynamic Depth Transformation module that recovers more accurate depth through semantic context learning, thereby helping to resolve the depth ambiguities inherent in RGB images. Experiments show that our approach achieves state-of-the-art performance among monocular methods on the KITTI 3D detection benchmark.
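The abstract does not spell out the layer's internals, but one plausible reading of an Adaptive Depth-guided Instance Normalization layer is an AdaIN-style block in which depth features generate the affine parameters applied to instance-normalized RGB features. Below is a minimal PyTorch sketch under that assumption; the class and variable names (DepthGuidedIN, rgb_ch, depth_ch) are illustrative and do not come from the paper.

```python
import torch
import torch.nn as nn

class DepthGuidedIN(nn.Module):
    """AdaIN-style sketch: depth features modulate instance-normalized RGB features."""

    def __init__(self, rgb_ch: int, depth_ch: int):
        super().__init__()
        # Instance norm without learned affine; the scale/shift are
        # produced dynamically from the depth branch instead.
        self.norm = nn.InstanceNorm2d(rgb_ch, affine=False)
        self.pool = nn.AdaptiveAvgPool2d(1)       # global depth descriptor
        self.to_gamma = nn.Linear(depth_ch, rgb_ch)
        self.to_beta = nn.Linear(depth_ch, rgb_ch)

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        b = rgb_feat.shape[0]
        d = self.pool(depth_feat).flatten(1)        # (B, depth_ch)
        gamma = self.to_gamma(d).view(b, -1, 1, 1)  # per-channel scale
        beta = self.to_beta(d).view(b, -1, 1, 1)    # per-channel shift
        return (1 + gamma) * self.norm(rgb_feat) + beta

# Example usage with arbitrary feature shapes:
layer = DepthGuidedIN(rgb_ch=256, depth_ch=64)
rgb = torch.randn(2, 256, 40, 40)    # RGB backbone features
depth = torch.randn(2, 64, 40, 40)   # features from a depth estimator
out = layer(rgb, depth)              # -> (2, 256, 40, 40)
```

A spatially varying variant (predicting gamma and beta per pixel with 1x1 convolutions, SPADE-style) would be an equally reasonable interpretation; the global-pooling form above is simply the most direct analogue of instance normalization.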

