ContrastAlign: Toward Robust BEV Feature Alignment via Contrastive Learning for Multi-Modal 3D Object Detection

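In 3D object detection, fusing heterogeneous features from LiDAR and camera sensors into a unified bird's-eye-view (BEV) representation is a widely adopted approach. Existing methods, however, are often affected by imprecise sensor calibration, which causes feature misalignment in LiDAR-camera BEV fusion. These calibration errors also introduce depth-estimation errors in the camera branch, which further misalign the LiDAR and camera BEV features.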

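In this work, we propose ContrastAlign, a method that uses contrastive learning to strengthen the alignment of heterogeneous modalities and thereby make the fusion process more robust. Specifically, the L-Instance module directly outputs LiDAR instance features from the LiDAR BEV features. The C-Instance module then predicts camera instance features through RoI (region of interest) pooling on the camera BEV features. Finally, the InstanceFusion module uses contrastive learning to generate similar instance features across the two modalities, and graph matching computes the similarity between neighboring camera instance features and these similar instance features to complete the instance-feature alignment.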
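To make the contrastive step more concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over paired LiDAR and camera instance features. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact loss, so the InfoNCE form, the function name `info_nce`, the `temperature` parameter, and the assumption that row i of each input describes the same object are all hypothetical. Plain Python lists are used so the sketch has no dependencies.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(lidar_feats, camera_feats, temperature=0.1):
    """Symmetric InfoNCE loss over N paired instance features.

    Assumption: lidar_feats[i] and camera_feats[i] describe the same object,
    so they form the positive pair; every other combination is a negative.
    """
    lidar = [l2_normalize(f) for f in lidar_feats]
    camera = [l2_normalize(f) for f in camera_feats]
    n = len(lidar)

    # Cosine similarity between every LiDAR and camera instance, scaled by temperature.
    logits = [[dot(l, c) / temperature for c in camera] for l in lidar]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct "class" for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / n

    logits_t = [list(col) for col in zip(*logits)]  # camera-to-LiDAR direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy features: the loss should be lower when paired rows point the same way.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
print(aligned < swapped)
```

Minimizing a loss of this kind pulls the two modalities' features for the same object together and pushes features of different objects apart, which is the general effect the abstract attributes to contrastive learning.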

Our method achieves state-of-the-art performance on the nuScenes validation set, with an mAP of 70.3%, which is 1.8% higher than BEVFusion. Under misalignment noise, it outperforms BEVFusion by 7.3%.

In 3D object detection, fusing heterogeneous features from LiDAR and camera sensors into a unified Bird’s Eye View (BEV) representation is a widely adopted paradigm. However, existing methods are often compromised by imprecise sensor calibration, resulting in feature misalignment in LiDAR-camera BEV fusion. Moreover, such inaccuracies result in errors in depth estimation for the camera branch, ultimately causing misalignment between LiDAR and camera BEV features. In this work, we propose a novel ContrastAlign approach that utilizes contrastive learning to enhance the alignment of heterogeneous modalities, thereby improving the robustness of the fusion process. Specifically, our approach includes the L-Instance module, which directly outputs LiDAR instance features within LiDAR BEV features. Then, we introduce the C-Instance module, which predicts camera instance features through RoI (Region of Interest) pooling on the camera BEV features. We propose the InstanceFusion module, which utilizes contrastive learning to generate similar instance features across heterogeneous modalities. We then use graph matching to calculate the similarity between the neighboring camera instance features and the similar instance features to complete the alignment of instance features. Our method achieves state-of-the-art performance, with an mAP of 70.3%, surpassing BEVFusion by 1.8% on the nuScenes validation set. Importantly, our method outperforms BEVFusion by 7.3% under conditions with misalignment noise.
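The matching step can be illustrated with a small sketch that scores instance features by cosine similarity and pairs them one-to-one. This is a simplified stand-in, not the graph matching described in the paper, whose details the abstract does not give: the greedy strategy, the function name `match_instances`, and the `min_similarity` threshold are assumptions made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def match_instances(query_feats, candidate_feats, min_similarity=0.5):
    """Greedy one-to-one matching by descending cosine similarity.

    Returns a dict {query_index: candidate_index}. A query stays unmatched
    if no free candidate reaches min_similarity (a hypothetical threshold).
    """
    scored = sorted(
        (
            (cosine(q, c), i, j)
            for i, q in enumerate(query_feats)
            for j, c in enumerate(candidate_feats)
        ),
        key=lambda t: -t[0],
    )
    matches, used = {}, set()
    for sim, i, j in scored:
        if sim < min_similarity:
            break  # all remaining pairs are even less similar
        if i in matches or j in used:
            continue  # keep the assignment one-to-one
        matches[i] = j
        used.add(j)
    return matches


# Toy example: the candidates are listed in the opposite order to the queries.
print(match_instances([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]]))
```

Greedy matching is easy to follow but is not guaranteed to maximize total similarity; an optimal assignment solver or a true graph-matching formulation that also uses neighborhood structure would be the natural next step.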

https://arxiv.org/abs/2405.16873

https://arxiv.org/pdf/2405.16873.pdf
