GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction

3D semantic occupancy prediction aims to obtain the fine-grained 3D geometry and semantics of the surrounding scene and is an important task for the robustness of vision-centric autonomous driving. Most existing methods employ dense grids such as voxels as scene representations, which ignore the sparsity of occupancy and the diversity of object scales, and thus lead to an unbalanced allocation of resources. To address this, we propose an object-centric representation that describes 3D scenes with sparse 3D semantic Gaussians, where each Gaussian represents a flexible region of interest together with its semantic features. We aggregate information from images through the attention mechanism and iteratively refine the properties of the 3D Gaussians, including position, covariance, and semantics. We then propose an efficient Gaussian-to-voxel splatting method to generate 3D occupancy predictions, which aggregates only the neighboring Gaussians for a given position. We conduct extensive experiments on the widely adopted nuScenes and KITTI-360 datasets. Experimental results demonstrate that GaussianFormer achieves performance comparable to state-of-the-art methods with only 17.8%–24.8% of their memory consumption. Code is available at: this https URL.
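
The abstract describes two key ideas: sparse semantic Gaussians as the scene representation, and a Gaussian-to-voxel splatting step that accumulates only nearby Gaussians at each query position. Below is a minimal PyTorch sketch of that splatting idea, assuming each Gaussian carries a mean, a covariance matrix, and per-class semantic logits. The function name, tensor shapes, class count, and radius cutoff are illustrative assumptions; the paper's actual implementation uses an efficient localized CUDA kernel rather than this dense V×P computation.

```python
import torch

def gaussian_to_voxel_splat(means, covs, semantics, voxel_centers, radius=2.0):
    """Splat sparse semantic Gaussians onto voxel query positions.

    means:         (P, 3) Gaussian centers.
    covs:          (P, 3, 3) positive-definite covariance matrices.
    semantics:     (P, C) per-Gaussian semantic logits.
    voxel_centers: (V, 3) query positions (voxel centers).
    radius:        illustrative cutoff; only nearby Gaussians contribute.
    Returns (V, C) semantic logits per voxel.
    """
    inv_covs = torch.linalg.inv(covs)                        # (P, 3, 3)
    diff = voxel_centers[:, None, :] - means[None, :, :]     # (V, P, 3)
    # Gaussian density exponent: -0.5 * d^T Sigma^{-1} d
    maha = torch.einsum('vpi,pij,vpj->vp', diff, inv_covs, diff)
    weights = torch.exp(-0.5 * maha)                         # (V, P)
    # Locality: zero out Gaussians beyond the radius, mimicking the
    # neighbor-only aggregation described in the abstract.
    weights = weights * (diff.norm(dim=-1) <= radius)
    return weights @ semantics                               # (V, C)

# Toy usage with hypothetical sizes (e.g., 18 = 17 semantic classes + empty).
P, V, C = 64, 1000, 18
means = torch.randn(P, 3)
A = torch.randn(P, 3, 3) * 0.3
covs = A @ A.transpose(-1, -2) + 0.1 * torch.eye(3)  # keep positive definite
semantics = torch.randn(P, C)
voxel_centers = torch.rand(V, 3) * 4 - 2
occupancy = gaussian_to_voxel_splat(means, covs, semantics, voxel_centers)
classes = occupancy.argmax(dim=-1)                   # per-voxel class index
```

A practical implementation would find each voxel's neighboring Gaussians up front (e.g., with a spatial index) instead of computing the full V×P weight matrix and masking it; that locality is what gives the method its memory advantage over dense voxel representations.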

https://arxiv.org/abs/2405.17429

https://arxiv.org/pdf/2405.17429.pdf
