A Survey of 3D Object Detection for Autonomous Driving (Part 1)
Paper: [2206.09474] 3D Object Detection for Autonomous Driving: A Comprehensive Survey (arxiv.org)
This survey is clear and accessible, and it is widely regarded as foundational reading on autonomous driving and 3D object detection. It is well suited as introductory material for readers interested in these topics, helping them grasp the basic concepts behind the relevant algorithms.
Contents
1. Abstract
(1) Original Text
(2) Translation
(3) Keywords
2. Introduction
(1) Original Text
(2) Translation
3. Background
(1) Original Text
(2) Translation
(3) Notes
1. Abstract
(1) Original Text
Autonomous driving has garnered growing interest in recent years due to its potential benefits in alleviating drivers’ daily hassles and enhancing road safety. Within modern autonomous driving systems, the perception module stands as a crucial component, tasked with accurately assessing the surroundings and furnishing dependable observations for decision-making. The core functionality of 3D object detection is to predict the positions, dimensions, and classifications of nearby 3D objects relative to an autonomous vehicle; this technology forms an essential part of a perception system. This paper presents a comprehensive review of advancements in 3D object detection for autonomous driving. Initially, we outline the historical development trajectory and identify key challenges inherent to this domain. Subsequently, we conduct an extensive survey of methods organized by model architecture and by sensory input, covering LiDAR-based, camera-based, and multi-modal detection approaches, and we critically analyze the strengths and limitations associated with each methodological framework. Furthermore, we systematically examine practical implementations across diverse driving systems. Finally, we evaluate performance metrics across various 3D object detection algorithms while offering insights into historical trends and future research directions.
(2) Translation
In recent years, driven by its potential to relieve drivers' burden and improve driving safety, autonomous driving has attracted growing attention. The perception system is an indispensable component of modern autonomous driving systems and occupies a central position in the autonomous driving pipeline. To accurately estimate the state of the surrounding environment and generate reliable observations for the prediction and planning workflows, 3D object detection was developed; its main goal is to accurately predict the positions, dimensions, and categories of objects around the autonomous vehicle. Given its importance, the field of 3D object detection has gradually matured and received broad attention from both academia and industry. This paper reviews the development and main research directions of this field. Specifically, the first part briefly introduces the background of 3D object detection and discusses the main challenges of the task. We then comprehensively survey the technical progress of 3D object detection along two dimensions, model architecture and input sensor, covering the three major categories of LiDAR-based, camera-based, and multi-modal detection methods, and analyze the strengths and limitations of each. Next, we examine the concrete application scenarios of 3D object detection in real driving systems and how it is deployed there. Finally, we compare the performance of the various classes of 3D object detection algorithms in depth, summarize recent research hotspots, and offer an outlook on future directions.
(3) Keywords
3D object detection, perception, autonomous driving, deep learning, computer vision, robotics
2. Introduction
(1) Original Text
Autonomous driving aims to enable vehicles to perceive their surroundings intelligently through sensor systems and to drive safely with little or no human intervention. The technology has made remarkable progress in recent years and has been deployed in a wide range of scenarios, including self-driving trucks, robotaxis, and delivery robots, helping to reduce human error and improve road safety. As one of the core components of an autonomous driving system, the automotive perception system provides the vehicle with environmental information from multi-modal data, such as images captured by cameras, point clouds generated by LiDAR scanners, and high-definition maps. The perception system typically takes these multi-modal data as input and predicts the geometry and semantics of critical elements on the road; high-quality perception results then serve as reliable observations for downstream steps such as object tracking, trajectory prediction, and path planning.
Among these perception tasks within an automotive setup, 3D object detection stands as one of the most essential components required for effective scene interpretation.
The primary objective of 3D object detection is to identify the positions, dimensions, and categories of key entities such as vehicles, pedestrians, and cyclists within a three-dimensional spatial framework.
In comparison with traditional 2D object detection methods, which yield planar bounding boxes on image planes without accounting for actual distances relative to the vehicle's own position, 3D object detection focuses on accurately localizing and categorizing objects in real-world three-dimensional coordinates.
Significant advancements in 3D object detection have been observed since the integration of deep learning techniques into computer vision and robotics. Researchers have actively explored the problem from diverse perspectives, such as leveraging different sensory modalities or data representation schemes. However, these efforts have not been accompanied by a comprehensive comparison across categories, which limits our understanding of the strengths and weaknesses inherent in the various approaches; a thorough comparative analysis is therefore needed to provide meaningful insights for future research.
In order to thoroughly explore the 3D object detection methods for autonomous driving, we present a comprehensive review, examining and comparing various approaches across different categories. Unlike previous surveys (Arnold et al., 2019; Liang et al., 2021b; Qian et al., 2021b), our paper comprehensively covers recent advancements in this field, including 3D object detection from range images, self-/semi-/weakly-supervised 3D object detection, and 3D detection in end-to-end driving systems. Unlike prior surveys that focused solely on point cloud-based detection (Guo et al., 2020; Fernandes et al., 2021; Zamanakos et al., 2021), monocular image-based detection (Wu et al., 2020a; Ma et al., 2022), or multi-modal input-based detection (Wang et al., 2021h), our study systematically investigates all sensory types and most application scenarios. The key contributions of this study are outlined below:
We provide a comprehensive review of 3D object detection methods from different perspectives, covering LiDAR-based, camera-based, and multi-modal detection, as well as detection from temporal sequences, label-efficient detection, and the applications of 3D object detection in driving systems.
We classify 3D object detection approaches systematically and in an organized manner, present a comprehensive overview of these methods, and offer critical perspectives on the advantages and limitations of different categories of techniques.
Through an in-depth evaluation of both performance and speed metrics, we assess various 3D object detection methodologies. By examining the evolution of research trends over the years, we offer valuable insights into future research directions in 3D object detection.
The structure of this paper is as follows. Section 2 introduces the problem definition, datasets, and evaluation metrics of 3D object detection. Sections 3 through 6 systematically review and analyze methods based on LiDAR sensors, cameras, multi-sensor fusion, and Transformer architectures, respectively. Section 7 covers detection methods that exploit temporal data, and Section 8 discusses label-efficient detection techniques. Section 9 then addresses several critical problems of 3D object detection in driving systems. Finally, Section 10 analyzes speed and performance and discusses future research trends and directions. Figure 1 shows the hierarchical taxonomy of methods, together with a link to a continuously maintained project page.

(2) Translation
In recent years, a key goal in autonomous driving has been to enable vehicles to perceive intelligently and move safely in complex environments. The technology has been widely applied in practical scenarios such as self-driving trucks, service robots, and delivery robots, significantly reducing the risk of human error while improving road safety. As one of the core components, the automotive perception system helps the autonomous driving system understand its surroundings through multi-modal data inputs, including images from cameras, point clouds generated by LiDAR scanners, and precise map information. These high-quality data provide a reliable basis for subsequent key tasks such as object tracking, trajectory prediction, and path planning.
To fully understand the driving environment, the perception system must cover multiple vision-related tasks, including object detection and tracking, lane detection, and semantic segmentation. Among them, 3D object detection is a core module of the automotive perception system. It aims to identify the key attributes of objects in the real world, including their positions, dimensions, and categories. Compared with traditional 2D object detection, which only produces bounding boxes on the image plane and ignores actual distance information, 3D object detection places more emphasis on precise localization and classification. By reasoning over geometric data in real-world coordinates, it can directly compute the distances between the ego vehicle and other key objects, and thereby support path planning for safe driving.
With the advances of deep learning in computer vision and robotics, 3D object detection methods have evolved rapidly; researchers have tackled the problem from many perspectives, such as different sensory modalities and data representations, yet a comprehensive comparison across these categories of methods has been lacking.
To this end, we present a comprehensive review of 3D object detection methods for autonomous driving, analyzing and systematically comparing methods across different categories. Compared with existing surveys (Arnold et al., 2019; Liang et al., 2021b; Qian et al., 2021b), our paper broadly covers recent advances in the field, such as 3D object detection from range images, self-/semi-/weakly-supervised 3D object detection, and 3D detection in end-to-end driving systems. Unlike previous surveys that focus only on point cloud-based detection (Guo et al., 2020; Fernandes et al., 2021; Zamanakos et al., 2021), monocular image-based detection (Wu et al., 2020a; Ma et al., 2022), or multi-modal detection (Wang et al., 2021h), our paper systematically studies 3D object detection methods for all sensor types and most application scenarios. The main contributions of this work can be summarized as follows:
We provide a comprehensive review of 3D object detection methods from multiple angles, covering LiDAR-based, camera-based, and multi-modal perception, detection from temporal sequence data, label-efficient detection, and the key applications of 3D object detection in the development of autonomous driving systems.
We systematically organize and categorize 3D object detection methods, examine existing techniques from multiple dimensions, analyze the strengths and limitations of each category of methods, and offer insights into their potential and challenges.
We conduct a comprehensive performance and speed evaluation of 3D object detection methods, reveal the research trends over the years, and provide in-depth views on future directions.
This paper is organized as follows. Section 2 presents the problem definition, datasets, and evaluation metrics of 3D object detection. We then review and analyze 3D object detection methods based on LiDAR sensors (Section 3), cameras (Section 4), multi-sensor fusion (Section 5), and Transformer architectures (Section 6). Section 7 introduces detection methods that leverage temporal data, and Section 8 presents detection with limited labels. Section 9 discusses the critical problems of 3D object detection in driving systems. Finally, Section 10 evaluates speed and performance and explores future research directions.
3. Background
(1) Original Text
2.1 What is 3D Object Detection?
Problem definition. 3D object detection targets the prediction of bounding boxes for 3D objects in driving scenarios based on sensor data. A general formula of 3D object detection can be mathematically formulated as
B = f_det(I_sensor)
where B = {B_1, · · · , B_N} is a set of N 3D objects in a scene, f_det is a 3D object detection model, and I_sensor is one or more sensory inputs. How to represent a 3D object B_i is a crucial problem in this task, since it determines what 3D information should be provided for the following prediction and planning steps. In most cases, a 3D object is represented as a 3D cuboid that contains this object, that is
B_i = [x_c, y_c, z_c, l, w, h, θ, class]
where (x_c, y_c, z_c) denotes the 3D coordinates of the cuboid's center; l, w, and h are the length, width, and height of the cuboid respectively; θ is the heading angle of the cuboid on the ground plane; and class denotes the object category, e.g. car, truck, or pedestrian. In Caesar et al. (2020), additional parameters v_x and v_y are introduced to describe an object's velocity along the x and y axes on the ground.
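To make this parameterization concrete, here is a minimal Python sketch of such a cuboid and the computation of its eight corners. The class layout and the assumption that boxes rotate only around the vertical axis (the usual gravity-aligned setting in driving scenes) are our own illustrative choices, not the survey's:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Box3D:
    """A 3D cuboid B = [x_c, y_c, z_c, l, w, h, theta, cls]."""
    x_c: float   # center coordinates (meters)
    y_c: float
    z_c: float
    l: float     # length, width, height (meters)
    w: float
    h: float
    theta: float  # heading angle on the ground plane (radians)
    cls: str      # object category, e.g. "car"

    def corners(self) -> np.ndarray:
        """Return the 8 box corners as an (8, 3) array.

        Assumes a gravity-aligned box rotated only around the
        vertical (z) axis by theta.
        """
        # Axis-aligned offsets from the center before rotation.
        dx = self.l / 2 * np.array([1, 1, 1, 1, -1, -1, -1, -1])
        dy = self.w / 2 * np.array([1, 1, -1, -1, 1, 1, -1, -1])
        dz = self.h / 2 * np.array([1, -1, 1, -1, 1, -1, 1, -1])
        # Rotate the ground-plane offsets by the heading angle.
        c, s = np.cos(self.theta), np.sin(self.theta)
        x = c * dx - s * dy + self.x_c
        y = s * dx + c * dy + self.y_c
        z = dz + self.z_c
        return np.stack([x, y, z], axis=1)
```

Representing a box by center, size, and yaw keeps the regression targets compact; the corners are derived on demand, e.g. for IoU computation or visualization.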
Sensory inputs. Many types of sensors can supply raw data for 3D object detection; among them, radar, cameras, and LiDAR sensors are the most commonly used. Radar offers a long detection range and is resilient to varying weather conditions. Cameras are low-cost, easily accessible, and versatile in capturing visual information: each camera produces an RGB image Icam ∈ R^{W×H×3}, where W and H denote the image width and height respectively. However, cameras primarily capture appearance information and do not directly provide the 3D structure of a scene. Since 3D object detection demands precise spatial localization, which requires depth, depth inferred from camera images typically contains significant errors. Furthermore, image-based detection is sensitive to environmental factors: detecting objects from images captured in low light or fog is considerably harder than under ideal weather. Radar, by directly measuring distance with electromagnetic waves, avoids some of these limitations.
As an alternative solution, LiDAR sensors acquire detailed 3D scene structure by emitting laser beams and analyzing their reflections. A LiDAR sensor that emits m beams and performs n measurements per beam within a single scan cycle produces a range image I_range ∈ R^{m×n×3}, where each pixel encodes the distance r, azimuth angle α, and inclination angle φ in spherical coordinates, alongside reflectivity. Range images are the raw data captured by LiDAR sensors and can be transformed into point clouds by converting spherical coordinates to Cartesian coordinates. A point cloud is defined as I_point ∈ R^{N×3}, where N denotes the number of points in a scene and each point carries its (x, y, z) coordinates. Both range images and point clouds offer precise 3D information directly captured by LiDAR sensors. In contrast to cameras, LiDAR systems excel at detecting objects in three-dimensional space and are less susceptible to variations in time of day and weather. However, due to their higher cost compared to cameras, LiDAR technology's widespread application in automotive environments remains constrained. Refer to Figure 2 for an illustration of 3D object detection in driving scenarios.
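The spherical-to-Cartesian conversion described above can be sketched as follows; the angle conventions (inclination φ measured from the horizontal plane, azimuth α around the vertical axis) and the channel ordering are assumptions for illustration, and real sensors and datasets may differ:

```python
import numpy as np

def range_image_to_point_cloud(range_image: np.ndarray) -> np.ndarray:
    """Convert a LiDAR range image (m, n, 3) to a point cloud (N, 3).

    Each pixel stores (r, alpha, phi): distance, azimuth angle, and
    inclination angle in spherical coordinates. We assume phi is
    measured from the horizontal plane and alpha around the vertical
    axis; real sensors may use different conventions.
    """
    r = range_image[..., 0]
    alpha = range_image[..., 1]
    phi = range_image[..., 2]
    x = r * np.cos(phi) * np.cos(alpha)
    y = r * np.cos(phi) * np.sin(alpha)
    z = r * np.sin(phi)
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[r.reshape(-1) > 0]  # drop pixels with no return
```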

Analysis: comparisons with 2D object detection. Traditional 2D object detection generates axis-aligned bounding boxes on images, and many of its design ideas (e.g., proposal generation and refinement, anchors, non-maximum suppression) have inspired analogous solutions in 3D. However, modern 3D object detectors are not merely naive extensions of their 2D counterparts into three-dimensional space. (1) 3D methods must handle heterogeneous, irregular data: detection from point clouds requires tailored operators and networks for irregular point data, and detection from fused point clouds and images requires special fusion mechanisms. (2) 3D methods typically leverage multiple distinct projected views: unlike 2D methods that detect from the perspective view alone, 3D approaches consider views such as the bird's-eye view, point view, and cylindrical view to ensure comprehensive object recognition. (3) Accurate localization in 3D space is a significantly harder requirement than in 2D: even a decimeter-level positioning error can cause a detection failure for small objects like pedestrians or cyclists, whereas in 2D a localization error of several pixels may still yield a high Intersection over Union (IoU) between predicted and ground-truth boxes. Thus, precise geometric information is indispensable for both point cloud-based and image-based 3D object detection.
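A toy computation makes point (3) concrete. The box sizes and error magnitudes below are illustrative assumptions, and both boxes are kept axis-aligned for simplicity (real benchmarks use rotated 3D IoU):

```python
def axis_aligned_iou(b1, b2):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    iy = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = ix * iy
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter)

# A pedestrian footprint in bird's-eye view: roughly 0.8 m x 0.8 m.
gt_bev   = (0.0, 0.0, 0.8, 0.8)
pred_bev = (0.3, 0.0, 1.1, 0.8)            # 0.3 m localization error
print(axis_aligned_iou(gt_bev, pred_bev))   # ~0.45: below a typical 0.5 threshold

# The same object in an image: say a 50 x 100 pixel 2D box.
gt_img   = (0, 0, 50, 100)
pred_img = (4, 0, 54, 100)                  # 4-pixel localization error
print(axis_aligned_iou(gt_img, pred_img))   # ~0.85: still a high IoU
```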
Analysis: comparisons with indoor 3D object detection
There is also a rich body of work on indoor 3D object detection (Qi et al., 2018, 2019, 2020; Liu et al., 2021d). Indoor methods detect 3D objects from point clouds or images, building on datasets such as ScanNet (Dai et al., 2017) and SUN RGB-D (Song et al., 2015), in which indoor 3D scenes are reconstructed and annotated. However, 3D object detection in driving scenarios faces challenges distinct from indoor settings. First, the point cloud distributions differ: in indoor scenes, points are relatively evenly distributed over scanned surfaces and most objects receive a sufficient number of points, whereas in driving scenarios points are dense near the LiDAR sensor and become sparse with distance, so detectors must handle far-away objects observed by only a few points. Second, driving demands real-time inference to avoid accidents, so detection methods must be computationally efficient; many indoor approaches cannot be applied directly because they do not meet this requirement.
2.2 Datasets
A large number of driving datasets have been built to provide multi-modal sensory data and 3D annotations for 3D object detection. Table 1 lists the datasets that collect data in driving scenarios and provide 3D cuboid annotations. KITTI (Geiger et al., 2012) is a pioneering work that proposes a standard data collection and annotation paradigm: equipping a vehicle with cameras and LiDAR sensors, driving the vehicle on roads for data collection, and annotating 3D objects from the collected data. The following works made improvements mainly from four aspects. (1) Increasing the scale of data. Compared to Geiger et al. (2012), the recent large-scale datasets (Sun et al., 2020c; Caesar et al., 2020; Mao et al., 2021b) have more than 10x point clouds, images and annotations. (2) Improving the diversity of data. Geiger et al. (2012) only contains driving data obtained in the daytime and in good weather, while recent datasets (Choi et al., 2018; Chang et al., 2019; Pham et al., 2020; Caesar et al., 2020; Sun et al., 2020c; Xiao et al., 2021; Mao et al., 2021b; Wilson et al., 2021) provide data captured at night or in rainy days. (3) Providing more annotated categories. Some datasets (Liao et al., 2021; Xiao et al., 2021; Geyer et al., 2020; Wilson et al., 2021; Caesar et al., 2020) provide more fine-grained object classes, including animals, barriers, traffic cones, etc. They also provide fine-grained sub-categories of existing classes, e.g. the adult and child sub-categories of the pedestrian class in Caesar et al. (2020). (4) Providing data of more modalities. In addition to images and point clouds, recent datasets provide more data types, including high-definition maps (Kesten et al., 2019; Chang et al., 2019; Sun et al., 2020c; Wilson et al., 2021), radar data (Caesar et al., 2020), long-range LiDAR data (Weng et al., 2020; Wang et al., 2021j), thermal images (Choi et al., 2018).
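As a concrete example of what 3D cuboid annotations look like on disk, here is a minimal reader for KITTI's per-object label lines, following the field order documented in the KITTI devkit; treat it as an illustrative sketch rather than an official parser:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class KittiLabel:
    """One object annotation from a KITTI label file."""
    type: str          # e.g. "Car", "Pedestrian", "Cyclist"
    truncated: float   # fraction of the object leaving image boundaries
    occluded: int      # 0 = fully visible ... 3 = unknown
    alpha: float       # observation angle of the object
    bbox: tuple        # 2D box: (left, top, right, bottom) in pixels
    dimensions: tuple  # 3D size: (height, width, length) in meters
    location: tuple    # 3D location (x, y, z) in camera coordinates
    rotation_y: float  # yaw angle around the camera y axis

def parse_kitti_labels(path: str) -> List[KittiLabel]:
    labels = []
    with open(path) as f:
        for line in f:
            v = line.split()
            labels.append(KittiLabel(
                type=v[0],
                truncated=float(v[1]),
                occluded=int(v[2]),
                alpha=float(v[3]),
                bbox=tuple(map(float, v[4:8])),
                dimensions=tuple(map(float, v[8:11])),
                location=tuple(map(float, v[11:14])),
                rotation_y=float(v[14]),
            ))
    return labels
```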
Analysis: future prospects of driving datasets. The research community has witnessed an explosion of datasets for 3D object detection in autonomous driving, which raises the question of what the next generation of driving datasets will look like. Considering that 3D object detection is not an isolated task but a component of a driving system, we expect future datasets to include all the important tasks in autonomous driving, namely perception, prediction, planning, and mapping, as a whole and in an end-to-end manner, so that the development and evaluation of 3D object detection methods can be considered from a holistic and systematic perspective. Some datasets (Sun et al., 2020c; Caesar et al., 2020; Yogamani et al., 2019) are working toward this goal.
2.3 Evaluation Metrics
A variety of evaluation metrics have been developed to assess the performance of 3D object detection methods. These metrics fall into two groups. The first group extends the Average Precision (AP) metric of 2D object detection (Lin et al., 2014) from the image plane to three-dimensional space:
AP = ∫₀¹ p(r) dr
where p(r) is the precision-recall curve, the same as in Lin et al. (2014). The primary distinction from the 2D AP metric lies in how ground truths and predictions are matched when computing precision and recall. KITTI (Geiger et al., 2012) introduced two widely used AP metrics: AP_3D and AP_BEV. AP_3D matches a predicted object to a ground-truth object if their 3D Intersection over Union (3D IoU) exceeds a predetermined threshold, while AP_BEV matches them based on the IoU of the two cuboids from the bird's-eye view (BEV IoU). NuScenes (Caesar et al., 2020) defines AP_center, in which a prediction is matched to a ground truth if their center distance falls below a threshold; the NuScenes Detection Score (NDS) further incorporates size, heading, and velocity errors. Waymo (Sun et al., 2020c) uses the Hungarian algorithm to match predictions and ground truths and introduces an AP variant that weights heading errors into the calculation.
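The following sketch shows how an AP with nuScenes-style center-distance matching could be computed. The function names, the greedy matching loop, and the 2 m threshold are our illustrative choices; official toolkits additionally use interpolated precision and per-class thresholds:

```python
import numpy as np

def match_by_center_distance(pred_centers, gt_centers, thresh=2.0):
    """Center-distance matching: a prediction is a true positive if its
    center lies within `thresh` meters of a still-unmatched ground truth.
    Assumes predictions are already sorted by descending confidence."""
    used, is_tp = set(), []
    for p in pred_centers:
        dists = [np.linalg.norm(np.asarray(p) - np.asarray(g))
                 if i not in used else np.inf
                 for i, g in enumerate(gt_centers)]
        j = int(np.argmin(dists)) if dists else -1
        if j >= 0 and dists[j] < thresh:
            used.add(j)
            is_tp.append(True)
        else:
            is_tp.append(False)
    return is_tp

def average_precision(is_tp, num_gt):
    """Approximate AP = integral of the precision-recall curve p(r),
    using a rectangular rule over recall increments."""
    tp = np.asarray(is_tp, dtype=float)
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    recall = cum_tp / num_gt
    return float(np.sum(precision * np.diff(np.concatenate([[0.0], recall]))))
```

Swapping the matching function for a 3D IoU test recovers the KITTI-style AP_3D under the same AP computation.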
The second group of metrics seeks to evaluate detection from a more practical standpoint. The underlying idea is that the quality of 3D object detection should align with downstream tasks such as motion planning, so that the best detection methods are those that most benefit these tasks and thus driving safety in real-world applications. Toward this goal, PKL (Philion et al., 2020) measures the KL-divergence between the ego vehicle's planned future states computed from predicted detections and from ground-truth detections. SDE (Deng et al., 2021a) instead uses the minimal distance from an object's boundary to the ego vehicle as the support distance and measures the support distance error.
Analysis: pros and cons of different evaluation metrics
(2) Translation
2.1 What is 3D Object Detection?
Problem definition: 3D object detection aims to identify and localize 3D objects in driving scenarios from sensor data by predicting their bounding boxes. Its general mathematical formulation can be written as
B = f_det(I_sensor)
where B = {B_1, · · · , B_N} is the set of N 3D objects in a scene, f_det is the 3D object detection model, and I_sensor is one or more sensory inputs. How to represent a 3D object B_i is a key problem in this task, since it determines what 3D information is provided for the subsequent prediction and planning steps. In most cases, a 3D object is represented as a 3D cuboid that contains the object, i.e.
B_i = [x_c, y_c, z_c, l, w, h, θ, class]
where (x_c, y_c, z_c) denotes the spatial center of the cuboid; l, w, and h are the cuboid's length, width, and height respectively; the heading angle θ describes the cuboid's orientation on the ground plane; and class is the 3D object category, such as car, truck, or pedestrian. Caesar et al. (2020) additionally introduce two auxiliary parameters, v_x and v_y, describing the velocity components along the ground's x and y axes.
Sensory inputs: Many types of sensors can provide raw data for 3D object detection. Among them, radar, cameras, and LiDAR (light detection and ranging) sensors are the three most widely adopted. In practice, radar offers a long detection range and adapts well to changing weather conditions; cameras are economical and easy to use, and play an important role in understanding scene semantics (such as the type of a traffic sign). A camera produces images Icam ∈ R^{W×H×3} that can be used for 3D object detection. Its limitations, however, are twofold: first, cameras capture only appearance information and cannot directly acquire the overall 3D structure of a scene; second, extracting 3D information from images is error-prone and easily affected by extreme weather and lighting conditions. Building a reliable autonomous driving system on cameras alone therefore remains challenging.
As an alternative, LiDAR sensors achieve fine-grained 3D reconstruction of a scene by emitting laser beams and measuring their reflections. A LiDAR sensor that emits m beams and performs n measurements per scan cycle produces a range image I_range ∈ R^{m×n×3}, in which each pixel records the distance r, azimuth angle α, and inclination angle φ in spherical coordinates, together with the corresponding reflection intensity; these values form the basic content of the range image. Range images are the raw data actually collected by the LiDAR sensor; converting spherical coordinates to Cartesian coordinates yields a point cloud I_point ∈ R^{N×3}, where N is the number of points in the scene. The point cloud preserves the geometric structure of the original 3D space, and compared with camera systems it offers higher accuracy and reliability: LiDAR perceives objects in 3D space more precisely and is far less affected by changes in time of day or weather, which gives this technical route a clear advantage.
Analysis: comparison with 2D object detection. 3D object detection methods borrow many design ideas from their 2D counterparts, including proposal generation and refinement, anchors, and non-maximum suppression. However, there are significant differences in perception. (1) 3D detectors must cope with irregular data forms: accurate detection from point clouds requires operators and network structures designed specifically for them, and detection from combined point clouds and images requires special fusion modules to integrate the information effectively. (2) 3D detectors usually rely on multiple projected views to recognize objects: unlike 2D methods based on the perspective view alone, 3D methods must also consider the bird's-eye view and other projections. (3) 3D detection imposes much stricter requirements on localization accuracy: even a decimeter-level positioning error can cause small objects such as pedestrians or cyclists to be missed, whereas 2D detection can tolerate pixel-level errors while still maintaining a high Intersection over Union (IoU) between predicted and ground-truth boxes. Obtaining accurate spatial geometric information is therefore a key element of 3D object detection.
Analysis: comparison with indoor 3D object detection. Indoor 3D object detection is an important related research direction (Qi et al., 2018, 2019, 2020; Liu et al., 2021d). It detects 3D objects from point clouds or images, building on datasets such as ScanNet (Dai et al., 2017) and SUN RGB-D (Song et al., 2015), in which 3D scenes are reconstructed and annotated. In contrast, 3D object detection in driving scenarios faces unique challenges. (1) The point clouds produced by LiDAR and by RGB-D sensors are distributed very differently: in indoor scans, points lie fairly evenly on the scanned surfaces and capture nearby objects well, whereas in driving scenes most points lie in the dense region close to the LiDAR sensor, so effectively handling sparse, unevenly distributed points on distant objects becomes a key problem. (2) Autonomous driving imposes stringent real-time requirements: to ensure safety, the perception system must run with low latency to avoid potential accidents, so research in this area must emphasize algorithmic efficiency and meet high-performance computing demands.
2.2 Datasets
A large number of driving datasets have been built, providing multi-modal sensor data and 3D annotations. Table 1 lists the datasets that collect data in driving scenarios and provide 3D cuboid annotations. KITTI (Geiger et al., 2012) is a pioneering work that established a unified paradigm for data collection and annotation: equip a vehicle with cameras, LiDAR, and other sensors, drive it on roads to collect data, and annotate the 3D objects in the collected data. Subsequent work improved on this mainly in four respects. (1) Larger scale: compared with Geiger et al. (2012), recent large-scale datasets (Sun et al., 2020c; Caesar et al., 2020; Mao et al., 2021b) contain more than 10x the point clouds, images, and annotations. (2) Greater diversity: while earlier work only covered daytime data in good weather, recent datasets (Choi et al., 2018; Chang et al., 2019; Pham et al., 2020; Caesar et al., 2020; Sun et al., 2020c; Xiao et al., 2021; Mao et al., 2021b; Wilson et al., 2021) provide data captured at night or on rainy days. (3) Richer category annotations: several datasets (Liao et al., 2021; Xiao et al., 2021; Geyer et al., 2020; Wilson et al., 2021; Caesar et al., 2020) add fine-grained object classes such as animals, barriers, and traffic cones, and refine existing classes into sub-categories, e.g. splitting the pedestrian class into adult and child. (4) More modalities: beyond conventional images and point clouds, recent datasets introduce additional sensing types, including high-definition maps (Kesten et al., 2019; Chang et al., 2019; Sun et al., 2020c; Wilson et al., 2021), radar data (Caesar et al., 2020), long-range LiDAR data (Weng et al., 2020; Wang et al., 2021j), and thermal images (Choi et al., 2018).
Analysis: future prospects of driving datasets. The research community has witnessed an explosion of 3D object detection datasets for autonomous driving scenarios, and one may ask what the next generation of driving datasets will look like. Given that 3D object detection is not an isolated task but a component of a driving system, we expect future datasets to cover all the important tasks in autonomous driving, namely perception, prediction, planning, and mapping, as a whole and in an end-to-end manner, so that the development and evaluation of 3D object detection methods can be considered from a holistic, systemic perspective. Some datasets (Sun et al., 2020c; Caesar et al., 2020; Yogamani et al., 2019) are already working toward this goal.
2.3 Evaluation Metrics
A variety of evaluation metrics have been proposed to measure the capability of 3D object detection methods. They can be divided into two main groups. The first group extends the standard Average Precision (AP) metric of 2D object detection, which traces back to Lin et al. (2014), to three-dimensional space:
AP = ∫₀¹ p(r) dr
where p(r) is the precision-recall curve, consistent with Lin et al. (2014). The core difference between these metrics lies in the criterion used to match ground-truth objects with predictions when computing precision and recall. KITTI (Geiger et al., 2012) proposed two widely used AP metrics: AP_3D and AP_BEV. AP_3D matches a prediction to its ground truth when the 3D intersection over union (3D IoU) of the two cuboids exceeds a threshold, whereas AP_BEV matches them based on the IoU of the two cuboids in the bird's-eye view (BEV IoU). NuScenes (Caesar et al., 2020) proposed a new metric called AP_center, based on center distance, together with the NuScenes Detection Score (NDS) that additionally accounts for size, heading, and velocity errors; Waymo (Sun et al., 2020c) proposed a new matching method based on the Hungarian algorithm with heading-weighted AP.
The second group of approaches addresses evaluation from the perspective of practical application. The idea is that the quality of 3D object detection is closely tied to its downstream tasks (such as motion planning), so the best detection method should be the one that most helps the planner guarantee driving safety in practice. PKL (Philion et al., 2020) computes the KL-divergence between the ego vehicle's planned future states derived from predicted detections and those derived from ground-truth detections. SDE (Deng et al., 2021a) uses the distance from an object's boundary to the closest point of the ego vehicle as the support distance and quantifies the support distance error.
Analysis: pros and cons of different evaluation metrics
(3) Notes
3D object detection can be viewed as a function:
B = f_det(I_sensor)
Here f_det is the detection model. Its input is the data streams collected by various sensors, serving as the basic sources of perceptual information, such as the spatial distribution of the environment obtained from LiDAR scans and the images captured in real time by cameras. Its output is a set of N independent objects, each described by its 3D position (x, y, z), its length l, width w, and height h, together with auxiliary attributes such as the heading angle θ relative to the viewing direction,
B_i = (x_c, y_c, z_c, l, w, h, θ)
and the object category class.
[Figure: different sensor types and their corresponding data representations]
The figure above illustrates the different types of input sensors and the forms of data they produce. A conventional camera outputs two-dimensional, planar image data, whereas a LiDAR acquires spatial information through periodic scanning; after converting from spherical to Cartesian coordinates, the scan yields a complete point cloud. This 3D representation effectively captures the details of object surfaces, and the resulting point cloud typically consists of a large number of discrete points.

In 2D object detection, it is usually sufficient to annotate an object's position in the image to complete the task; the object's orientation and height are not considered. 3D object detection, by contrast, does take these details into account.

As can be seen, a 3D detection box uses additional parameters to represent an object's height and heading angle.
To be continued... Subscribe to the column to get updates as they are posted!
