
Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net


Motivation

This paper uses a single, fairly shallow network to jointly perform 3D object detection, tracking and motion forecasting, representing the scene in bird's-eye view (BEV). Input: a 4D tensor (X, Y, Z, T). Output: N BEV maps with forecasted bounding boxes.

Notes

This approach can lead to catastrophic consequences, as subsequent processes are unable to correct or recover from errors that emerge early in the pipeline. – In a cascaded pipeline, downstream modules cannot fix errors made by the upstream modules.

We believe this perspective is crucial as tracking and prediction capabilities can play a pivotal role in improving object detection. Specifically, by integrating tracking and prediction data, we are able to effectively minimize false negatives in scenarios involving occluded or distant objects. Additionally, false positives can be mitigated through the systematic accumulation of evidence over time.
– The end-to-end advantage: detection, tracking and prediction can corroborate one another.

– The bounding-box prediction scheme in this paper follows:
SSD: Single shot multibox detector

The paper does not use 3D convolutions; instead it applies 2D convolutions, as in: Multi-view 3D object detection network for autonomous driving.

Voxel Representation


We then assign a binary label to each voxel indicating whether it is occupied. Instead of 3D convolutions, we perform 2D convolutions and treat the height dimension as the channel dimension.

– [cc] A fairly classic way of handling this.
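[cc] A minimal sketch of this voxelization in Python, assuming a BEV range and 0.2 m resolution that I picked for illustration (the paper's exact grid parameters may differ):

```python
import numpy as np

def voxelize_bev(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0),
                 z_range=(-3.0, 1.0), res=0.2):
    """Turn an Nx3 LiDAR point cloud into a binary occupancy grid.

    The height (z) bins are kept as the channel dimension, so a plain 2D CNN
    can consume the result as an image of shape (Z, Y, X).
    """
    nz = int(round((z_range[1] - z_range[0]) / res))
    ny = int(round((y_range[1] - y_range[0]) / res))
    nx = int(round((x_range[1] - x_range[0]) / res))
    grid = np.zeros((nz, ny, nx), dtype=np.float32)

    ix = np.floor((points[:, 0] - x_range[0]) / res).astype(int)
    iy = np.floor((points[:, 1] - y_range[0]) / res).astype(int)
    iz = np.floor((points[:, 2] - z_range[0]) / res).astype(int)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny) & (iz >= 0) & (iz < nz)
    grid[iz[keep], iy[keep], ix[keep]] = 1.0  # occupied voxel -> 1, empty stays 0
    return grid
```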

Adding Temporal Information


We take the 3D points from each of the preceding n frames and transform their coordinates so that they are expressed in the current vehicle coordinate system.

[cc] Earlier frames are transformed into the coordinate system of the current frame. The paper does not explain how this transformation is done! A simple approach I can think of: derive R/T from the current vehicle state (6 DoF) and use it to map the previous frames back.

We can then stack multiple frames along the newly introduced temporal dimension, yielding a 4D tensor.
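[cc] A rough sketch of the transform-and-stack step, assuming each frame comes with a 4×4 world-from-vehicle pose matrix (my own interface; the paper does not spell this out). It reuses the voxelize_bev sketch above:

```python
import numpy as np

def to_current_frame(pts_past, T_world_from_past, T_world_from_curr):
    """Re-express Nx3 points from a past vehicle frame in the current one."""
    T = np.linalg.inv(T_world_from_curr) @ T_world_from_past  # current <- past
    homo = np.hstack([pts_past, np.ones((len(pts_past), 1))])  # Nx4 homogeneous
    return (homo @ T.T)[:, :3]

def build_4d_tensor(point_clouds, poses, voxelize):
    """Stack n frames into a (T, Z, Y, X) occupancy tensor.

    point_clouds: list of Nx3 arrays, oldest first, the last one is the current frame.
    poses:        matching list of 4x4 world-from-vehicle matrices.
    voxelize:     a function such as voxelize_bev above.
    """
    T_curr = poses[-1]
    frames = [voxelize(to_current_frame(pc, T_past, T_curr))
              for pc, T_past in zip(point_clouds, poses)]
    return np.stack(frames, axis=0)
```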

Model Formulation

The 4D input tensor is fed to the network, which directly regresses object bounding boxes at the different timestamps; no region proposals are used.

  • Early Fusion

[CC] The backbone is still a VGG16, with some layers trimmed.

We first apply a one-dimensional convolution with kernel size n on the temporal dimension, reducing its length from n down to 1.

– A single 1D conv fuses the temporal sequence.
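[CC] A minimal PyTorch sketch of early fusion, realizing the size-n temporal convolution as a 3D conv with an (n, 1, 1) kernel; the backbone below is just a placeholder for the trimmed VGG16:

```python
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Collapse the temporal axis once, up front, then run a 2D BEV network."""
    def __init__(self, n_frames: int, height_channels: int):
        super().__init__()
        # Input laid out as (B, Z, T, Y, X); an (n, 1, 1) kernel with no temporal
        # padding reduces T from n to 1 in a single step.
        self.temporal = nn.Conv3d(height_channels, height_channels,
                                  kernel_size=(n_frames, 1, 1))
        self.backbone = nn.Sequential(        # stand-in for the trimmed VGG16
            nn.Conv2d(height_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        )

    def forward(self, x):            # x: (B, Z, T, Y, X)
        x = self.temporal(x)         # -> (B, Z, 1, Y, X)
        x = x.squeeze(2)             # -> (B, Z, Y, X)
        return self.backbone(x)      # BEV feature map
```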

  • Late Fusion

Instead, late fusion applies two layers of 3D convolution with 3 × 3 × 3 kernels and no padding along the temporal dimension, gradually reducing the temporal dimension from n to 1.
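[CC] A sketch of the late-fusion counterpart, assuming n = 5 input frames so that two temporally un-padded 3×3×3 convolutions shrink the temporal axis 5 → 3 → 1; channel widths are my own placeholders:

```python
import torch.nn as nn

class LateFusion(nn.Module):
    """Merge temporal information gradually with two un-padded 3D convs."""
    def __init__(self, height_channels: int, mid_channels: int = 64):
        super().__init__()
        # padding=(0, 1, 1): keep the spatial size, shrink the temporal size by 2 per layer.
        self.fuse = nn.Sequential(
            nn.Conv3d(height_channels, mid_channels, kernel_size=3, padding=(0, 1, 1)),
            nn.ReLU(),
            nn.Conv3d(mid_channels, mid_channels, kernel_size=3, padding=(0, 1, 1)),
            nn.ReLU(),
        )

    def forward(self, x):        # x: (B, Z, T=5, Y, X)
        x = self.fuse(x)         # -> (B, C, T=1, Y, X)
        return x.squeeze(2)      # -> (B, C, Y, X), ready for the 2D detection head
```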

Motion Forecasting


Following Figure 5, we incorporate two distinct convolutional branches into our network architecture. The first branch is designed for binary classification, aiming to estimate the likelihood of an object being a vehicle. The second branch extends predictions beyond the immediate frame by forecasting bounding boxes for not only the current frame but also up to n−1 subsequent frames.

[cc] The two branch networks are not described in any detail here.

Following SSD [17], we assign six predefined bounding boxes to each spatial position of the feature map, denoted a[k,i,j], where i is the row index (ranging from 1 to I), j is the column index (ranging from 1 to J), and k indexes the K predefined box configurations.

[CC] Strong déjà vu of early YOLO.
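[CC] A rough sketch of such a two-branch head on top of the fused BEV features, assuming K = 6 predefined boxes, n output timestamps (current frame plus n−1 forecasts), and 6 regression values per box (l_x, l_y, s_w, s_h, a_sin, a_cos); the 1×1 convs are my own simplification:

```python
import torch
import torch.nn as nn

class DetectionForecastHead(nn.Module):
    """Per-location vehicle probability plus box parameters for n timestamps."""
    def __init__(self, in_channels: int, num_boxes: int = 6, n_timestamps: int = 5):
        super().__init__()
        self.cls_branch = nn.Conv2d(in_channels, num_boxes, kernel_size=1)
        self.reg_branch = nn.Conv2d(in_channels, num_boxes * n_timestamps * 6,
                                    kernel_size=1)

    def forward(self, feat):                           # feat: (B, C, I, J)
        scores = torch.sigmoid(self.cls_branch(feat))  # (B, K, I, J) vehicle probability
        boxes = self.reg_branch(feat)                   # (B, K*n*6, I, J) offsets per box/timestamp
        return scores, boxes
```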

Notice that we do not use predefined heading angles.

[CC] The heading of the BBOX is regressed as a predicted value rather than taken from predefined angles.

For each predefined box a[k,i,j], the network predicts the corresponding normalized location offsets l_x, l_y, the log-normalized sizes s_w, s_h, and the heading parameters a_sin and a_cos.

When detections from the current frame overlap with forecasts made in past frames, they are regarded as the same object and their bounding boxes are simply averaged.
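[CC] A toy sketch of this decode-time association, assuming axis-aligned BEV boxes given as (x1, y1, x2, y2) corners and an IoU threshold that I chose myself:

```python
import numpy as np

def bev_iou(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def merge_with_forecasts(current_dets, past_forecasts, thr=0.5):
    """Average each current detection with the past forecasts it overlaps."""
    merged = []
    for det in current_dets:
        matches = [f for f in past_forecasts if bev_iou(det, f) > thr]
        if matches:
            det = np.mean(np.vstack([det] + matches), axis=0)  # simple average of boxes
        merged.append(det)
    return merged
```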


Loss Function and Training

(equation: overall detection-and-forecasting loss)

where t is the current frame and w represents the model parameters.
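[CC] Schematically, the objective sums a classification term and regression terms over the current frame and the n−1 forecast frames, something along the lines of (α is a balancing weight I added for illustration; the paper gives the exact form):

$$\min_{w}\;\; \ell_{cla}(w) \;+\; \alpha \sum_{t} \ell_{reg}^{\,t}(w)$$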

For classification we use the binary cross-entropy H, computed over all feature-map locations and predefined boxes.

(equation: binary cross-entropy classification loss)

Here i, j, k index the feature-map locations and the predefined boxes; q[i,j,k] is the class label (q[i,j,k] = 1 for a vehicle, q[i,j,k] = 0 for background), and p[i,j,k] is the predicted probability that a vehicle is present at that location.

[CC] The classification loss is a binary cross-entropy that measures whether each BBOX is classified correctly.
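[CC] Writing the binary cross-entropy out with the symbols above (standard form; the paper's sign/normalization convention may differ):

$$\ell_{cla}(w) = -\sum_{i,j,k}\Big[\, q_{i,j,k}\,\log p_{i,j,k} \;+\; (1 - q_{i,j,k})\,\log\big(1 - p_{i,j,k}\big) \Big]$$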

Thus we define the regression targets as

(equation: regression targets l_x, l_y, s_w, s_h, a_sin, a_cos)
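[CC] An SSD-style parameterization consistent with the names l_x, l_y, s_w, s_h, a_sin, a_cos would be (my own reconstruction; the paper's exact normalization may differ):

$$l_x = \frac{x^{GT} - x^{a}}{w^{a}},\quad l_y = \frac{y^{GT} - y^{a}}{h^{a}},\quad s_w = \log\frac{w^{GT}}{w^{a}},\quad s_h = \log\frac{h^{GT}}{h^{a}},\quad a_{sin} = \sin\theta^{GT},\quad a_{cos} = \cos\theta^{GT}$$

where the superscript a denotes the matched predefined box and GT the ground-truth box.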

We use a weighted average of the smooth L1 losses over all regression targets, where the smooth L1 loss is defined as:

$$\text{smooth}_{L1}(x) = \begin{cases} 0.5\,x^{2} & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$

For each predefined box, we determine the ground-truth box with the highest IoU overlap; when this IoU exceeds a preset threshold (typically 0.4), that ground-truth box serves as ā[k,i,j] and we assign q[i,j,k] = 1.
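[CC] A toy sketch of this target assignment, assuming axis-aligned BEV boxes and reusing an IoU helper like bev_iou above, with the 0.4 threshold from the text:

```python
import numpy as np

def assign_targets(predefined_boxes, gt_boxes, iou_fn, thr=0.4):
    """For each predefined box, pick the best-overlapping ground-truth box.

    Returns q (1 = vehicle, 0 = background) and the index of the matched
    ground-truth box (-1 when unmatched); regression targets are then built
    only for the positive entries.
    """
    q = np.zeros(len(predefined_boxes), dtype=np.int64)
    matched = np.full(len(predefined_boxes), -1, dtype=np.int64)
    for idx, box in enumerate(predefined_boxes):
        ious = [iou_fn(box, gt) for gt in gt_boxes]
        if ious and max(ious) > thr:
            q[idx] = 1
            matched[idx] = int(np.argmax(ious))
    return q, matched
```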

