Monocular Bird's-Eye-View Semantic Segmentation for Autonomous Driving
Autonomous driving requires an accurate representation of the environment around the ego vehicle. The environment consists of static elements such as road layout and lane markings, as well as dynamic elements such as other vehicles, pedestrians, and other types of road users. Static elements can be captured accurately by a high-definition (HD) map containing lane-level information.
Mapping methods fall into two categories: offline and online. For offline mapping and the application of deep learning to it, please refer to my previous blog post: https://towardsdatascience.com/deep-learning-in-mapping-for-autonomous-driving-9e33ee951a44. Online mapping is useful when map support is unavailable, or when the autonomous vehicle visits an area for the first time. One conventional approach to online mapping is SLAM (Simultaneous Localization and Mapping), which relies on the detection and matching of geometric features across a sequence of images; an extended variant additionally incorporates the notion of objects.
This post explores an alternative method for online mapping: bird's-eye-view (BEV) semantic segmentation. Unlike SLAM, which relies on a sequence of images captured over time by a single moving camera, BEV semantic segmentation uses images from multiple cameras mounted on the vehicle, each looking in a different direction and capturing data simultaneously. It can therefore extract more useful information from a single data capture than SLAM. In addition, BEV semantic segmentation remains effective when the ego vehicle is stationary or moving slowly, conditions under which SLAM tends to perform poorly or fail.
Why BEV semantic maps?
In a typical autonomous driving stack, Behavior Prediction and Planning are generally done in a top-down view (also called bird's-eye view, or BEV), since most of the essential information an autonomous vehicle needs can be conveniently represented in this view. This BEV space is often loosely referred to as the 3D space. (For example, object detection in BEV is often called 3D localization, to distinguish it from full-blown 3D object detection.)
It is therefore standard practice to rasterize HD maps into a BEV image and combine them with dynamic object detections for behavior prediction and planning. Recent studies exploring this approach include IntentNet (Uber ATG, 2018), ChauffeurNet (Waymo, 2019), Rules of the Road (Zoox, 2019), and the Lyft Prediction Dataset (Lyft, 2020), among many others.

(Image edited by the author; sources from the references cited in the text)
Conventional computer vision tasks such as object detection and semantic segmentation make estimations in the same coordinate frame as the input image. As a consequence, the perception stack in autonomous driving systems usually operates in the same space as the onboard camera images, referred to here as the perspective view space.

Perception happens in perspective view space (left: SegNet), while Planning happens in BEV space (right: NMP) (source)
The gap between perception and downstream tasks such as prediction and planning is conventionally bridged in the sensor fusion stack, which lifts the 2D observations in perspective space to 3D or BEV, typically with the help of active sensors such as radar or lidar. This approach offers several advantages. First, it is interpretable, making it easier to debug the inherent failure modes of each sensing modality. Second, it is easy to extend to new modalities and to perform late fusion. Additionally, as mentioned above, perception results in this representation can be readily consumed by the prediction and planning modules.
Converting perspective RGB images into BEV
Data from active sensors such as radar or lidar lend themselves to BEV representation because their measurements are inherently metric in 3D. Surround-view camera sensors, on the other hand, are widely available and cheap, so generating semantically meaningful BEV images from them has attracted significant attention.
The word "monocular" in the title of this post reflects the fact that the pipeline inputs are images from monocular RGB cameras, without explicit depth information. The RGB images captured onboard autonomous vehicles are perspective projections of 3D space, and the inverse problem of lifting these 2D perspective observations back into 3D is inherently ill-posed.
Challenges, IPM and Beyond
View transformation is one significant challenge in BEV semantic segmentation. To recover the BEV representation of the 3D space, an algorithm has to exploit geometric priors: hard but potentially noisy priors such as camera intrinsics and extrinsics, and soft priors such as knowledge of typical road layouts and common sense (for example, vehicles do not overlap in BEV). The conventional approach to this task is inverse perspective mapping (IPM), which assumes a flat ground and fixed camera extrinsics. However, this method does not work well on non-flat surfaces or bumpy roads, where the camera extrinsics change.
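As a concrete illustration, IPM can be written as a single homography between the ground plane and the image, assuming a flat ground (z = 0) and known, fixed intrinsics and extrinsics. The sketch below warps a front-camera image onto a metric BEV grid; all numbers (intrinsics, camera height and pitch, grid size) are hypothetical placeholders rather than values from any of the papers discussed here.

```python
import numpy as np
import cv2

# Minimal IPM sketch: warp a front-camera image onto a flat ground plane (z = 0).
# K (intrinsics), R, t (extrinsics), and the BEV grid parameters below are
# hypothetical placeholders; real values come from camera calibration.
K = np.array([[1000., 0., 640.],
              [0., 1000., 360.],
              [0., 0., 1.]])
R = cv2.Rodrigues(np.array([np.deg2rad(-100.), 0., 0.]))[0]   # camera pitched toward the road
t = np.array([[0.], [0.], [1.5]])                             # camera mounted 1.5 m above ground

# Homography from ground-plane coordinates (x, y, 1) to image pixels:
# with z = 0, only the first two columns of R and the translation survive.
H_ground_to_img = K @ np.hstack([R[:, :2], t])

# Map each BEV pixel to a metric ground location (x lateral, y forward).
bev_h, bev_w, m_per_px = 400, 400, 0.1                        # 40 m x 40 m grid at 10 cm/pixel
S = np.array([[m_per_px, 0., -bev_w / 2 * m_per_px],
              [0., -m_per_px, bev_h * m_per_px],              # BEV row 0 is the farthest point
              [0., 0., 1.]])
H_bev_to_img = H_ground_to_img @ S                            # BEV pixel -> source image pixel

image = np.zeros((720, 1280, 3), np.uint8)                    # stand-in for a real camera frame
bev = cv2.warpPerspective(image, H_bev_to_img, (bev_w, bev_h),
                          flags=cv2.WARP_INVERSE_MAP)         # H maps destination -> source
```

Once the camera pitches or the road tilts, this single homography is no longer valid, which is exactly the failure mode described above.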

Another major hurdle is acquiring the data and annotation required for this task. One conceivable approach is to have a drone constantly follow the autonomous vehicle (akin to MobileEye's CES 2020 demonstration) and then ask human annotators to label semantic segmentation on the drone images. A significant drawback of this approach is its lack of practicality and scalability, so many studies resort to synthetic datasets or unpaired map data as substitutes for real-world training data.
In the following sections, we review recent progress in this field and examine the common traits of these studies. They can be roughly grouped into two categories according to the supervision signal they use: the first uses simulation for indirect supervision, while the second directly exploits recently released multimodal datasets for supervision.
Simulation and Semantic Segmentation
Pioneering work in this field resorts to simulation to generate the data and annotations needed to lift perspective images into BEV. To bridge the gap between simulated and real environments (sim2real), many studies use semantic segmentation as a key intermediate representation.
VPN (View Parsing Network, RAL 2020)
Among the first works to explore BEV semantic segmentation, the View Parsing Network (VPN) refers to the task as cross-view semantic segmentation. VPN employs a dedicated view transformer module to model the transformation from perspective view to BEV. This module is implemented as an MLP that flattens the 2D spatial extent into a 1D vector before applying a fully connected operation. It ignores strong geometric priors and relies purely on data to learn how a perspective view warps into BEV. The learned warping is specific to each camera, so a separate network must be trained for every camera setup.
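A minimal PyTorch sketch of such an MLP-based view transformer is shown below. The channel count and feature-map sizes are assumptions for illustration; VPN keeps the input and output spatial dimensions identical, which the defaults here mirror.

```python
import torch
import torch.nn as nn

class ViewTransformerMLP(nn.Module):
    """Sketch of an MLP-style view transformer in the spirit of VPN (assumed shapes).

    Perspective feature map (C, H, W) -> flatten spatial dims -> fully connected
    mapping to the BEV spatial grid -> reshape to (C, H_bev, W_bev).
    """
    def __init__(self, in_hw=(32, 88), out_hw=(32, 88), channels=64):
        super().__init__()
        self.out_hw = out_hw
        self.fc = nn.Linear(in_hw[0] * in_hw[1], out_hw[0] * out_hw[1])

    def forward(self, feat):                     # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        flat = feat.view(b, c, h * w)            # flatten the 2D spatial extent
        bev = self.fc(flat)                      # learned perspective -> BEV warping
        return bev.view(b, c, *self.out_hw)      # (B, C, H_bev, W_bev)

# Each camera gets its own transformer since the learned warping is camera-specific.
feat = torch.randn(2, 64, 32, 88)
bev_feat = ViewTransformerMLP()(feat)
print(bev_feat.shape)  # torch.Size([2, 64, 32, 88])
```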

VPN uses synthetic data (generated with CARLA) and an adversarial loss for domain adaptation during training. It also uses semantic masks as an intermediate representation, which do not suffer from the photorealistic texture gap.
The input and output of the view transformer module have identical dimensions. The paper notes that this design makes it easy to drop into other architectures, but in my opinion the constraint is unnecessary: perspective views and bird's-eye views are fundamentally different spatial domains, so there is no reason to enforce the same pixel format or aspect ratio between input and output. Code is available on github.
Fishing Net (CVPR 2020)
Fishing Net converts lidar, radar, and camera data into a single, unified representation in BEV space, which makes late fusion across modalities much easier. Its view transformation module (the purple block in the vision path) is similar to the MLP-based one in VPN. The input to the view transformation network is a sequence of images, but instead of using an RNN they are simply concatenated along the channel dimension and fed into the network.
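The sketch below illustrates this design choice under assumed frame counts and channel sizes (not Fishing Net's actual values): temporal context is handled by stacking frames along the channel axis and letting an ordinary CNN mix them, rather than by a recurrent module.

```python
import torch
import torch.nn as nn

# Feed a short image sequence without an RNN: stack T frames along the channel
# dimension and let a plain convolution consume them. Frame count and channel
# sizes here are illustrative only.
frames = [torch.randn(1, 3, 192, 320) for _ in range(3)]   # T = 3 RGB frames
stacked = torch.cat(frames, dim=1)                         # (1, 3*T, 192, 320)
encoder = nn.Conv2d(stacked.shape[1], 64, kernel_size=3, padding=1)
features = encoder(stacked)                                # temporal info mixed by the conv
```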

The groundtruth is generated from 3D annotations in lidar and primarily covers dynamic objects such as vehicles and VRUs (vulnerable road users, including pedestrians and cyclists). Everything else is lumped into a background class.
The BEV semantic grid has a resolution of 10 cm or 20 cm per pixel, which is much coarser than the 4 to 5 cm/pixel typically used in offline mapping. Following the practice of VPN, the output has the same spatial size as the input images (192×320). The CVPR 2020 talk is available on Youtube.
VED (ICRA 2019)
VED (Monocular Semantic Occupancy Grid Mapping with Convolutional Variational Encoder-Decoder Networks) exploits a variational encoder-decoder (VED) architecture for semantic occupancy grid map prediction. It encodes the front-view visual information for the driving scene and subsequently decodes it into a BEV semantic occupancy grid.

The groundtruth here is generated from disparity via stereo matching on the Cityscapes dataset. This process can be noisy, which motivated the use of a VED with sampling in the latent space to make the model robust to imperfect groundtruth. However, being a variational autoencoder (VAE), it tends not to produce sharp edges, likely due to the Gaussian prior and the mean-squared-error loss.
The input image is 256×512 pixels and the output is 64×64 pixels. VED adapts the vanilla SegNet architecture (a strong baseline for conventional semantic segmentation) and adds a single 1×2 pooling layer to accommodate the different aspect ratios of the input and output.
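The arithmetic behind that 1×2 pooling layer is easy to verify with a toy sketch (the layer placement below is illustrative, not VED's exact encoder): symmetric downsampling of 256×512 preserves the 1:2 aspect ratio, so one width-only pooling stage is needed to reach the square 64×64 output grid.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 256, 512)
x = nn.MaxPool2d(kernel_size=2)(x)        # 128 x 256 (symmetric downsampling)
x = nn.MaxPool2d(kernel_size=(1, 2))(x)   # 128 x 128 (width-only pooling fixes the ratio)
x = nn.MaxPool2d(kernel_size=2)(x)        # 64 x 64, matching the output grid
print(x.shape)  # torch.Size([1, 3, 64, 64])
```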
Learning to Look around Objects (ECCV 2018)
Learning to Look around Objects for Top-View Representations of Outdoor Scenes (https://arxiv.org/abs/1803.10870) hallucinates the occluded regions in BEV, aided by simulation and map data.
In my opinion this is a seminal paper in the field of BEV semantic segmentation, yet it seems under-appreciated. Perhaps it just needs a catchier name.

The view transformation is achieved through pixel-wise depth prediction followed by projection into BEV space, which partially addresses the lack of training data in the BEV domain. The same idea is adopted in later work such as Lift, Splat, Shoot (ECCV 2020).
The way the paper learns to hallucinate (i.e., predict the occluded portions of the image) is quite clever. For dynamic objects, whose groundtruth depth is hard to obtain, the corresponding loss is simply filtered out. To teach the model to hallucinate, image blocks are randomly masked out and the model is asked to reconstruct them, with the reconstruction loss serving as the supervision signal.
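Below is a minimal sketch of that supervision scheme; the block size, drop probability, and dense prediction tensors are all stand-ins chosen for illustration rather than the paper's actual setup. Blocks are masked at random, the loss is computed only where content was masked, and a validity map additionally zeroes out pixels (such as dynamic objects) that lack reliable groundtruth.

```python
import torch
import torch.nn.functional as F

def random_block_mask(x, block=32, p=0.25):
    """Zero out random square blocks of the input; return the masked input and the mask."""
    b, _, h, w = x.shape
    keep = (torch.rand(b, 1, h // block, w // block) > p).float()
    mask = F.interpolate(keep, size=(h, w), mode="nearest")   # 1 = visible, 0 = masked out
    return x * mask, mask

images = torch.randn(4, 3, 256, 512)                 # stand-in for input frames
target = torch.randn(4, 8, 256, 512)                 # stand-in for dense groundtruth (e.g., semantics)
valid = torch.ones(4, 1, 256, 512)                   # 0 where groundtruth is unreliable (dynamic objects)

masked_images, mask = random_block_mask(images)
pred = torch.randn(4, 8, 256, 512, requires_grad=True)   # stand-in for the network output

# Supervise hallucination only on the masked-out, valid pixels.
per_pixel = F.l1_loss(pred, target, reduction="none").mean(dim=1, keepdim=True)
loss = (per_pixel * (1 - mask) * valid).sum() / ((1 - mask) * valid).sum().clamp(min=1)
loss.backward()
```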

Explicit paired supervision is hard to obtain in BEV space, which motivated the use of an adversarial loss during training, with simulation and OpenStreetMap data guiding the learning so that the generated road layouts look like real ones. This technique is also used in later work such as MonoLayout (WACV 2020).
It uses a single CNN in image space to predict depth and semantics, lifts these predictions into 3D, and renders them in BEV. A second CNN then operates in BEV space to refine the results. This refinement module in BEV space is used in many other works, including Cam2BEV (ITSC 2020) and Lift, Splat, Shoot (ECCV 2020, https://arxiv.org/abs/2008.05711).
Cam2BEV (ITSC 2020)

Cam2BEV (A Sim2Real Deep Learning Approach for the Transformation of Images from Multiple Vehicle-Mounted Cameras to a Semantically Segmented Image in Bird's Eye View, https://arxiv.org/abs/2005.04078) uses a spatial transformer module with IPM to map perspective features into BEV space. The neural network takes in four images from cameras facing different directions, and IPM is applied to the features of each image before concatenation.

Cam2BEV relies on synthetic data generated with the VTD (virtual test drive) simulation environment. It takes four semantic segmentation images as input and focuses on the lifting step, which sidesteps the sim2real domain gap.
Cam2BEV has a fairly focused scope and many practical design choices. It works only in semantic space, which avoids the sim2real domain gap. Its preprocessing deliberately masks out occluded regions to make the problem more tractable. To ease lifting, it also takes as input a "homography image", obtained by applying IPM to the semantic segmentation results and concatenating them into a 360-degree BEV image. Its main job is therefore to reason about the true extent of 3D objects in BEV, which can appear elongated in the homography image.

Cam2BEV uses a fixed IPM and therefore cannot handle uneven surfaces or elevation changes while driving. The source code is available on github.
All you need is (multimodal) datasets
The recent release of numerous multimodal datasets, such as Lyft, nuScenes, and Argoverse, has opened up the possibility of directly supervising the monocular BEV semantic segmentation task. These datasets provide not only 3D object detection groundtruth but also HD maps, along with localization information to pinpoint the ego vehicle's position on the map at every timestamp.
The BEV segmentation task has two parts: dynamic object segmentation and static road layout segmentation. For object segmentation, the 3D bounding boxes are rasterized into the BEV image to generate annotations. For static road layout, the maps are transformed into the ego vehicle frame using the localization results and then rasterized into BEV annotations.
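A minimal sketch of the dynamic-object half of this recipe is shown below; the box format, grid size, and ego-frame conventions are assumptions for illustration and are not tied to any particular dataset SDK.

```python
import numpy as np
import cv2

def rasterize_boxes_bev(boxes, grid_size=200, m_per_px=0.5):
    """Rasterize 3D box footprints (in the ego frame) into a BEV annotation mask.

    `boxes` is a list of (cx, cy, length, width, yaw) tuples in meters/radians;
    the format is illustrative, not tied to any specific dataset API.
    """
    mask = np.zeros((grid_size, grid_size), dtype=np.uint8)
    for cx, cy, length, width, yaw in boxes:
        # Box corners in the ego frame (x forward, y left).
        dx, dy = length / 2, width / 2
        corners = np.array([[dx, dy], [dx, -dy], [-dx, -dy], [-dx, dy]])
        rot = np.array([[np.cos(yaw), -np.sin(yaw)],
                        [np.sin(yaw),  np.cos(yaw)]])
        pts = corners @ rot.T + np.array([cx, cy])
        # Ego frame -> BEV pixels: ego at the bottom center, x pointing up the image.
        px = grid_size // 2 - pts[:, 1] / m_per_px
        py = grid_size - pts[:, 0] / m_per_px
        cv2.fillPoly(mask, [np.stack([px, py], axis=1).astype(np.int32)], 1)
    return mask

# Example: a 4.5 m x 2 m vehicle 10 m ahead, slightly rotated.
bev_gt = rasterize_boxes_bev([(10.0, 0.0, 4.5, 2.0, 0.1)])
```

Static map layers can be handled analogously by transforming map polygons into the ego frame with the localization pose and filling them into the same grid.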
MonoLayout (WACV 2020)
MonoLayout: Amodal Scene Layout from a Single Image focuses on lifting a single camera image into semantic BEV space. The emphasis of the paper is amodal completion, that is, reasoning about occluded regions. It appears heavily influenced by Learning to Look around Objects (ECCV 2018).

The view transformation is performed with an encoder-decoder structure, with the latent feature called the "shared context". Two separate decoders handle the static and dynamic classes. In an ablation study, the authors also report negative results from using a single combined decoder for both static and dynamic objects.

Though HD map groundtruth is available in the Argoverse dataset, MonoLayout chooses to use it only for evaluation and not for training (hindsight or deliberate design choice?). For training, MonoLayout uses a temporal sensor fusion process to generate weak groundtruth by aggregating 2D semantic segmentation results across a video using localization information. It uses monodepth2 to lift RGB pixels into a point cloud, and discards anything more than 5 m away from the ego car as it could be noisy. To encourage the network to output plausible scene layouts, MonoLayout uses adversarial feature learning (similar to that used in Learning to Look around Objects), with the prior data distribution obtained from OpenStreetMap.
MonoLayout has a spatial resolution of 30 cm/pixel, translating its 128×128 output into a BEV space of roughly 40 m × 40 m. Code is available on github.
PyrOccNet (CVPR 2020)
PyrOccNet: Predicting Semantic Map Representations from Images using Pyramid Occupancy Networks (https://arxiv.org/abs/2003.13402) generates BEV semantic maps from monocular images and fuses them into a coherent view with Bayesian filtering.

The core of the view transformation in PyrOccNet is performed by a dense transformer module. It appears heavily inspired by OFT (BMVC 2019) from the same authors. OFT uniformly smears the feature at a pixel location along the ray projected back into 3D space, much like the backprojection algorithm used in computed tomography. The dense transformer in PyrOccNet goes one step further by using an FC layer to expand features along the depth axis. In practice, multiple dense transformers operate at different scales, each focusing on a distinct distance range in BEV space.
The dense transformer layer is inspired by the observation that while the network needs a lot of vertical context to map features to the birds-eye-view (due to occlusion, lack of depth information, and the unknown ground topology), in the horizontal direction the relationship between BEV locations and image locations can be established using simple camera geometry. — from the PyrOccNet paper
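A rough PyTorch sketch of this idea, with assumed channel counts, feature height, and depth bins (not the paper's exact design), is given below: each image column is collapsed vertically into a bottleneck and then expanded along the depth axis with a fully connected layer, producing a polar feature map that would subsequently be resampled into Cartesian BEV.

```python
import torch
import torch.nn as nn

class DenseTransformerSketch(nn.Module):
    """Rough sketch of a PyrOccNet-style dense transformer (assumed shapes)."""
    def __init__(self, channels=64, feat_h=12, depth_bins=50):
        super().__init__()
        self.depth_bins = depth_bins
        self.collapse = nn.Linear(channels * feat_h, channels)      # squeeze vertical context
        self.expand = nn.Linear(channels, channels * depth_bins)    # distribute along depth

    def forward(self, feat):                                        # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        cols = feat.permute(0, 3, 1, 2).reshape(b, w, c * h)        # one column per image u
        bottleneck = self.collapse(cols)                            # (B, W, C)
        polar = self.expand(bottleneck).view(b, w, c, self.depth_bins)
        # (B, C, depth, W): still a polar grid; PyrOccNet resamples it to a
        # Cartesian BEV grid using the camera geometry (omitted here).
        return polar.permute(0, 2, 3, 1)

bev_polar = DenseTransformerSketch()(torch.randn(2, 64, 12, 100))
print(bev_polar.shape)  # torch.Size([2, 64, 50, 100])
```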

The training data comes from the multimodal Argoverse and nuScenes datasets, which provide both map data and 3D object detection groundtruth.
PyrOccNet uses Bayesian filtering to fuse information across multiple cameras and across time in a principled manner. It draws on the classic idea of the binary Bayesian occupancy grid and improves the interpretability of the network output. The temporal fusion closely resembles a mapping process, similar to the "temporal sensor fusion" used to generate weak groundtruth in MonoLayout.
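As a reminder of how binary Bayesian occupancy fusion works, here is a minimal numpy sketch that combines per-frame occupancy probability grids in log-odds space; the grids are assumed to be already aligned in a common BEV frame, and the prior is an illustrative choice.

```python
import numpy as np

def fuse_occupancy(prob_maps, prior=0.5):
    """Minimal sketch of binary Bayesian occupancy fusion in log-odds space.

    `prob_maps` is a list of per-frame (or per-camera) occupancy probability grids
    already aligned in the same BEV frame; the alignment step is omitted here.
    """
    logit = lambda p: np.log(p / (1.0 - p))
    log_odds = logit(np.full_like(prob_maps[0], prior))
    for p in prob_maps:
        p = np.clip(p, 1e-4, 1 - 1e-4)
        log_odds += logit(p) - logit(np.full_like(p, prior))   # standard log-odds update
    return 1.0 / (1.0 + np.exp(-log_odds))                     # back to probability

fused = fuse_occupancy([np.random.rand(200, 200) for _ in range(3)])
```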

PyrOccNet has a spatial resolution of 25 cm/pixel, so its output grid of roughly 200×200 pixels covers a BEV area of about 50 m × 50 m. Code will be available on github.
Lift, Splat, Shoot (ECCV 2020)

Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D performs dense pixel-wise depth estimation for the view transformation. It first uses per-camera CNNs to predict a probabilistic pixel-wise depth distribution, lifting each perspective image into a 3D point cloud, and then uses the camera extrinsics to splat the points onto BEV. Finally, a BEV CNN refines the predictions. The "shoot" part refers to path planning and is skipped here, as it is outside the scope of this post.

The probabilistic 3D lifting is achieved by predicting a depth distribution for every pixel in the RGB image. It essentially combines the one-hot lifting of pseudo-lidar (CVPR 2019) and the uniform lifting of OFT (BMVC 2019). This "soft" prediction is a trick commonly used in differentiable rendering; Pseudo-Lidar v3 (CVPR 2020) also uses this soft rasterization trick to make depth lifting and projection differentiable.
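The sketch below illustrates this "lift" step under assumed channel counts and depth-bin settings (not the paper's exact hyperparameters): a 1×1 convolution predicts both a categorical depth distribution and a context feature per pixel, and their outer product yields a feature for every cell of the camera frustum, ready to be splatted onto the BEV grid.

```python
import torch
import torch.nn as nn

class LiftSketch(nn.Module):
    """Sketch of the probabilistic 'lift' step in the spirit of Lift-Splat-Shoot.

    For every image feature pixel, predict a categorical distribution over D depth
    bins and take the outer product with the C-dimensional context feature, producing
    a feature for every (depth, pixel) location in the camera frustum.
    """
    def __init__(self, in_channels=64, context_channels=64, depth_bins=41):
        super().__init__()
        self.depth_bins = depth_bins
        self.head = nn.Conv2d(in_channels, depth_bins + context_channels, kernel_size=1)

    def forward(self, feat):                                  # feat: (B, C_in, H, W)
        out = self.head(feat)
        depth_logits, context = out.split(
            [self.depth_bins, out.shape[1] - self.depth_bins], dim=1)
        depth_prob = depth_logits.softmax(dim=1)              # (B, D, H, W)
        # Outer product: (B, 1, D, H, W) * (B, C, 1, H, W) -> (B, C, D, H, W)
        frustum_feat = depth_prob.unsqueeze(1) * context.unsqueeze(2)
        return frustum_feat                                   # later "splatted" onto BEV

frustum = LiftSketch()(torch.randn(1, 64, 8, 22))
print(frustum.shape)  # torch.Size([1, 64, 41, 8, 22])
```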
The training data comes from the multimodal Lyft dataset (https://self-driving.lyft.com/level5/data/) and nuScenes dataset (https://www.nuscenes.org/), which provide both map data and 3D object detection groundtruth.
Lift-Splat-Shoot has an input resolution of 128×352, and its BEV grid is 200×200 at a resolution of 0.5 m/pixel, covering 100 m × 100 m. Code will be available on github.
Limitations and Future Directions
Despite considerable recent progress, BEV semantic segmentation still has notable gaps to close before it can be widely deployed in production systems.
First, there is no notion of instances for dynamic participants. This makes it hard to exploit prior knowledge about dynamic objects in behavior prediction. For example, cars follow particular motion models (such as the bicycle model) and have relatively constrained future trajectories, whereas pedestrians move far more erratically. Many existing methods lump multiple cars into a single contiguous region in the semantic segmentation results.
Second, the dynamic semantic classes are transient and cannot be reused, whereas the static semantic classes in the BEV image, such as road layout and markings, are essentially a real-time local map and should ideally be harvested and recycled. How to aggregate BEV semantic segmentation across multiple timestamps into a better map is a key question. The temporal sensor fusion used in MonoLayout and PyrOccNet may be a promising direction, but it needs to be benchmarked against conventional SLAM-based approaches.
How can the online pixel-wise semantic map be converted into a lightweight, structured format so that it can be reused later? To avoid wasting precious onboard mapping cycles, the online map should be stored in a form that the ego vehicle and other vehicles can efficiently consume in the future.
Takeaway
- View transformation: Many existing works ignore the strong geometric priors of the camera extrinsics, which should be avoided. PyrOccNet and Lift-Splat-Shoot handle this best.
- Data and supervision: Most research before 2019 relied on simulation data and used semantic segmentation as an intermediate representation to bridge the sim2real domain gap. Recent work increasingly leverages multimodal datasets for direct supervision, with impressive results.
- Personally, I believe perception in BEV space is the future of the field. With the help of differentiable rendering, the view transformation module can be made differentiable and plugged into an end-to-end model that lifts perspective images directly into BEV space.
