mmWave Perception Paper Reading Notes: CVPR 2023, Hidden Gems: 4D Radar Scene Flow Learning Using Cross-Modal Supervision
原始笔记链接:https://mp.weixin.qq.com/s?__biz=Mzg4MjgxMjgyMg==&mid=2247486705&idx=1&sn=2e9c8be25d079fcf9dca9a9b67a90651&chksm=cf51be08f826371edc3226955f1acff0bb560f5256611b43ea91f9db6a579d5313b97ec598e5#rd
↑ Open the link above to read the full note
CVPR 2023 Highlight | Hidden Gems: 4D Radar Scene Flow Learning Using Cross-Modal Supervision

0 Abstract
The paper presents a cross-modal supervised learning method for 4D radar scene flow estimation.
Motivation *
Co-located sensing redundancy in modern autonomous vehicles
This redundancy provides diverse supervision cues for radar scene flow estimation.
Method: a multi-task network architecture designed for the identified cross-modal learning problem, with loss functions formulated from multiple cross-modal constraints for effective model training.
Experiments *
SOTA performance
Cross-modal supervised learning is shown to yield markedly more accurate 4D radar scene flow.
shown to be useful for two subtasks
✅ motion segmentation
✅ ego-motion estimation

1 Introduction
Scene flow estimation *
Definition of scene flow estimation
Estimating the 3D motion vector field of both the static and the moving parts of the environment relative to the ego-agent.
Importance of scene flow in the context of self-driving
✅ Provides motion cues for various tasks
Current scene flow estimation approaches *
Fully-supervised, weakly-supervised, or self-supervised learning.
Challenges of these approaches
✅ annotating scene flow for supervised learning is labor-intensive
✅ the often subpar performance of self-supervised learning methods
Specific challenges in 4D radar scene flow learning *
Rise of 4D automotive radars
4D radar is robust to adverse conditions and can measure target velocity.
Complications with 4D radar point clouds
The sparsity and noise of radar point clouds make scene flow annotation for supervised learning difficult.
Solution: Hidden Gems *
exploiting cross-modal supervision signals in autonomous vehicles
Modern autonomous vehicles carry multiple co-located sensors that provide redundant and complementary information.
The authors exploit this co-located perception redundancy to obtain diverse and useful supervision cues for radar scene flow learning.
Central research question: how to retrieve and exploit cross-modal supervision signals from on-board sensors to improve radar scene flow learning.
Useful supervision cues are drawn from a range of sensors, including the odometer (GPS/INS), LiDAR, and RGB camera.
✅ Train: Multi-modal data; Test: Only radar data
- The first cross-modal supervised 4D radar scene flow learning method
- A new multi-task model architecture with corresponding loss-function definitions
- State-of-the-art performance is demonstrated, and the method's usefulness for downstream tasks is evaluated
2 Related Work
- Scene flow *
Scene flow was first defined as a 3D uplift of optical flow
Traditional approaches to scene flow from either RGB or RGB-D images:
These methods rely on prior knowledge, or train deep networks with supervised or unsupervised learning.
Various approaches directly estimate instantaneous scene flow based on 3D sparse point clouds.
✅ These methods may rely on online optimization
Deep-learning-based methods are now the predominant approach for point-cloud-based scene flow estimation.
- Deep scene flow on point clouds *
State-of-the-art (SOTA) solutions rely on large annotated datasets for training (supervised).
Fully supervised training with ground-truth flow is labor-intensive and expensive because of scene flow annotation.
✅ simulated dataset for training: may result in poor generalization
Self-supervised learning frameworks aim to avoid annotation labor and the drawbacks of synthetic data.
✅ Exploit supervision signals from the input data
Performance is limited: no real labels are used to supervise the models.
There is a trade-off between annotation effort and performance.
Some methods combine ego-motion information with manually-annotated background segmentation labels.
✅ ego-motion is easily assessed from odometry sensors
However, the segmentation labels are still manually annotated and costly to obtain.
- Radar scene flow *
Previous studies are not directly applicable to sparse and noisy radar point clouds.
They typically infer scene flow on dense point clouds captured by LiDAR or rendered from stereo images.
A recent work introduces a self-supervised framework for radar scene flow estimation.
Without real supervision cues, however, its scene flow performance remains limited.
- The proposed approach *
Solution of the supervision problem:
Acquire supervisory signals by utilizing data from co-located sensors in a fully automated process.
✅ without resorting to any human intervention during training
The model depends on other modalities only during training; they are not needed at inference time.
3 Method
3.1 Problem Definition
Defines the task of scene flow estimation.
Scene flow estimation:
Aims to solve for a motion field that captures the non-rigid motion induced by both the ego-vehicle's movement and the moving objects in the environment.
The input of a point cloud-based scene flow method is a pair of consecutive point clouds: the source frame $\mathbf{P}^s$ (with $N$ points) and the target frame $\mathbf{P}^t$. For each point, $\mathbf{c}_i$ denotes its 3D coordinates and $\mathbf{x}_i \in \mathbb{R}^C$ its raw features.
The output is a set of point-wise 3D motion vectors $\mathbf{F} = \{\mathbf{f}_i \in \mathbb{R}^3\}_{i=1}^{N}$ that map each point of the source frame $\mathbf{P}^s$ to its corresponding position $\mathbf{c}_i' = \mathbf{c}_i^s + \mathbf{f}_i$ in the target frame.
Note that $\mathbf{P}^s$ and $\mathbf{P}^t$ need not contain the same number of points (there is no strict one-to-one correspondence), so the warped locations $\mathbf{c}_i'$ need not coincide with any point in the target point cloud $\mathbf{P}^t$ (a small data-layout sketch is given at the end of this subsection).
- In the context of 4D radar, the raw point features include:
1 the relative radial velocity (RRV):
✅ contain partial motion information
2 the radar cross-section (RCS) measurements
✅ depends on the target's reflective characteristics and the beam incidence angles
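A small numpy sketch of this data layout and of the warping relation $\mathbf{c}_i' = \mathbf{c}_i^s + \mathbf{f}_i$ (synthetic arrays; the two feature channels stand in for RRV and RCS):

```python
import numpy as np

# Source and target radar frames with different point counts (N != M is allowed).
N, M, C = 256, 240, 2
P_src = {"coords": np.random.randn(N, 3), "feats": np.random.randn(N, C)}
P_tgt = {"coords": np.random.randn(M, 3), "feats": np.random.randn(M, C)}

# A scene flow estimate assigns each source point a 3D motion vector f_i.
flow = np.zeros((N, 3))                  # placeholder prediction F = {f_i}
warped = P_src["coords"] + flow          # c_i' = c_i^s + f_i
# The warped points need not coincide with any point of P_tgt.
```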
3.2 Overview
presents the overall pipeline of the proposed method.

- The model estimates radar scene flow in two stages.
- Stage 1:
  - extract base feature descriptors from the input point cloud pair
  - two independent heads predict per-point moving probabilities and initial scene flow vectors; the probabilities are later thresholded into a binary motion segmentation mask
- Stage 2: estimates a rigid transformation corresponding to the radar's ego-motion.
- The final scene flow is obtained by refining the flow vectors of identified static points with this rigid transformation, which suppresses noise.
- The model produces several outputs:
  - a rigid-body transformation (the ego-motion estimate)
  - a motion segmentation mask (separating static from moving points)
  - the refined final scene flow
- Supervision: supervision signals are retrieved from co-located modalities.
- The model is trained end-to-end by minimizing a loss $\mathcal{L}$ with three components:
  - $\mathcal{L} = \mathcal{L}_{\text{ego}} + \mathcal{L}_{\text{seg}} + \mathcal{L}_{\text{flow}}$
  - $\mathcal{L}_{\text{ego}}$: measures the ego-motion error.
  - $\mathcal{L}_{\text{seg}}$: measures the motion segmentation error.
  - $\mathcal{L}_{\text{flow}}$: supervises the final scene flow.
3.3 Model Architecture
The encoder takes the input point clouds $\mathbf{P}^s$ and $\mathbf{P}^t$ and produces point-wise latent features. It first applies set conv layers to extract multi-scale local features from each point cloud, then uses a cost volume layer to compute feature correlations between points. The backbone outputs features $\mathbf{E} \in \mathbb{R}^{N \times C_e}$.
Two heads follow. The first head predicts an initial flow field $\hat{\mathbf{F}}^{\text{init}}=\{\hat{\mathbf{f}}_i^{\text{init}} \in \mathbb{R}^3\}_{i=1}^N$, capturing the basic motion across the input pair. The second head outputs a moving probability map $\hat{\mathbf{S}}=\{\hat{s}_i \in[0,1]\}_{i=1}^N$, which quantifies the likelihood that each point in $\mathbf{P}^s$ is moving. Both heads are multi-layer perceptrons (the probability map is bounded by a sigmoid activation); they operate independently but share the base features extracted by the backbone.
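A brief PyTorch-style sketch of the two heads; the layer widths and activations here are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class FlowHead(nn.Module):
    """Predicts an initial per-point 3D flow vector from backbone features."""
    def __init__(self, c_e: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c_e, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:   # feats: (N, C_e)
        return self.mlp(feats)                                 # (N, 3) initial flow

class MotionSegHead(nn.Module):
    """Predicts a per-point moving probability in [0, 1]."""
    def __init__(self, c_e: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c_e, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.mlp(feats)).squeeze(-1)      # (N,) probabilities
```

Both heads consume the same backbone features $\mathbf{E} \in \mathbb{R}^{N \times C_e}$ but are trained through different loss terms.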
Ego-motion Head *
Takes the correspondences $\{\mathbf{c}_i^s,\ \mathbf{c}_i^s+\hat{\mathbf{f}}_i^{\text{init}}\}_{i=1}^N$ together with the probability map $\hat{\mathbf{S}}$ as input.
Infers a rigid transformation that represents the radar’s ego-motion
✅ using the differentiable weighted Kabsch algorithm (see the sketch below)
A binary motion segmentation mask, obtained by thresholding $\hat{\mathbf{S}}$ with a fixed value $\eta_b$, indicates whether each point is moving and identifies the static points used for flow refinement.
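A minimal numpy sketch of the weighted Kabsch step; the real model uses a differentiable version inside the network, and weighting the correspondences so that likely-static points dominate is an assumption for illustration:

```python
import numpy as np

def weighted_kabsch(src, dst, w):
    """Rigid (R, t) minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2."""
    w = w / (w.sum() + 1e-8)
    mu_s = (w[:, None] * src).sum(axis=0)                # weighted centroids
    mu_d = (w[:, None] * dst).sum(axis=0)
    H = (w[:, None] * (src - mu_s)).T @ (dst - mu_d)     # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))               # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

Here it would be applied with src = $\mathbf{c}_i^s$, dst = $\mathbf{c}_i^s+\hat{\mathbf{f}}_i^{\text{init}}$, and per-point weights derived from $\hat{\mathbf{S}}$.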
Adjustment Layer * Refines the initial scene flow of static points using the ego-motion estimated by the ego-motion head: the initial flow vectors of identified static points are replaced with the rigid flow induced by the radar's ego-motion, yielding the final scene flow.
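A minimal numpy sketch of this refinement, assuming the estimated ego-motion is given as a rotation $\mathbf{R}$ and translation $\mathbf{t}$ that map source-frame coordinates of static points into the target frame (illustrative helper, not the authors' code):

```python
import numpy as np

def refine_flow(init_flow, coords_src, R, t, moving_prob, eta_b=0.5):
    """Replace the flow of predicted-static points with the ego-motion-induced rigid flow."""
    rigid_flow = coords_src @ R.T + t - coords_src   # f_i^r = T [c_i^s; 1] - c_i^s
    static = moving_prob < eta_b                     # binary motion segmentation mask
    final_flow = np.where(static[:, None], rigid_flow, init_flow)
    return final_flow, static
```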
Temporal Update Module * An optional component that propagates hidden states from one frame to the next. * A recurrent unit treats the backbone's global features as hidden states and updates them across consecutive frames.
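A possible sketch of the temporal update idea, assuming a GRU-style cell over pooled global backbone features (the specific recurrent unit and feature size are assumptions):

```python
import torch
import torch.nn as nn
from typing import Optional

class TemporalUpdate(nn.Module):
    """Carries a global hidden state from one radar frame pair to the next."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.gru = nn.GRUCell(input_size=feat_dim, hidden_size=feat_dim)

    def forward(self, global_feat: torch.Tensor,
                hidden: Optional[torch.Tensor] = None) -> torch.Tensor:
        # global_feat: (B, feat_dim) pooled backbone features of the current pair
        if hidden is None:
            hidden = torch.zeros_like(global_feat)
        return self.gru(global_feat, hidden)   # updated hidden state for the next frame
```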
The model thus addresses three tasks, whose outputs are tightly coupled.
3.4 Cross-Modal Supervision Retrieving
Describes how cross-modal supervision signals are retrieved from three co-located sensors (odometer, LiDAR, and RGB camera) so that the model can be trained without manual annotation.
- Ego-motion Loss * The supervision comes from the odometer ($\mathbf{O} \in \mathbb{R}^{4 \times 4}$).
  - The target rigid transformation $\mathbf{T}$ is obtained by the inverse transform $\mathbf{T}=\mathbf{O}^{-1}$, from which the rigid flow components $\{\mathbf{f}_i^r\}_{i=1}^N$ of static points can be derived.
  - The ego-motion loss in the vehicle coordinate frame is defined as:
$$\mathcal{L}_{\text{ego}}=\frac{1}{N}\sum_{i=1}^{N}\left\|(\hat{\mathbf{T}}-\mathbf{T})\,[\mathbf{c}_i^s\;\;1]^{\top}\right\|_2$$
  - Supervising $\hat{\mathbf{T}}$ also implicitly constrains the initial scene flow $\hat{\mathbf{f}}_i^{\text{init}}$ of static points and keeps the flow refinement consistent with the estimated ego-motion.
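A minimal numpy sketch of $\mathcal{L}_{\text{ego}}$, assuming $4\times4$ homogeneous transforms:

```python
import numpy as np

def ego_motion_loss(T_hat, T_gt, coords_src):
    """Mean distance between source points mapped by the predicted and target transforms."""
    hom = np.hstack([coords_src, np.ones((coords_src.shape[0], 1))])  # (N, 4) homogeneous
    diff = hom @ (T_hat - T_gt).T                                     # ((T_hat - T) [c_i; 1])^T
    return np.linalg.norm(diff[:, :3], axis=1).mean()
```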
- Motion Segmentation Loss *
  - Pseudo motion segmentation labels are built from two modalities: the odometer yields $\mathbf{S}^v=\{s_i^v \in\{0,1\}\}_{i=1}^N$, while the LiDAR-derived foreground information $\mathbf{S}^{fg}$ yields $\mathbf{S}^{l}=\{s_i^{l} \in\{0,1\}\}_{i=1}^N$ by marking points with apparent non-rigid flow as dynamic.
  - A more reliable pseudo motion segmentation $\mathbf{S}=\{s_i\}_{i=1}^N$ is obtained by fusing $\mathbf{S}^{l}$ and $\mathbf{S}^v$.
  - The loss is a class-balanced binary cross-entropy: half the sum of the average log-loss over the negative (static) points and the average log-loss over the positive (moving) points:
$$\mathcal{L}_{\text{seg}}=\frac{1}{2}\left(\frac{\sum_{i:\,s_i=0}-\log\left(1-\hat{s}_i\right)}{\sum_{i=1}^{N}\left(1-s_i\right)}+\frac{\sum_{i:\,s_i=1}-\log \hat{s}_i}{\sum_{i=1}^{N} s_i}\right)$$
- Scene Flow Loss *
  - The foreground flow term $\mathcal{L}_{\text{mot}}$ uses pseudo scene flow labels $\mathbf{F}^{fg}$ derived from LiDAR-based 3D MOT results; it averages the flow error over the pseudo-labeled moving points:
$$\mathcal{L}_{\text{mot}}=\frac{1}{\sum_{i=1}^{N} s_i}\sum_{i=1}^{N} s_i\left\|\hat{\mathbf{f}}_i-\mathbf{f}_i^{fg}\right\|_2$$
  - The optical flow term $\mathcal{L}_{\text{opt}}$ uses pseudo optical-flow labels $\mathbf{W}=\{\mathbf{w}_i \in \mathbb{R}^2\}_{i=1}^N$ from the camera and measures a point-to-ray distance:
$$\mathcal{L}_{\text{opt}}=\frac{1}{\sum_{i=1}^{N} s_i}\sum_{i=1}^{N} s_i\, D\!\left(\mathbf{c}_i^{s}+\hat{\mathbf{f}}_i,\ \mathbf{m}_i+\mathbf{w}_i,\ \theta\right)$$
  where $\mathbf{m}_i$ is the image projection of $\mathbf{c}_i^s$ and $\theta$ denotes the camera parameters.
  - A self-supervised loss $\mathcal{L}_{\text{self}}$ (as in [9]) complements the cross-modal supervision.
  - The full scene flow loss is $\mathcal{L}_{\text{flow}}=\mathcal{L}_{\text{mot}}+\lambda_{\text{opt}} \mathcal{L}_{\text{opt}}+\mathcal{L}_{\text{self}}$.
- Note: supervision signals retrieved from individual modalities are inevitably noisier than human annotation.
  - However, when these noisy signals are combined well, the overall supervision noise is suppressed and effective training becomes possible (see the sketch below).
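A minimal numpy sketch of the loss terms reconstructed above ($\mathcal{L}_{\text{seg}}$, $\mathcal{L}_{\text{mot}}$, $\mathcal{L}_{\text{opt}}$). The pinhole point-to-ray distance and all variable names are illustrative assumptions, points are assumed to be already expressed in the camera frame, and $\mathcal{L}_{\text{self}}$ from [9] is omitted:

```python
import numpy as np

def seg_loss(s_hat, s, eps=1e-7):
    """Class-balanced binary cross-entropy over pseudo moving/static labels."""
    pos, neg = s == 1, s == 0
    l_pos = -np.log(s_hat[pos] + eps).mean() if pos.any() else 0.0   # average over moving points
    l_neg = -np.log(1.0 - s_hat[neg] + eps).mean() if neg.any() else 0.0  # average over static points
    return 0.5 * (l_pos + l_neg)

def mot_loss(f_hat, f_fg, s):
    """L2 flow error averaged over pseudo-labeled moving points."""
    err = np.linalg.norm(f_hat - f_fg, axis=1)
    return (s * err).sum() / (s.sum() + 1e-7)

def point_to_ray(p_cam, pixel, K):
    """Distance from a 3D point (camera frame) to the camera ray through a pixel."""
    d = np.linalg.solve(K, np.append(pixel, 1.0))
    d = d / np.linalg.norm(d)                        # unit ray direction
    return np.linalg.norm(p_cam - (p_cam @ d) * d)   # remove the component along the ray

def opt_loss(warped_cam, m, w, K, s):
    """Point-to-ray distance between warped points and optical-flow-shifted pixels."""
    dists = np.array([point_to_ray(p, px + dw, K)
                      for p, px, dw in zip(warped_cam, m, w)])
    return (s * dists).sum() / (s.sum() + 1e-7)
```

In this sketch `warped_cam` stands for $\mathbf{c}_i^s+\hat{\mathbf{f}}_i$ transformed into the camera frame, `m` and `w` for the projections $\mathbf{m}_i$ and pseudo optical flow $\mathbf{w}_i$, and `K` for the intrinsics inside $\theta$.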
4 Experiments
4.1 Experimental Setup
Describes the dataset, evaluation metrics, baselines, and implementation details of the proposed framework.
Dataset *
- The View-of-Delft (VoD) dataset is used.
- It provides synchronized and calibrated data from co-located sensors: a 64-beam LiDAR, an RGB camera, an RTK-GPS/IMU-based odometer, and a 4D radar.
- New splits are created from the official sets for evaluation.
- Data frames are grouped into pairs of consecutive radar point clouds to form scene flow samples.
Metrics *
- Three main metrics for scene flow estimation: average end-point error (EPE), strict/relaxed accuracy (AccS/AccR), and resolution-normalized error (RNE).
- RNE is also computed separately for moving and static points, denoted MRNE and SRNE.
Baselines *
- Seven SOTA methods serve as baselines: five self-supervised learning-based methods and two non-learning methods.
- For fairness, their default configurations are adopted.
Implementation Details *
- Adam optimizer.
- A validation split is used for model selection and for choosing training hyperparameters.
- The motion segmentation threshold is set to 0.5.
- The temporal update module can optionally be enabled as a separate variant.
4.2 Scene Flow Evaluation
Overall results *
- CMFlow shows significant improvement over the baselines on all metrics.
  - ✅ 37.6% lower EPE than the second-best method
  - ✅ higher reliability for safety-critical applications
- Adding the temporal update scheme further improves CMFlow's performance.
Impact of modalities *
- The odometer provides the biggest performance gain.
  - ✅ most points are static, so ego-motion supervision is very effective
- LiDAR and camera also contribute gains.
  - ✅ they provide supervision for moving points
  - Their supervision is less accurate than the odometer's: optical flow errors caused by reflections, plus LiDAR object detection errors.
- The results validate that each modality is exploited effectively.
Impact of unannotated data *
- Increasing the amount of unannotated training data benefits CMFlow's performance.
- With only 20% more unannotated data, CMFlow already outperforms a fully-supervised method trained on significantly less annotated data.
- The findings suggest promising opportunities for leveraging large unannotated datasets.
Qualitative results *
- The estimated scene flow aligns the static background well and captures the objects of interest.
- Predictions are close to the ground truth.
- Overall, CMFlow achieves leading performance without relying on manual annotation; systematic experiments evidence the efficacy of cross-modal supervised learning.
4.3 Subtask Evaluation
Motion segmentation *
- CMFlow's predictions improve further with additional unlabeled samples.
- The pseudo labels derived from the odometer and LiDAR contribute significantly to performance.
- Qualitative analysis shows accurate segmentation of moving objects across scenarios.
Ego-motion estimation *
- Accurate frame-to-frame transformation estimation.
- Supervision from LiDAR and camera further improves performance.
- The accumulated long-term odometry significantly outperforms the ICP baseline.
Overall *
- CMFlow benefits downstream tasks:
  - motion segmentation
  - ego-motion estimation
- This illustrates the effectiveness of cross-modal learning.
5 Conclusion
Key points *
- CMFlow: a cross-modal supervised method for 4D radar scene flow estimation.
- The first method to obtain supervision from co-located sensors for this task.
- Trains without any manually annotated data.
- Experiments show superior performance on all metrics, demonstrating consistent effectiveness.
- With a sufficient amount of unlabeled training data, the approach outperforms the fully-supervised method.
- The model also brings significant improvements on two key subtasks: motion segmentation and ego-motion estimation.
