mmWave Perception Paper Reading Notes: CVPR 2023, Hidden Gems: 4D Radar Scene Flow Learning Using Cross-Modal Supervision
原始笔记链接:https://mp.weixin.qq.com/s?__biz=Mzg4MjgxMjgyMg==&mid=2247486705&idx=1&sn=2e9c8be25d079fcf9dca9a9b67a90651&chksm=cf51be08f826371edc3226955f1acff0bb560f5256611b43ea91f9db6a579d5313b97ec598e5#rd
↑ Open the link above to read the full note
CVPR 2023 Highlight | Hidden Gems: 4D Radar Scene Flow Learning Using Cross-Modal Supervision

0 Abstract
The paper presents a cross-modal supervised learning method for 4D radar scene flow estimation.
Motivation *
Co-located sensing redundancy in modern autonomous vehicles
This redundancy provides diverse supervision cues for radar scene flow estimation.
Method: a multi-task network architecture designed for the identified cross-modal learning problem, with loss functions formulated from multiple cross-modal constraints for effective model training.
Experiments *
SOTA performance
Cross-modal supervised learning is shown to yield markedly more accurate 4D radar scene flow.
shown to be useful for two subtasks
✅ motion segmentation
✅ ego-motion estimation

1 Introduction
Scene flow estimation *
Definition of scene flow estimation
Estimating the 3D motion vector field of both the static and the moving parts of the environment relative to the ego-agent.
Importance of scene flow in the context of self-driving
✅ Provides motion cues for various tasks
Current scene flow estimation approaches *
Fully-supervised, weakly-supervised, or self-supervised learning.
Challenges of these approaches
✅ annotating scene flow for supervised learning is labor-intensive
✅ the often subpar performance of self-supervised learning methods
Specific challenges in 4D radar scene flow learning *
Rise of 4D automotive radars
4D radar is robust to adverse conditions and can measure target velocity.
Complications with 4D radar point clouds
The sparsity and noise of radar point clouds make scene flow annotation for supervised learning difficult.
Solution: Hidden Gems *
exploiting cross-modal supervision signals in autonomous vehicles
Modern autonomous vehicles carry multiple co-located sensors that provide redundant and complementary information.
The authors exploit this co-located perception redundancy to obtain diverse and useful supervision cues for radar scene flow learning.
Central research question: how to retrieve and exploit cross-modal supervision signals from on-board sensors to improve radar scene flow learning.
Useful supervision cues are drawn from a range of sensors, including the odometer (GPS/INS), LiDAR, and RGB camera.
✅ Train: Multi-modal data; Test: Only radar data
- The first cross-modal supervised 4D radar scene flow learning method
- A new multi-task model architecture with corresponding loss-function definitions
- State-of-the-art performance is demonstrated, and the method's usefulness for downstream tasks is evaluated
2 Related Work
- Scene flow *
Scene flow was first defined as a 3D uplift of optical flow
Traditional approaches to scene flow from either RGB or RGB-D images:
These methods rely on prior knowledge, or train deep networks with supervised or unsupervised learning.
Various approaches directly estimate instantaneous scene flow based on 3D sparse point clouds.
✅ These methods may rely on online optimization
Deep-learning-based methods are now the predominant approach for point-cloud-based scene flow estimation.
- Deep scene flow on point clouds *
State-of-the-art (SOTA) solutions rely on large annotated datasets for training (supervised).
Fully supervised training with ground-truth flow is labor-intensive and expensive because of scene flow annotation.
✅ simulated dataset for training: may result in poor generalization
Self-supervised learning frameworks aim to avoid annotation labor and the drawbacks of synthetic data.
✅ Exploit supervision signals from the input data
Performance is limited: no real labels are used to supervise the models.
There is a trade-off between annotation effort and performance.
Some methods combine ego-motion information with manually-annotated background segmentation labels.
✅ ego-motion is easily assessed from odometry sensors
However, the segmentation labels are still manually annotated and costly to obtain.
- Radar scene flow *
Previous studies are not directly applicable to sparse and noisy radar point clouds.
They typically infer scene flow on dense point clouds captured by LiDAR or rendered from stereo images.
A recent work introduces a self-supervised framework for radar scene flow estimation.
Without real supervision cues, however, its scene flow performance remains limited.
- The proposed approach *
Solution of the supervision problem:
Acquire supervisory signals by utilizing data from co-located sensors in a fully automated process.
✅ without resorting to any human intervention during training
The model depends on other modalities only during training; they are not needed at inference time.
3 Method
3.1 Problem Definition
Defines the task of scene flow estimation.
Scene flow estimation:
Aims to solve for a motion field that captures the non-rigid motion induced by both the ego-vehicle's movement and the moving objects in the environment.
The input of a point cloud-based scene flow method is a pair of consecutive point clouds: the source frame $\mathbf{P}^s$ (with $N$ points) and the target frame $\mathbf{P}^t$. For each point, $\mathbf{c}_i$ denotes its 3D coordinates and $\mathbf{x}_i \in \mathbb{R}^C$ its raw features.
The output is a set of point-wise 3D motion vectors $\mathbf{F} = \{\mathbf{f}_i \in \mathbb{R}^3\}_{i=1}^{N}$ that map each point of the source frame $\mathbf{P}^s$ to its corresponding position $\mathbf{c}_i' = \mathbf{c}_i^s + \mathbf{f}_i$ in the target frame.
Note that $\mathbf{P}^s$ and $\mathbf{P}^t$ need not contain the same number of points (there is no strict one-to-one correspondence), so the warped locations $\mathbf{c}_i'$ need not coincide with any point in the target point cloud $\mathbf{P}^t$ (a small data-layout sketch is given at the end of this subsection).
- In the context of 4D radar, the raw point features include:
1 the relative radial velocity (RRV):
✅ contain partial motion information
2 the radar cross-section (RCS) measurements
✅ depends on the target's reflective characteristics and the beam incidence angles
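A small numpy sketch of this data layout and of the warping relation $\mathbf{c}_i' = \mathbf{c}_i^s + \mathbf{f}_i$ (synthetic arrays; the two feature channels stand in for RRV and RCS):

```python
import numpy as np

# Source and target radar frames with different point counts (N != M is allowed).
N, M, C = 256, 240, 2
P_src = {"coords": np.random.randn(N, 3), "feats": np.random.randn(N, C)}
P_tgt = {"coords": np.random.randn(M, 3), "feats": np.random.randn(M, C)}

# A scene flow estimate assigns each source point a 3D motion vector f_i.
flow = np.zeros((N, 3))                  # placeholder prediction F = {f_i}
warped = P_src["coords"] + flow          # c_i' = c_i^s + f_i
# The warped points need not coincide with any point of P_tgt.
```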
3.2 Overview
presents the overall pipeline of the proposed method.

- The model estimates radar scene flow in two stages.
- Stage 1:
  - extract base feature descriptors from the input point cloud pair
  - two independent heads predict per-point moving probabilities and initial scene flow vectors; the probabilities are later thresholded into a binary motion segmentation mask
- Stage 2: estimates a rigid transformation corresponding to the radar's ego-motion.
- The final scene flow is obtained by refining the flow vectors of identified static points with this rigid transformation, which suppresses noise.
- The model produces several outputs:
  - a rigid-body transformation (the ego-motion estimate)
  - a motion segmentation mask (separating static from moving points)
  - the refined final scene flow
- Supervision: supervision signals are retrieved from co-located modalities.
- The model is trained end-to-end by minimizing a loss $\mathcal{L}$ with three components:
  - $\mathcal{L} = \mathcal{L}_{\text{ego}} + \mathcal{L}_{\text{seg}} + \mathcal{L}_{\text{flow}}$
  - $\mathcal{L}_{\text{ego}}$: measures the ego-motion error.
  - $\mathcal{L}_{\text{seg}}$: measures the motion segmentation error.
  - $\mathcal{L}_{\text{flow}}$: supervises the final scene flow.
3.3 Model Architecture
The encoder takes the input point clouds $\mathbf{P}^s$ and $\mathbf{P}^t$ and produces point-wise latent features. It first applies set conv layers to extract multi-scale local features from each point cloud, then uses a cost volume layer to compute feature correlations between points. The backbone outputs features $\mathbf{E} \in \mathbb{R}^{N \times C_e}$.
Two heads follow. The first head predicts an initial flow field $\hat{\mathbf{F}}^{\text{init}}=\{\hat{\mathbf{f}}_i^{\text{init}} \in \mathbb{R}^3\}_{i=1}^N$, capturing the basic motion across the input pair. The second head outputs a moving probability map $\hat{\mathbf{S}}=\{\hat{s}_i \in[0,1]\}_{i=1}^N$, which quantifies the likelihood that each point in $\mathbf{P}^s$ is moving. Both heads are multi-layer perceptrons (the probability map is bounded by a sigmoid activation); they operate independently but share the base features extracted by the backbone.
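A brief PyTorch-style sketch of the two heads; the layer widths and activations here are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class FlowHead(nn.Module):
    """Predicts an initial per-point 3D flow vector from backbone features."""
    def __init__(self, c_e: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c_e, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:   # feats: (N, C_e)
        return self.mlp(feats)                                 # (N, 3) initial flow

class MotionSegHead(nn.Module):
    """Predicts a per-point moving probability in [0, 1]."""
    def __init__(self, c_e: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c_e, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.mlp(feats)).squeeze(-1)      # (N,) probabilities
```

Both heads consume the same backbone features $\mathbf{E} \in \mathbb{R}^{N \times C_e}$ but are trained through different loss terms.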
Ego-motion Head *
Takes the correspondences $\{\mathbf{c}_i^s,\ \mathbf{c}_i^s+\hat{\mathbf{f}}_i^{\text{init}}\}_{i=1}^N$ together with the probability map $\hat{\mathbf{S}}$ as input.
Infers a rigid transformation that represents the radar’s ego-motion
✅ using the differentiable weighted Kabsch algorithm (see the sketch below)
A binary motion segmentation mask, obtained by thresholding $\hat{\mathbf{S}}$ with a fixed value $\eta_b$, indicates whether each point is moving and identifies the static points used for flow refinement.
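A minimal numpy sketch of the weighted Kabsch step; the real model uses a differentiable version inside the network, and weighting the correspondences so that likely-static points dominate is an assumption for illustration:

```python
import numpy as np

def weighted_kabsch(src, dst, w):
    """Rigid (R, t) minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2."""
    w = w / (w.sum() + 1e-8)
    mu_s = (w[:, None] * src).sum(axis=0)                # weighted centroids
    mu_d = (w[:, None] * dst).sum(axis=0)
    H = (w[:, None] * (src - mu_s)).T @ (dst - mu_d)     # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))               # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

Here it would be applied with src = $\mathbf{c}_i^s$, dst = $\mathbf{c}_i^s+\hat{\mathbf{f}}_i^{\text{init}}$, and per-point weights derived from $\hat{\mathbf{S}}$.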
Adjustment Layer * Refines the initial scene flow of static points using the ego-motion estimated by the ego-motion head: the initial flow vectors of identified static points are replaced with the rigid flow induced by the radar's ego-motion, yielding the final scene flow.
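A minimal numpy sketch of this refinement, assuming the estimated ego-motion is given as a rotation $\mathbf{R}$ and translation $\mathbf{t}$ that map source-frame coordinates of static points into the target frame (illustrative helper, not the authors' code):

```python
import numpy as np

def refine_flow(init_flow, coords_src, R, t, moving_prob, eta_b=0.5):
    """Replace the flow of predicted-static points with the ego-motion-induced rigid flow."""
    rigid_flow = coords_src @ R.T + t - coords_src   # f_i^r = T [c_i^s; 1] - c_i^s
    static = moving_prob < eta_b                     # binary motion segmentation mask
    final_flow = np.where(static[:, None], rigid_flow, init_flow)
    return final_flow, static
```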
Temporal Update Module * An optional component that propagates hidden states from one frame to the next. * A recurrent unit treats the backbone's global features as hidden states and updates them across consecutive frames.
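A possible sketch of the temporal update idea, assuming a GRU-style cell over pooled global backbone features (the specific recurrent unit and feature size are assumptions):

```python
import torch
import torch.nn as nn
from typing import Optional

class TemporalUpdate(nn.Module):
    """Carries a global hidden state from one radar frame pair to the next."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.gru = nn.GRUCell(input_size=feat_dim, hidden_size=feat_dim)

    def forward(self, global_feat: torch.Tensor,
                hidden: Optional[torch.Tensor] = None) -> torch.Tensor:
        # global_feat: (B, feat_dim) pooled backbone features of the current pair
        if hidden is None:
            hidden = torch.zeros_like(global_feat)
        return self.gru(global_feat, hidden)   # updated hidden state for the next frame
```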
The model thus addresses three tasks, whose outputs are tightly coupled.
3.4 Cross-Modal Supervision Retrieving
Describes how cross-modal supervision signals are retrieved from three co-located sensors (odometer, LiDAR, and RGB camera) so that the model can be trained without manual annotation.
- Ego-motion Loss * The supervision comes from the odometer ($\mathbf{O} \in \mathbb{R}^{4 \times 4}$).
  - The target rigid transformation $\mathbf{T}$ is obtained by the inverse transform $\mathbf{T}=\mathbf{O}^{-1}$, from which the rigid flow components $\{\mathbf{f}_i^r\}_{i=1}^N$ of static points can be derived.
  - The ego-motion loss in the vehicle coordinate frame is defined as:
$$\mathcal{L}_{\text{ego}}=\frac{1}{N}\sum_{i=1}^{N}\left\|(\hat{\mathbf{T}}-\mathbf{T})\,[\mathbf{c}_i^s\;\;1]^{\top}\right\|_2$$
  - Supervising $\hat{\mathbf{T}}$ also implicitly constrains the initial scene flow $\hat{\mathbf{f}}_i^{\text{init}}$ of static points and keeps the flow refinement consistent with the estimated ego-motion.
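A minimal numpy sketch of $\mathcal{L}_{\text{ego}}$, assuming $4\times4$ homogeneous transforms:

```python
import numpy as np

def ego_motion_loss(T_hat, T_gt, coords_src):
    """Mean distance between source points mapped by the predicted and target transforms."""
    hom = np.hstack([coords_src, np.ones((coords_src.shape[0], 1))])  # (N, 4) homogeneous
    diff = hom @ (T_hat - T_gt).T                                     # ((T_hat - T) [c_i; 1])^T
    return np.linalg.norm(diff[:, :3], axis=1).mean()
```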
- Motion Segmentation Loss *
  - Pseudo motion segmentation labels are built from two modalities: the odometer yields $\mathbf{S}^v=\{s_i^v \in\{0,1\}\}_{i=1}^N$, while the LiDAR-derived foreground information $\mathbf{S}^{fg}$ yields $\mathbf{S}^{l}=\{s_i^{l} \in\{0,1\}\}_{i=1}^N$ by marking points with apparent non-rigid flow as dynamic.
  - A more reliable pseudo motion segmentation $\mathbf{S}=\{s_i\}_{i=1}^N$ is obtained by fusing $\mathbf{S}^{l}$ and $\mathbf{S}^v$.
  - The loss is a class-balanced binary cross-entropy: half the sum of the average log-loss over the negative (static) points and the average log-loss over the positive (moving) points:
$$\mathcal{L}_{\text{seg}}=\frac{1}{2}\left(\frac{\sum_{i:\,s_i=0}-\log\left(1-\hat{s}_i\right)}{\sum_{i=1}^{N}\left(1-s_i\right)}+\frac{\sum_{i:\,s_i=1}-\log \hat{s}_i}{\sum_{i=1}^{N} s_i}\right)$$
- Scene Flow Loss *
  - The foreground flow term $\mathcal{L}_{\text{mot}}$ uses pseudo scene flow labels $\mathbf{F}^{fg}$ derived from LiDAR-based 3D MOT results; it averages the flow error over the pseudo-labeled moving points:
$$\mathcal{L}_{\text{mot}}=\frac{1}{\sum_{i=1}^{N} s_i}\sum_{i=1}^{N} s_i\left\|\hat{\mathbf{f}}_i-\mathbf{f}_i^{fg}\right\|_2$$
  - The optical flow term $\mathcal{L}_{\text{opt}}$ uses pseudo optical-flow labels $\mathbf{W}=\{\mathbf{w}_i \in \mathbb{R}^2\}_{i=1}^N$ from the camera and measures a point-to-ray distance:
$$\mathcal{L}_{\text{opt}}=\frac{1}{\sum_{i=1}^{N} s_i}\sum_{i=1}^{N} s_i\, D\!\left(\mathbf{c}_i^{s}+\hat{\mathbf{f}}_i,\ \mathbf{m}_i+\mathbf{w}_i,\ \theta\right)$$
  where $\mathbf{m}_i$ is the image projection of $\mathbf{c}_i^s$ and $\theta$ denotes the camera parameters.
  - A self-supervised loss $\mathcal{L}_{\text{self}}$ (as in [9]) complements the cross-modal supervision.
  - The full scene flow loss is $\mathcal{L}_{\text{flow}}=\mathcal{L}_{\text{mot}}+\lambda_{\text{opt}} \mathcal{L}_{\text{opt}}+\mathcal{L}_{\text{self}}$.
- Note: supervision signals retrieved from individual modalities are inevitably noisier than human annotation.
  - However, when these noisy signals are combined well, the overall supervision noise is suppressed and effective training becomes possible (see the sketch below).
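A minimal numpy sketch of the loss terms reconstructed above ($\mathcal{L}_{\text{seg}}$, $\mathcal{L}_{\text{mot}}$, $\mathcal{L}_{\text{opt}}$). The pinhole point-to-ray distance and all variable names are illustrative assumptions, points are assumed to be already expressed in the camera frame, and $\mathcal{L}_{\text{self}}$ from [9] is omitted:

```python
import numpy as np

def seg_loss(s_hat, s, eps=1e-7):
    """Class-balanced binary cross-entropy over pseudo moving/static labels."""
    pos, neg = s == 1, s == 0
    l_pos = -np.log(s_hat[pos] + eps).mean() if pos.any() else 0.0   # average over moving points
    l_neg = -np.log(1.0 - s_hat[neg] + eps).mean() if neg.any() else 0.0  # average over static points
    return 0.5 * (l_pos + l_neg)

def mot_loss(f_hat, f_fg, s):
    """L2 flow error averaged over pseudo-labeled moving points."""
    err = np.linalg.norm(f_hat - f_fg, axis=1)
    return (s * err).sum() / (s.sum() + 1e-7)

def point_to_ray(p_cam, pixel, K):
    """Distance from a 3D point (camera frame) to the camera ray through a pixel."""
    d = np.linalg.solve(K, np.append(pixel, 1.0))
    d = d / np.linalg.norm(d)                        # unit ray direction
    return np.linalg.norm(p_cam - (p_cam @ d) * d)   # remove the component along the ray

def opt_loss(warped_cam, m, w, K, s):
    """Point-to-ray distance between warped points and optical-flow-shifted pixels."""
    dists = np.array([point_to_ray(p, px + dw, K)
                      for p, px, dw in zip(warped_cam, m, w)])
    return (s * dists).sum() / (s.sum() + 1e-7)
```

In this sketch `warped_cam` stands for $\mathbf{c}_i^s+\hat{\mathbf{f}}_i$ transformed into the camera frame, `m` and `w` for the projections $\mathbf{m}_i$ and pseudo optical flow $\mathbf{w}_i$, and `K` for the intrinsics inside $\theta$.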
4 Experiments
4.1 Experimental Setup
Describes the dataset, evaluation metrics, baselines, and implementation details of the proposed framework.
Dataset *
- The View-of-Delft (VoD) dataset is used.
- It provides synchronized and calibrated data from co-located sensors: a 64-beam LiDAR, an RGB camera, an RTK-GPS/IMU-based odometer, and a 4D radar.
- New splits are created from the official sets for evaluation.
- Data frames are grouped into pairs of consecutive radar point clouds to form scene flow samples.
Metrics *
- Three main metrics for scene flow estimation: average end-point error (EPE), strict/relaxed accuracy (AccS/AccR), and resolution-normalized error (RNE).
- RNE is also computed separately for moving and static points, denoted MRNE and SRNE.
Baselines *
- Seven SOTA methods serve as baselines: five self-supervised learning-based methods and two non-learning methods.
- For fairness, their default configurations are adopted.
Implementation Details *
- Adam optimizer.
- A validation split is used for model selection and for choosing training hyperparameters.
- The motion segmentation threshold is set to 0.5.
- The temporal update module can optionally be enabled as a separate variant.
4.2 Scene Flow Evaluation
Overall results *
- CMFlow shows significant improvement over the baselines on all metrics.
  - ✅ 37.6% lower EPE than the second-best method
  - ✅ higher reliability for safety-critical applications
- Adding the temporal update scheme further improves CMFlow's performance.
Impact of modalities *
- The odometer provides the biggest performance gain.
  - ✅ most points are static, so ego-motion supervision is very effective
- LiDAR and camera also contribute gains.
  - ✅ they provide supervision for moving points
  - Their supervision is less accurate than the odometer's: optical flow errors caused by reflections, plus LiDAR object detection errors.
- The results validate that each modality is exploited effectively.
Impact of unannotated data *
- Increasing the amount of unannotated training data benefits CMFlow's performance.
- With only 20% more unannotated data, CMFlow already outperforms a fully-supervised method trained on significantly less annotated data.
- The findings suggest promising opportunities for leveraging large unannotated datasets.
Qualitative results *
- The estimated scene flow aligns the static background well and captures the objects of interest.
- Predictions are close to the ground truth.
- Overall, CMFlow achieves leading performance without relying on manual annotation; systematic experiments evidence the efficacy of cross-modal supervised learning.
4.3 Subtask Evaluation
Motion segmentation *
- CMFlow's predictions improve further with additional unlabeled samples.
- The pseudo labels derived from the odometer and LiDAR contribute significantly to performance.
- Qualitative analysis shows accurate segmentation of moving objects across scenarios.
Ego-motion estimation *
- Accurate frame-to-frame transformation estimation.
- Supervision from LiDAR and camera further improves performance.
- The accumulated long-term odometry significantly outperforms the ICP baseline.
Overall *
- CMFlow benefits downstream tasks:
  - motion segmentation
  - ego-motion estimation
- This illustrates the effectiveness of cross-modal learning.
5 Conclusion
Key points *
- CMFlow: a cross-modal supervised method for 4D radar scene flow estimation.
- The first method to obtain supervision from co-located sensors for this task.
- Trains without any manually annotated data.
- Experiments show superior performance on all metrics, demonstrating consistent effectiveness.
- With a sufficient amount of unlabeled training data, the approach outperforms the fully-supervised method.
- The model also brings significant improvements on two key subtasks: motion segmentation and ego-motion estimation.
