Reading notes on "Deep RGB-D Saliency Detection with Depth-Sensitive Attention and Automatic Multi-Modal Fusion"
Please credit the source when reposting.
Authors: Peng Sun, Wenhu Zhang, Huanyu Wang, Songyuan Li, Xi Li
CVPR 2021
The authors propose a deep RGB-D saliency detection network with depth-sensitive attention and automatic multi-modal fusion. Its two main contributions are a depth-sensitive RGB feature modeling scheme using the depth-wise geometric prior of salient objects, and an automatic architecture search approach for multi-modal multi-scale feature fusion.
The overall network architecture is designed as follows:

The RGB branch is based on VGG-19, while the depth branch is a lightweight depth network.
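As a point of reference, here is a minimal sketch of the RGB branch, assuming torchvision's VGG-19 split into five stages at the usual conv-block boundaries; the stage indices are my guess at the scales used for fusion, not necessarily the authors' exact configuration.

```python
import torch.nn as nn
from torchvision.models import vgg19

class RGBBranch(nn.Module):
    """VGG-19 backbone split into 5 stages, one feature map per scale."""
    def __init__(self):
        super().__init__()
        feats = vgg19(weights=None).features
        # Stage boundaries at conv1_2, conv2_2, conv3_4, conv4_4, conv5_4.
        self.stages = nn.ModuleList([
            feats[:4], feats[4:9], feats[9:18], feats[18:27], feats[27:36],
        ])

    def forward(self, x):
        outs = []
        for stage in self.stages:
            x = stage(x)
            outs.append(x)  # multi-scale RGB features F_k^rgb
        return outs
```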
The three design principles of the network are as follows:
1) Features from different modalities at the same scale are always fused, while features at different scales are selectively fused.
2) Low-level features are always combined with high-level features before the final prediction, as low-level features are rich in spatial details but lack semantic information, and vice versa.
3) An attention mechanism is necessary when fusing features from different modalities.
Depth-Sensitive Attention
The paper proposes a depth-sensitive RGB feature modeling scheme, consisting of depth decomposition and a depth-sensitive attention module (DSAM).
For depth decomposition, the raw depth map is decomposed into T+1 regions with the following steps. First, quantize the raw depth map into a depth histogram and choose its T largest depth distribution modes (corresponding to T depth interval windows). Then, using these interval windows, decompose the raw depth map into T regions; the remaining part of the histogram naturally forms the last region, as shown in Fig. 3(a). Finally, normalize each region into [0, 1] to serve as a spatial attention mask for the subsequent process. (In short: the depth map is split into T+1 regions according to its histogram, and each region is then normalized into a spatial attention mask.)
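To make this concrete, here is a minimal NumPy sketch of the decomposition, assuming a depth map already normalized to [0, 1], a fixed bin count, and single-bin interval windows around the T largest modes; decompose_depth, bins=32, and T=3 are illustrative choices, and the paper's exact windowing and normalization may differ.

```python
import numpy as np

def decompose_depth(depth, T=3, bins=32):
    """Decompose a depth map in [0, 1] into T+1 spatial attention masks."""
    hist, edges = np.histogram(depth, bins=bins, range=(0.0, 1.0))
    bin_idx = np.clip(np.digitize(depth, edges) - 1, 0, bins - 1)
    top = np.argsort(hist)[-T:]                  # the T largest depth modes
    masks = [(bin_idx == b).astype(np.float32) for b in top]
    masks.append((~np.isin(bin_idx, top)).astype(np.float32))  # last region
    # Normalize the depth values inside each region into [0, 1].
    out = []
    for m in masks:
        vals = depth * m
        out.append(vals / vals.max() if vals.max() > 0 else m)
    return out  # T + 1 masks, each the size of the raw depth map
```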
For the depth-sensitive attention module (DSAM), see Fig. 3(b):

Here, Pooling is a max-pooling operation that aligns the masks to the size of the stage-k RGB feature F_k^rgb, so that the pooled masks can act as spatial attention on F_k^rgb.
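As a hedged illustration, the following PyTorch sketch shows one plausible reading of Fig. 3(b): each depth-region mask is max-pooled to the resolution of F_k^rgb and applied as spatial attention, with a residual sum over regions. The dsam function and the residual fusion are assumptions for illustration, not the paper's exact formulation.

```python
import torch.nn.functional as F

def dsam(feat_rgb, masks):
    """Sketch of depth-sensitive attention: each depth-region mask is
    pooled to the feature's size and gates F_k^rgb as spatial attention.
    The residual sum over regions is an assumption, not the paper's formula."""
    h, w = feat_rgb.shape[-2:]
    out = feat_rgb
    for m in masks:  # each m: (B, 1, H, W) full-resolution region mask
        m_k = F.adaptive_max_pool2d(m, (h, w))  # Pooling: align mask to F_k^rgb
        out = out + feat_rgb * m_k              # depth-sensitive attention
    return out
```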


In this way, DSAM not only provides depth-wise geometric priors for the RGB features, but also suppresses troublesome background distraction (e.g., cluttered objects or similar textures).
Auto Multi-Modal Multi-Scale Feature Fusion
The paper designs four types of cells: multi-modal fusion (MM), multi-scale fusion (MS), global context aggregation (GA), and spatial information restoration (SR) cells. First, MM cells directly perform multi-modal feature fusion between the RGB and depth branches. Second, MS cells perform dense multi-scale feature fusion. Third, a GA cell seamlessly aggregates the outputs of the MS cells to capture the global context. Finally, SR cells combine low-level and high-level features to remedy the spatial detail loss caused by downsampling. The internal structure of the cells is determined by the automatic architecture search; see the sketch after this paragraph.
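To illustrate the "automatic" part, here is a minimal DARTS-style sketch of a searchable fusion step, in which candidate operations are mixed by softmax-normalized architecture parameters learned during search. The MixedFusion name, the candidate op set, and the add-then-transform wiring are illustrative assumptions; the paper's actual search space and cell topology may differ.

```python
import torch
import torch.nn as nn

class MixedFusion(nn.Module):
    """A searchable fusion step: softmax-weighted mixture of candidate ops."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
        ])
        # One architecture weight per candidate op, learned jointly with the net.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x_rgb, x_depth):
        x = x_rgb + x_depth                     # e.g. an MM cell's two inputs
        w = torch.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))
```

In DARTS-style search, the highest-weighted op in each mixture is kept after search, yielding a discrete cell.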




Experiments
Comparison with State-of-the-art


Ablation Analysis




