[COD] Camouflaged Object Detection (CVPR 2020, Oral)

Contents
1. Motivation
2. Contribution
3. Related Work
- 3.1 Generic and Salient Object Detection
- 3.2 Camouflaged Object Detection
- 3.2.1 Types of Camouflage
- 3.2.2 COD Formulation
- 3.2.3 Evaluation Metrics
4. Dataset
- 4.1 Professional Annotation
- 4.2 Dataset Features and Statistics
5. Proposed Framework
- 5.1 Overview
- 5.2 Search Module (SM)
- 5.3 Identification Module (IM)
- 5.4 Partial Decoder Component (PDC)
6. Experiment
- 6.1 Performance on CHAMELEON, CAMO-Test and COD10K-Test
- 6.2 Qualitative Analysis
1. Motivation

This paper studies camouflaged object detection (COD), which is defined as follows:
Camouflaged object detection (COD) aims to identify objects that are ‘seamlessly’ embedded in their surroundings.
Because the target and the background in COD share intrinsic similarities, COD is more challenging than traditional object detection.
The high intrinsic similarities between the target object and the background make COD far more challenging than the traditional object detection task.
Applications of COD:
COD is also beneficial for applications in the fields of computer vision, medical image segmentation, agriculture and art.
The COD field lacks datasets.
Currently, camouflaged object detection is not well studied due to the lack of a sufficiently large dataset.
2. Contribution
This paper proposes the COD10K dataset.
To address this issue, we elaborately collect a novel dataset, called COD10K, which comprises 10,000 images covering camouflaged objects in various natural scenes, over 78 object categories.
This paper also builds the Search Identification Network (SINet) framework for COD.
In addition, we develop a simple but effective framework for COD, termed Search Identification Network (SINet).
COD10K differs from existing COD datasets in three ways:
COD10K contains 10K images in 78 categories, covering aquatic, terrestrial, and amphibian classes.
It contains 10K images covering 78 camouflaged object categories, such as aquatic, flying, amphibians, and terrestrial, etc.
COD10K provides rich annotations that can be used for multiple tasks.
All the camouflaged images are hierarchically annotated with category, bounding-box, object-level, and instance-level labels, facilitating many vision tasks, such as localization, object proposal, semantic edge detection [42], task transfer learning [69], etc.
High-quality annotations give deeper insight into algorithm performance.
Each camouflaged image is assigned with challenging attributes found in the real-world and matting-level [73] labeling (requiring ∼60 minutes per image). These high-quality annotations could help with providing deeper insight into the performance of algorithms.
Figure 1 compares a summary of COD10K with the two earlier datasets used for COD.

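3. Related Work

Figure 2 distinguishes three tasks: generic object detection (GOD), salient object detection (SOD), and COD.
3.1 Generic and Salient Object Detection
GOD is one of the most popular research directions in computer vision; typical GOD tasks include semantic segmentation and panoptic segmentation.
One of the most popular directions in computer vision is generic object detection.
Typical GOD tasks include semantic segmentation and panoptic segmentation.
SOD targets the most attention-grabbing object in an image. Camouflaged is the opposite of salient.
That is, positive samples (images containing a salient object) can be utilized as the negative samples in a COD dataset.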
3.2 Camouflaged Object Detection
3.2.1 Types of Camouflage
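Camouflaged images fall into two types: natural camouflage and artificial camouflage.
Natural camouflage is used by animals; artificial camouflage occurs in products (for example, all kinds of defects) or is used in games to hide information.
Camouflaged images can be roughly split into two types: those containing natural camouflage and those with artificial camouflage.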
3.2.2 COD Formulation
Unlike class-dependent semantic segmentation, COD is a class-independent task. COD formulation: given an image, every pixel i is assigned a confidence p_i \in [0, 1]. A value of 0 means the pixel does not belong to a camouflaged object, and a value of 1 means it does.
Given an image, the task requires a camouflaged object detection approach to assign each pixel i a confidence p_i ∈ [0,1], where p_i denotes the probability score of pixel i.
A score of 0 is given to pixels that don’t belong to the camouflaged objects, while a score of 1 indicates that a pixel is fully assigned to the camouflaged objects.
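The formulation can be illustrated with a minimal sketch. It assumes a model emits one raw score (logit) per pixel and that a sigmoid maps it to the confidence p_i; the names `to_confidence` and `to_mask` and the 0.5 threshold are illustrative assumptions, not taken from the paper.

```python
import math

def to_confidence(logits):
    """Map a 2D grid of raw scores to per-pixel confidences p_i in [0, 1]."""
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in logits]

def to_mask(confidence, threshold=0.5):
    """Binarize a confidence map: 1 = camouflaged object, 0 = background."""
    return [[1 if p >= threshold else 0 for p in row] for row in confidence]

# Toy 2x3 "image" of logits (hypothetical values).
logits = [[-4.0, -1.0, 2.0],
          [-3.0, 0.5, 3.0]]
conf = to_confidence(logits)
mask = to_mask(conf)
print(mask)
```

Because the task is class-independent, a single confidence channel is enough; no per-class scores are needed.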
3.2.3 Evaluation Metrics
The original MAE alone is not well suited to this task.
The paper uses three metrics:
- A human visual perception based E-measure (E_{\phi}), which simultaneously evaluates the pixel-level matching and image-level statistics.
- Since camouflaged objects often contain complex shapes, COD also requires a metric that can judge structural similarity. We utilize the S-measure (S_\alpha) [12] as our alternative metric.
- The weighted F-measure (F^w_{\beta}) [43], which can provide more reliable evaluation.
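For reference, MAE itself is simple to compute. The sketch below (the function name `mae` and the toy maps are illustrative assumptions) shows one commonly noted limitation: two predictions whose errors sit in different places receive exactly the same MAE, which is one reason structure-aware measures are reported as well.

```python
def mae(pred, gt):
    """Mean absolute error between a confidence map and a binary ground truth."""
    total = sum(abs(p - g)
                for prow, grow in zip(pred, gt)
                for p, g in zip(prow, grow))
    return total / (len(gt) * len(gt[0]))

gt = [[0, 0, 0, 0, 0],
      [0, 1, 1, 0, 0],
      [0, 1, 1, 0, 0]]

# Prediction A misses one pixel of the object itself.
pred_a = [[0, 0, 0, 0, 0],
          [0, 0, 1, 0, 0],
          [0, 1, 1, 0, 0]]

# Prediction B keeps the object intact but fires on a distant background pixel.
pred_b = [[0, 0, 0, 0, 1],
          [0, 1, 1, 0, 0],
          [0, 1, 1, 0, 0]]

print(mae(pred_a, gt), mae(pred_b, gt))  # both are 1/15
```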
4. Dataset


Our goals for studying and developing a dataset for COD are:
(1) to provide a new challenging task,
(2) to promote research in a new topic,
(3) to spark novel ideas.
4.1 Professional Annotation

Figure 4 shows the annotation information in COD10K.
Categories
5 super-classes and 69 sub-classes.
Bounding boxes
Attributes
Objects/Instances (masks as in COCO; this mainly corresponds to COCO's mask information and appears to be instance-level annotation)
4.2 Dataset Features and Statistics
Figures 4 and 6 show some statistics of COD10K.
- Object size
- Global/Local contrast
- Center bias
- Quality control
- Super/Sub-class distribution
- Resolution distribution
- Dataset splits

5. Proposed Framework

5.1 Overview
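Figure 8 shows the SINet framework. It contains two main components, the receptive field (RF) and the partial decoder component (PDC).
The RF is introduced to mimic the structure of RFs in the human visual system.
The PDC reproduces the search and identification stages of animal predation.
SINet is inspired by the two stages of hunting and consists of two main modules: the search module (SM) and the identification module (IM).
The SM searches for the camouflaged object, and the IM then detects it precisely.
5.2 Search Module (SM)
The RF module is motivated by mimicking pRFs, so as to incorporate more discriminative feature representations.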
This motivates us to use an RF [41, 68] component to incorporate more discriminative feature representations during the searching stage.
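Given an input image I \in R^{W \times H \times 3}, a set of features \{ \mathcal X_k \}^4_{k=0} is extracted from ResNet-50. The resolution of each level is \{ [\frac{H}{k}, \frac{W}{k}], k=4,4,8,16,32 \} (the second level keeps the same resolution as the first). The five levels are combined through concatenation, up-sampling, and down-sampling operations to obtain \{rf^s_k, k=1,2,3,4 \}.
Low-level features from shallow layers preserve spatial information about object boundaries, while high-level features from deep layers retain semantic information for locating objects.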
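As a quick sanity check of these resolutions, the strides k = 4, 4, 8, 16, 32 can be applied to a concrete input size. The 352x352 input below is an assumption, chosen because it makes the stride-8 level 44x44, the size that appears in the network figure; `pyramid_sizes` is an illustrative helper name.

```python
def pyramid_sizes(height, width, strides=(4, 4, 8, 16, 32)):
    """Return the (H/k, W/k) size of each backbone level X_0..X_4."""
    return [(height // k, width // k) for k in strides]

# Assumed input size: 352x352.
sizes = pyramid_sizes(352, 352)
for level, (h, w) in enumerate(sizes):
    print(f"X{level}: {h}x{w}")
```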
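The Receptive Field (RF) component has five branches. In each branch, the first Bconv (Conv+BN+ReLU) is a 1x1 convolution that reduces the number of channels to 32. It is followed by a (2k-1)x(2k-1) Bconv (the code and the network diagram show that this is actually a [1x(2k-1)] Bconv followed by a [(2k-1)x1] Bconv) and a 3x3 Bconv with dilation rate (2k-1). The first four branches are concatenated, and a 1x1 Bconv (3x3 in the code) reduces the channels to 32. Finally, the fifth branch is added to this fused feature and a ReLU is applied, giving the feature rf_k. The code is as follows: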
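import torch
import torch.nn as nn

class BasicConv2d(nn.Sequential):
    # Not shown in the original excerpt; assumed here to be Conv + BN + ReLU.
    def __init__(self, in_channel, out_channel, kernel_size, padding=0, dilation=1):
        super().__init__(
            nn.Conv2d(in_channel, out_channel, kernel_size, padding=padding, dilation=dilation, bias=False),
            nn.BatchNorm2d(out_channel),
            nn.ReLU(inplace=True))

class RF(nn.Module):
    def __init__(self, in_channel, out_channel):
        super().__init__()
        # The first branch is missing from the excerpt; a 1x1-only branch is assumed so conv_cat sees 4 branches.
        self.branch0 = BasicConv2d(in_channel, out_channel, 1)
        self.branch1 = nn.Sequential(
            BasicConv2d(in_channel, out_channel, 1),  # conv + bn + ReLU
            BasicConv2d(out_channel, out_channel, kernel_size=(1, 3), padding=(0, 1)),
            BasicConv2d(out_channel, out_channel, kernel_size=(3, 1), padding=(1, 0)),
            BasicConv2d(out_channel, out_channel, 3, padding=3, dilation=3),  # 3x3 conv, dilation 3
        )
        self.branch2 = nn.Sequential(
            BasicConv2d(in_channel, out_channel, 1),  # conv + bn + ReLU
            BasicConv2d(out_channel, out_channel, kernel_size=(1, 5), padding=(0, 2)),
            BasicConv2d(out_channel, out_channel, kernel_size=(5, 1), padding=(2, 0)),
            BasicConv2d(out_channel, out_channel, 3, padding=5, dilation=5),  # 3x3 conv, dilation 5
        )
        self.branch3 = nn.Sequential(
            BasicConv2d(in_channel, out_channel, 1),  # conv + bn + ReLU
            BasicConv2d(out_channel, out_channel, kernel_size=(1, 7), padding=(0, 3)),
            BasicConv2d(out_channel, out_channel, kernel_size=(7, 1), padding=(3, 0)),
            BasicConv2d(out_channel, out_channel, 3, padding=7, dilation=7),  # 3x3 conv, dilation 7
        )
        self.conv_cat = BasicConv2d(4 * out_channel, out_channel, 3, padding=1)
        self.conv_res = BasicConv2d(in_channel, out_channel, 1)  # fifth (shortcut) branch
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Concatenate the first four branches, fuse them, add the fifth branch, then apply ReLU.
        x_cat = torch.cat([self.branch0(x), self.branch1(x), self.branch2(x), self.branch3(x)], dim=1)
        return self.relu(self.conv_cat(x_cat) + self.conv_res(x))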
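A quick way to see why every branch keeps the spatial resolution: with stride 1, a convolution's output length along one axis is in + 2*padding - dilation*(kernel - 1), so the paddings in the code above are chosen to cancel the kernel extent exactly. The helper name `conv_out` and the 44-pixel side length are illustrative assumptions; the formula is the standard one for a 2D convolution.

```python
def conv_out(size, kernel, padding=0, dilation=1, stride=1):
    """Output length of a convolution along one axis (standard formula)."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

size = 44  # example feature-map side length
results = []
for k in (3, 5, 7):
    along = conv_out(size, kernel=k, padding=k // 2)            # 1xk or kx1 conv, long axis
    across = conv_out(size, kernel=1, padding=0)                # same conv, short axis
    dilated = conv_out(size, kernel=3, padding=k, dilation=k)   # 3x3 conv with dilation k
    results.append((k, along, across, dilated))
    print(k, along, across, dilated)
```

Since all branches return maps of the same size, they can be concatenated along the channel axis without any resizing.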
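5.3 Identification Module (IM)
The IM detects the camouflaged object. The paper extends the partial decoder component (PDC) with densely connected features.
Specifically, the PDC integrates the four levels of features from the SM, from which the coarse camouflage map C_s is computed.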

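The authors introduce attention because it can suppress interference from irrelevant features. The paper presents a search attention (SA) module that enhances the middle-level features X2 and yields the enhanced camouflage map C_h.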

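g(\cdot) is the SA function, a Gaussian filter with standard deviation 32 and kernel size 4. Judging from the code, after the max operation the attention map and x are multiplied element-wise (torch.mul, broadcast over the channels), not by matrix multiplication: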
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parameter import Parameter

class SA(nn.Module):
    """
    holistic attention src
    """
    # _get_kernel and min_max_norm are helpers that are not shown in this excerpt.
    def __init__(self):
        super(SA, self).__init__()
        gaussian_kernel = np.float32(_get_kernel(31, 4))  # shape: [31, 31] array
        gaussian_kernel = gaussian_kernel[np.newaxis, np.newaxis, ...]  # [1, 1, 31, 31]
        self.gaussian_kernel = Parameter(torch.from_numpy(gaussian_kernel))  # [1, 1, 31, 31]

    def forward(self, attention, x):
        # attention is C_s, x is X2
        # attention: e.g. torch.randn(1, 1, 44, 44)
        # x: e.g. torch.randn(1, 512, 44, 44)
        soft_attention = F.conv2d(attention, self.gaussian_kernel, padding=15)  # [1, 1, 44, 44]
        soft_attention = min_max_norm(soft_attention)  # normalization
        # Element-wise max of the blurred and original maps, then element-wise product with x.
        x = torch.mul(x, soft_attention.max(attention))
        return x  # [1, 512, 44, 44]
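The logic of the SA module can be traced on a tiny example without any deep learning library. The sketch below is a 1D simplification under stated assumptions: the helper names, the kernel size 5, and sigma 1.0 are illustrative and are not the repository's `_get_kernel` or its settings. It follows the same four steps: blur the coarse map, min-max normalize, take the element-wise max with the original map, and multiply every feature channel by the result.

```python
import math

def gaussian_kernel_1d(size, sigma):
    """Normalized 1D Gaussian weights (illustrative stand-in for a 2D kernel)."""
    half = size // 2
    w = [math.exp(-((i - half) ** 2) / (2.0 * sigma ** 2)) for i in range(size)]
    total = sum(w)
    return [v / total for v in w]

def blur_1d(signal, kernel):
    """Same-size filtering with zero padding."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out

def min_max_norm(values, eps=1e-8):
    """Rescale values to roughly [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo + eps) for v in values]

def search_attention(attention, features):
    """Gate each feature channel by max(blurred attention, attention)."""
    soft = min_max_norm(blur_1d(attention, gaussian_kernel_1d(5, 1.0)))
    gate = [max(s, a) for s, a in zip(soft, attention)]
    return [[f * g for f, g in zip(channel, gate)] for channel in features]

attention = [0.0, 0.0, 1.0, 0.0, 0.0]       # coarse map with one confident position
features = [[1.0] * 5, [2.0] * 5]           # two feature channels
enhanced = search_attention(attention, features)
print(enhanced[0])
```

The blur widens the attended region, so positions next to a confident detection are attenuated rather than zeroed, while the max keeps the original confident position at full strength.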
Finally, the PDC and RF are used to aggregate the other three levels of features and obtain the final camouflage map C_i.

5.4 Partial Decoder Component (PDC)
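The computation in the PDC is not hard to follow. It only adds an element-wise operation; the core is still aggregating multi-level features while up-sampling them to the size of rf_4^c (1/8 of the input resolution, 44x44 in the figure). According to the code (this can also be inferred from the SINet network figure in the paper), the PDC outputs a single-channel grayscale map, I^{PDC} \in R^{H/8 \times W/8 \times 1}.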

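The total loss of the network is the sum of the cross-entropy losses of C_{csm} and C_{cim} against the ground truth. C_{csm} and C_{cim} are the two camouflaged object maps obtained from C_s and C_i after 8x up-sampling.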
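The loss can be sketched as follows, assuming a mean binary cross-entropy over pixels and equal weighting of the two terms, i.e. L = L_CE(C_{csm}, G) + L_CE(C_{cim}, G). The function names and the toy maps are illustrative; in practice a framework's built-in loss would be used.

```python
import math

def bce(pred, gt, eps=1e-7):
    """Mean binary cross-entropy between a confidence map and a binary ground truth."""
    total, count = 0.0, 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(g * math.log(p) + (1 - g) * math.log(1.0 - p))
            count += 1
    return total / count

def total_loss(c_csm, c_cim, gt):
    """Sum of the search-stage and identification-stage losses (equal weights assumed)."""
    return bce(c_csm, gt) + bce(c_cim, gt)

gt = [[0, 1],
      [1, 0]]
c_csm = [[0.4, 0.6],
         [0.6, 0.4]]   # coarse map: right side of 0.5 but not confident
c_cim = [[0.1, 0.9],
         [0.9, 0.1]]   # refined map: more confident
loss = total_loss(c_csm, c_cim, gt)
print(round(loss, 4))
```

Supervising both maps means the search stage receives a training signal of its own rather than relying only on gradients that flow back from the final output.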

6. Experiment
The authors build 12 baselines on 3 datasets.
6.1 Performance on CHAMELEON, CAMO-Test and COD10K-Test

6.2 Qualitative Analysis

