
[WSIS] Weakly-supervised Instance Segmentation via Class-agnostic Learning with Salient Images


1. Motivation

Weakly-supervised instance segmentation (WSIS):

Weakly-supervised instance segmentation (WSIS) is important in computer vision for at least two reasons.

Humans have a strong class-agnostic object segmentation ability and can outline boundaries of unknown objects precisely, which motivates us to propose a box-supervised class-agnostic object segmentation (BoxCaseg) based solution for weakly-supervised instance segmentation.

The definition of BoxCaseg:

We think the key problem in WSIS is box-supervised class-agnostic object segmentation (BoxCaseg), i.e., given the object bounding box, we need to infer the pixel-level object mask.

Salient images and box-supervised images are illustrated in Figure 1 of the paper.

Related work

There is a class of methods [7, 36, 20] that utilize traditional image segmentation algorithms.

Dai et al. [7] first proposed to combine deep networks with hand-crafted object proposals for WSIS via iterative training and pseudo-labeling.

BBTP [16] designs a multi-instance learning (MIL) formulation to train a weakly-supervised segmentation model.

Our weak segmentation head is also based on BBTP and does not rely on any traditional algorithms.

SOD

Our method is also based on the development of salient object detection (SOD, also known as salient object segmentation) [19, 49, 29, 4, 28, 15, 46], which aims at finding visually attractive objects in an image and segmenting them as a binary mask.

2. Contribution

We propose a novel weakly-supervised instance segmentation method based on box-supervised class-agnostic object segmentation, in which class-agnostic precise salient object localization information is utilized as auxiliary memory to promote weakly-supervised instance segmentation during training.

We propose new mask merging and dropping strategies to obtain high-quality proxy masks for pseudo-training Mask R-CNN.

The class-agnostic segmentation model generalizes well. Using only 7991 salient images that are disjoint from PASCAL and COCO, a box-supervised instance segmentation method obtains, for the first time, performance similar to its fully-supervised counterpart. On COCO, our result is significantly better than previous state-of-the-art WSIS methods.

3. Method

3.1 Weakly-supervised Instance Segmentation via Multi-instance Learning

As shown in Figure 3, BBTP uses positive bags (a bag inside the box needs to contain only one pixel belonging to the object). The positive and negative bags are defined as follows:

The entire region of the object is in the bounding box and each row or each column in the bounding box must contain at least one pixel belonging to the object. BBTP regards these rows and columns as positive bags.

Other rows outside the bounding box are treated as negative bags.

Bag classification probabilities are generated by max-pooling the pixel classification probabilities.

Once the positive and negative bags are generated, their classification probabilities are generated by max-pooling pixel classification probabilities.
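As a concrete illustration, the bag construction and max-pooled bag probabilities can be sketched in PyTorch as follows. This is a minimal sketch assuming a single ground-truth box and a single-channel sigmoid score map; the function name and the `eps` constant are mine, not from the paper.

```python
import torch

def mil_loss(score_map, box, eps=1e-6):
    # score_map: (H, W) sigmoid scores S(p); box: (x1, y1, x2, y2) in pixels.
    x1, y1, x2, y2 = box
    inside = score_map[y1:y2, x1:x2]
    # Positive bags: every row and every column crossing the box contains
    # at least one object pixel, so max-pool each row/column of the box.
    pos = torch.cat([inside.max(dim=0).values,   # one bag per column
                     inside.max(dim=1).values])  # one bag per row
    loss = -torch.log(pos + eps).mean()
    # Negative bags: rows entirely outside the box contain no object
    # pixel, so their max-pooled scores are pushed towards zero.
    neg_rows = torch.cat([score_map[:y1], score_map[y2:]], dim=0)
    if neg_rows.numel() > 0:
        neg = neg_rows.max(dim=1).values
        loss = loss - torch.log(1.0 - neg + eps).mean()
    return loss
```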

3.2 Joint Training with Salient Images

3.2.1 Salient Images

The salient images come from the DUTS-TR dataset.

Each salient image is guaranteed to contain only one object, so the object is relatively large and occupies a large portion of the image.

The salient images are used to provide fine-grained object boundary information.

3.2.2 Data Augmentation

Unlike BBTP, which uses RoI Align to perform detection and segmentation for WSIS simultaneously, this paper does not use the RoI Align operation; objects are simply cropped out and used as the network input. The crop (data augmentation) differs between salient images and box-supervised images.

Salient images are first resized to 320x320 and then randomly cropped to 288x288 as input. For box-supervised images, the cropping strategy is different:

This data augmentation strategy shifts the position and shape of the bounding box and introduces background information.
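Since the exact box-supervised cropping procedure was given in a figure, the following is only a hypothetical sketch of the two augmentations: the salient-image branch follows the stated 320-to-288 resize-and-crop, while `augment_box_supervised` illustrates one plausible way to shift the box and introduce background (the `max_expand` parameter and the random expansion scheme are my assumptions).

```python
import random
from PIL import Image

def augment_salient(img: Image.Image) -> Image.Image:
    # Salient images: resize to 320x320, then take a random 288x288 crop.
    img = img.resize((320, 320))
    x, y = random.randint(0, 32), random.randint(0, 32)
    return img.crop((x, y, x + 288, y + 288))

def augment_box_supervised(img: Image.Image, box, max_expand=0.2) -> Image.Image:
    # Hypothetical: randomly expand each side of the GT box before
    # cropping, which shifts the box position/shape inside the crop
    # and brings in background pixels.
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    x1 = max(0, x1 - random.uniform(0, max_expand) * w)
    y1 = max(0, y1 - random.uniform(0, max_expand) * h)
    x2 = min(img.width,  x2 + random.uniform(0, max_expand) * w)
    y2 = min(img.height, y2 + random.uniform(0, max_expand) * h)
    return img.crop((int(x1), int(y1), int(x2), int(y2))).resize((288, 288))
```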

3.3. Multi-task Learning


Figure 4 shows the framework for jointly training box-supervised, class-agnostic segmentation. The network takes salient and box-supervised images as input and extracts convolutional feature maps with an HRNet backbone, which are split into weak and salient feature maps. The weak feature maps are passed to the weak segmentation head to compute the MIL loss; the salient feature maps are passed simultaneously to the weak segmentation, transferred segmentation, and salient segmentation heads, and the latter two heads jointly compute pixel-wise losses.

The three heads, shown on the right of Figure 4, have exactly the same network structure.

Training BoxCaseg involves three tasks: MIL training for box-supervised images, MIL training for salient images, and pixel-labeling for salient images.
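To make the routing concrete, here is a rough sketch of the joint model; the backbone and the head internals are assumptions, and only the feature routing follows Figure 4.

```python
import torch.nn as nn

def make_head(ch):
    # All three heads share the same structure (Fig. 4, right);
    # the exact layers used here are an assumption.
    return nn.Sequential(
        nn.Conv2d(ch, ch, 3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(ch, 1, 1),
        nn.Sigmoid(),
    )

class BoxCaseg(nn.Module):
    def __init__(self, backbone, ch=48):
        super().__init__()
        self.backbone = backbone            # HRNet in the paper
        self.weak_head = make_head(ch)      # MIL loss, all images
        self.salient_head = make_head(ch)   # pixel-wise loss, salient images
        # In the paper the transferred head's parameters are generated
        # from the weak head via the weight transfer module (Sec. 3.3.2);
        # it is kept as a plain module here only for brevity.
        self.transferred_head = make_head(ch)

    def forward(self, x, is_salient: bool):
        f = self.backbone(x)
        out = {"weak": self.weak_head(f)}   # MIL training for every image
        if is_salient:                      # pixel-labeling heads: salient images only
            out["salient"] = self.salient_head(f)
            out["transferred"] = self.transferred_head(f)
        return out
```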

3.3.1 MIL loss for weak segmentation

The MIL loss is given by Equation 1:
$$\mathcal{L}_{mil}(S) = -\sum_{B \in \mathcal{B}^{+}} \log \max_{p \in B} S(p) \;-\; \sum_{B \in \mathcal{B}^{-}} \log\Big(1 - \max_{p \in B} S(p)\Big) \;+\; \mathcal{L}_{smooth}(S) \tag{1}$$

Here B is a bag, B \in \mathcal{B}^{+} \cup \mathcal{B}^{-}, and S is a sigmoid score map, S \in \{S_s, S_w\}; p is a position, and the sigmoid keeps each score S(p) in [0, 1]. The loss is optimized so that the maximum sigmoid score inside each positive bag becomes large, making -\log \max_{p \in B} S(p) small, while the maximum score inside each negative bag becomes small, making -\log(1 - \max_{p \in B} S(p)) small. In other words, in Equation 1 only the highest-scoring pixel of each positive/negative bag (effectively a further selection step) contributes its score S(p) to the loss.

The last term is a smoothness term, given by Equation 2:
$$\mathcal{L}_{smooth}(S) = \sum_{p} \sum_{p' \in \Omega(p)} \big|S(p) - S(p')\big| \tag{2}$$

where \Omega(p) denotes the eight-connected neighborhood pixels around pixel p.
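The smoothness term is straightforward to implement with shifted slices, counting each 8-connected neighbor pair once; a sketch assuming `S` is a 2-D torch tensor of sigmoid scores:

```python
def smooth_term(S):
    # Absolute score differences between each pixel and its
    # 8-connected neighbours Omega(p); each unordered pair counted once.
    diffs = [
        (S[1:, :] - S[:-1, :]).abs(),     # vertical neighbours
        (S[:, 1:] - S[:, :-1]).abs(),     # horizontal neighbours
        (S[1:, 1:] - S[:-1, :-1]).abs(),  # "\" diagonal neighbours
        (S[1:, :-1] - S[:-1, 1:]).abs(),  # "/" diagonal neighbours
    ]
    return sum(d.sum() for d in diffs)
```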

3.3.2 Pixel-wise loss for salient segmentation and transferred segmentation

Through a weight transfer method, the pixel-wise labeling parameters learned on salient images are made to better suit box-supervised images.

The design of the weight transfer module:

As for the weight transfer module, we simply use a two-layer multi-layer perceptron (MLP) with leaky ReLU as activation function.

Moreover, weight transfer is a one-way process: its input is detached from the weak segmentation head, so gradients do not back-propagate into the weak segmentation head.

Note that the weight transfer is a single-direction process; since it is detached from the weak segmentation head, the gradients do not back-propagate to the weak segmentation head.

Training thus optimizes only the transfer MLP.
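A sketch of the module under these constraints; representing the head weights as a flattened vector and the hidden width are my assumptions, while the two-layer MLP, the leaky ReLU, and the detach follow the paper.

```python
import torch
import torch.nn as nn

class WeightTransfer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim),
            nn.LeakyReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, weak_head_weights: torch.Tensor) -> torch.Tensor:
        # detach() makes the transfer one-way: no gradient flows back
        # into the weak segmentation head; only the MLP is optimized.
        return self.mlp(weak_head_weights.detach())
```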

The pixel-wise loss is formulated as follows:
$$\mathcal{L}_{pixel} = (1 - \alpha)\,\ell(S_{s}, M) + \alpha\,\ell(S_{t}, M) \tag{3}$$

where M denotes the GT mask of the salient image, S_s and S_t denote the score maps output by the salient and transferred heads respectively, \ell is a per-pixel segmentation loss, and \alpha \in [0,1] weights the transferred segmentation head.

The total loss is:
$$\mathcal{L} = \mathcal{L}_{mil} + \lambda\,\mathcal{L}_{pixel} \tag{4}$$

where λ is equal to 0 if the input comes from box-supervised images, otherwise 1. Last but not least, during the training of the proposed network, there are both box-supervised images and salient images in a mini-batch.
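Putting Equations 3 and 4 together, the loss routing can be sketched as below; the binary cross-entropy choice for \ell and the function names are assumptions.

```python
import torch.nn.functional as F

def pixel_loss(S_s, S_t, M, alpha=0.5):
    # Eq. 3 (assumed form): per-pixel BCE on the salient head output S_s
    # and the transferred head output S_t, with alpha weighting the latter.
    return (1 - alpha) * F.binary_cross_entropy(S_s, M) + \
           alpha * F.binary_cross_entropy(S_t, M)

def total_loss(l_mil, l_pixel, from_salient_image):
    # Eq. 4: lambda = 0 for box-supervised inputs, 1 for salient images.
    lam = 1.0 if from_salient_image else 0.0
    return l_mil + lam * l_pixel
```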

3.4. Training a Proxy Mask R-CNN

The trained BoxCaseg model is used to generate proxy masks for the training set.

we can use the BoxCaseg model to generate proxy masks on the training set.

Specifically, patches are cropped according to the GT bounding boxes, and the salient segmentation head and the transferred head of BoxCaseg produce the predicted masks. A Mask R-CNN is then trained on the bounding box annotations and the proxy masks.

In our method, the Mask R-CNN model is trained with the bounding box annotations and the proxy masks, thus called Proxy Mask R-CNN.

To reduce errors caused by multiple highly overlapping objects within one bounding box, the authors propose merge and drop strategies.

To reduce the errors in the masks generated by BoxCaseg, we propose a merge and drop, i.e., merge masks using the strategy of smaller object better and drop masks via the proxy box agreement rule, which are detailed as follows.

3.4.1 Merging via smaller object better

When pixels belong to multiple predicted objects, they are assigned to the smaller object; the rationale is that small objects are usually contained within large ones.

When merging segmentation results, for those pixels belonging to multiple predicted masks, we assign them to the smallest objects.
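A sketch of the merging rule; the (N, H, W) boolean-mask representation is an assumption.

```python
import numpy as np

def merge_smaller_object_better(masks):
    # masks: (N, H, W) boolean masks, one per predicted object.
    # Paint larger objects first so that smaller objects, painted
    # last, win every contested pixel.
    areas = masks.reshape(len(masks), -1).sum(axis=1)
    canvas = np.full(masks.shape[1:], -1, dtype=np.int64)  # -1 = background
    for idx in np.argsort(areas)[::-1]:                    # largest -> smallest
        canvas[masks[idx]] = idx
    return canvas  # per-pixel object index
```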

3.4.2 Dropping via proxy box agreement

After applying the merging strategy, low-quality masks are further removed.

Since only GT boxes are available (there are no GT masks), the authors take the proxy box, i.e., the bounding box of a proxy mask, and compute the box IoU between the proxy box and the GT box as the proxy box agreement; proxy masks with low box IoU are dropped and ignored during back-propagation.
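A sketch of the dropping rule; the 0.5 IoU threshold is an illustrative assumption, not the paper's value.

```python
import numpy as np

def box_iou(a, b):
    # a, b: (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def keep_proxy_mask(proxy_mask, gt_box, thresh=0.5):
    # Proxy box = tight bounding box of the proxy mask; drop the mask
    # (ignore it during back-propagation) if its box IoU with the GT
    # box, the "proxy box agreement", is low.
    ys, xs = np.nonzero(proxy_mask)
    if xs.size == 0:
        return False
    proxy_box = (xs.min(), ys.min(), xs.max() + 1, ys.max() + 1)
    return box_iou(proxy_box, gt_box) >= thresh
```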

4. Experiment

4.1 Comparison with the State-of-the-art Methods


4.2. Ablation Studies

4.2.1 Less salient images for joint training


4.2.2 The effectiveness of the three segmentation heads


4.2.3 Sampling strategies for imbalanced training images


4.2.4 The weight of transferred segmentation head


4.2.5 The strategies for merging and deleting proxy masks

