Few-Shot Object Detection via Classification Refinement and Distractor Retreatment (CVPR 2021)

1. Motivation
The current state-of-the-art approach TFA [17] is still far from satisfactory compared with general data-abundant detection tasks.
Given the fact that TFA is IOU-aware but less semantic discriminative, our key insight is to enhance the original classification results by injecting additional category-discriminative information.
We focus on a unique but practically existing problem of FSOD in this work: the presence of distractor samples due to incomplete annotations, where objects belonging to novel classes can exist in the base set but remain unlabelled.
The paper argues that TFA can be improved in two aspects:
- IOU awareness, i.e., robustness to hard negatives
- category discriminability, i.e., robustness to category confusion.
The probing method: take the predicted scores of a poor box for some object (IoU = 0.4). To test IOU awareness, the classification score of the ground-truth class is zeroed out while the other classes are left unchanged; to test category discriminability, all scores except the ground-truth class are zeroed out while the gt score is kept.
This took some thought; it seems best analyzed in combination with the focal/cross-entropy loss. In the first case, the score of the correct class is 0, so loss_+ becomes very large and the network tends to learn from such false-positive samples.
In the second case, all scores except the correct class are set to 0, so loss_{\_} is 0, and the network still tends to optimize the gt class.
loss_+ = -y \log \hat y \\ loss_{\_} = -(1-y) \log(1 - \hat y)
Through the experiments in Figure 1, the paper ablates the two aspects separately and concludes that TFA is IOU-aware but less discriminative.
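The two score-ablation probes described above can be sketched as follows (a minimal sketch; function names and tensor shapes are illustrative, not from the paper):

```python
import torch

def ablate_for_iou_awareness(scores, gt_class):
    """Probe IOU awareness: zero the ground-truth class score of a poor
    box (e.g. IoU ~ 0.4); any remaining high score is a hard negative."""
    out = scores.clone()
    out[gt_class] = 0.0
    return out

def ablate_for_discriminability(scores, gt_class):
    """Probe category discriminability: zero every score except the
    ground-truth class, removing cross-category confusion."""
    out = torch.zeros_like(scores)
    out[gt_class] = scores[gt_class]
    return out
```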

2. Contribution
The paper proposes the FSCN structure to improve FSOD performance; the correction network is used to eliminate false positives caused by category confusion.
We explore the limitations of the classifier rebalancing method (TFA) for FSOD problems and propose a novel few-shot classification refinement framework for exhaustively boosting its FSOD performance. A novel few-shot correction network is developed to achieve great semantic discriminability so as to eliminate false positives caused by category confusion.
The CGDP method is used during fine-tuning.
We are the first to address the destructive distractor issue for FSOD. Instead of blindly treating it, a confidence-guided filtering strategy is proposed to exclude the distractors for base detector fine-tuning.
A semi-supervised distractor utilization strategy is proposed to cooperate with FSCN, which not only stabilizes the training process but also significantly promotes the learning on data-scarce novel classes with no extra annotation cost.
Our proposed FSOD framework achieves the state-of-the-art results in various datasets with remarkable few-shot performance and knowledge retention ability.
3. Method

3.1 Problem Definition
- The distractor phenomenon in FSOD: some images \{I^{bs}_i\} in D_{bs} may contain unlabelled objects belonging to C_{nv}.
The paper points out that some images in D_b are missing annotations for D_n classes, so during the stage-one base detector training these novel objects are treated as background. In the stage-two fine-tuning, however, the training set covers novel + base classes, so leaving them unlabelled degrades performance: novel-class objects that were treated as background in the base images can no longer be regarded as background during fine-tuning.
- Therefore, handling the distractor through a delicate algorithm is of great significance to avoid the huge annotation cost and improve FSOD performance.
3.2. Framework Overview with Few-Shot Classification Refinement
The overall network structure is shown in Figure 3; it consists mainly of the base detector F_d(\cdot) and the FSCN F_r.
The input of F_r is the image crop produced from the base detector's box regression branch, I_p = Cr(I, p); F_r outputs classification scores over N_t + 1 classes, where N_t = N_{bs} + N_{nv}.

Here s_r is the confidence score over all classes plus background. The core idea is to use the FSCN to strengthen the base detector and thereby improve proposal classification; the two classification scores are combined by a weighted operation:
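The fusion equation itself is not reproduced in these notes; a plausible form (an assumption, to be checked against the paper) is a convex combination of the base detector score s_d and the FSCN score s_r:

```latex
s = (1 - \beta)\, s_d + \beta\, s_r, \qquad \beta \in [0, 1]
```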

3.3 Few-Shot Correction Network (FSCN)
The FSCN consists of two parts: a feature extractor \varphi_v and a linear classifier \varphi_w.
CGNL enlarges the receptive field of the FSCN:
- In this work, a Compact Generalized Non-Local (CGNL) module [21] is equipped with FSCN to achieve global receptive field
A cosine similarity metric is used:
- In this work, we introduce cosine similarity metric into FSCN, which can well encourage the unified recognition over all classes.
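A minimal sketch of a cosine-similarity classifier head (class name and the scale value are illustrative choices, not taken from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Cosine-similarity classifier: logits are scaled cosine similarities
    between L2-normalized features and L2-normalized class weights."""
    def __init__(self, feat_dim, num_classes, scale=20.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale = scale  # temperature; a common choice, assumed here

    def forward(self, x):
        x = F.normalize(x, dim=-1)            # unit-norm features
        w = F.normalize(self.weight, dim=-1)  # unit-norm class weights
        return self.scale * x @ w.t()         # (batch, num_classes) logits
```

Because every logit is a cosine similarity, all classes compete on the same scale, which is what "unified recognition over all classes" refers to.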
3.3.1 Network Description

3.3.2 Weight Imprinting for Novel Classes
The paper uses feature averaging (weight imprinting) to extend the FSCN from base to novel classes, expanding the classifier weights from \{w_j\}^{N_{bs}+1}_{j=1} to \{w_j\}^{N_{bs}+N_{nv}+1}_{j=1}.
- an intuitive way to set their weights is to average the corresponding normalized feature vectors z_p
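The imprinting step above can be sketched as follows (a minimal sketch; the function name and argument layout are illustrative):

```python
import torch
import torch.nn.functional as F

def imprint_novel_weights(base_weights, novel_features_per_class):
    """Weight imprinting: set each novel class weight to the average of its
    L2-normalized support features, then re-normalize to unit length.
    `novel_features_per_class`: list of (k_shot, feat_dim) tensors."""
    new_weights = []
    for feats in novel_features_per_class:
        z = F.normalize(feats, dim=-1)          # normalize each support feature
        w = F.normalize(z.mean(dim=0), dim=0)   # average, then unit-norm
        new_weights.append(w)
    return torch.cat([base_weights, torch.stack(new_weights)], dim=0)
```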

3.4. Semi-Supervised Distractor Utilization Loss
This loss is used in the FSCN.
- we formulate it as a semi-supervised learning problem and tackle it with the pseudo hard labeling technique.
- Specifically, given a background proposal I_{bp} from D_{bs}, its prediction confidence is used to assign a pseudo hard label.

However, if every background proposal were treated as a positive sample, there would be no negative samples during training. Therefore the original background class and the generated pseudo class C_p are merged (as a union) into C_b^+.
- However, if all the background proposals are labeled as positive samples of novel classes, there will be no negative samples for FSCN training.
- To address this issue, we further introduce a new concept of background augmentation, which defines an Augmented Background class C^+_{b} by merging the original background class C_b with the generated pseudo class C_p:
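The pseudo hard labeling step can be sketched as follows (the function name, threshold value, and the use of -1 for the augmented background are all illustrative assumptions, not the paper's exact formulation):

```python
import torch

def pseudo_hard_label(bg_scores, conf_thresh=0.8):
    """Assign pseudo hard labels to background proposals: a proposal whose
    max novel-class confidence exceeds the threshold gets that class as a
    pseudo label; otherwise it stays in the augmented background class C_b^+.
    `bg_scores`: (num_proposals, num_novel) softmax scores.
    Returns labels where -1 denotes the augmented background."""
    conf, cls = bg_scores.max(dim=1)
    labels = torch.full_like(cls, -1)
    keep = conf > conf_thresh
    labels[keep] = cls[keep]
    return labels
```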

The distractor utilization loss is defined as follows:
Only D_{bs} uses the DU loss during fine-tuning; for D_{nv}, plain cross-entropy is still used.

In addition, an unlabelled object mining (UOM) method is proposed to automatically select high-confidence unlabelled objects; see Eq. (11), which is similar to an IoU criterion.
- we propose an unlabelled object mining (UOM) strategy to automatically select the high-possibility unlabelled objects
- a spatial metric M_{sp} is developed for performing effective training sample selection
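The exact form of M_{sp} (Eq. 11) is not reproduced in these notes; a minimal IoU-style sketch (an assumption, with an illustrative threshold) for filtering candidate unlabelled boxes against the labelled ground truth:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def mine_unlabelled(candidates, gt_boxes, max_overlap=0.3):
    """Keep candidate boxes that overlap little with every labelled gt box,
    i.e., likely genuine unlabelled objects (threshold is illustrative)."""
    return [c for c in candidates
            if all(iou(c, g) < max_overlap for g in gt_boxes)]
```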

3.5. Confidence-Guided Dataset Pruning (CGDP)
CGDP is a method added to the base detector; it consists of two stages: an indicator learning stage and a pruning stage.
The motivation of CGDP is to remove distractors from D_{bs} and form a clean subset that facilitates few-shot generalization, using self-supervision to clean out the distractor samples.
The motivation for CGDP is to form a small clean subset with less distractors from D_{bs} to facilitate the base detector few-shot adaptation.
we aim at developing an automatic pipeline by taking the advantage of self-supervision to effectively clean the distractor samples.
Basically, our proposed CGDP is a two-stage process which consists of the indicator learning stage and the dataset pruning stage.
The query function Q estimates the likelihood that an image contains distractor objects.
The proposed query function Q(·) that estimates the likelihood of an image to have distractors is defined as
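The definition of Q(·) is not reproduced in these notes. A hypothetical stand-in (purely an assumption, not the paper's formula) would score an image by the highest novel-class confidence the indicator model assigns to any annotation-free region:

```python
def query_likelihood(bg_region_scores):
    """Hypothetical stand-in for Q(.): score an image by the maximum
    novel-class confidence over its background (annotation-free) regions;
    higher => more likely to contain an unlabelled distractor.
    `bg_region_scores`: list of per-region novel-class score lists."""
    if not bg_region_scores:
        return 0.0
    return max(max(scores) for scores in bg_region_scores)
```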

- we construct a class-specific data pool for each category:

The top m images are selected to form the clean set D_{cln}.
From each data pool, we select its top m samples which have the lowest likelihood in order to form the clean balanced training set D_{cln}.
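The per-pool selection can be sketched as follows (a minimal sketch; the function name and the dict-based pool representation are illustrative):

```python
def prune_dataset(pools, query_scores, m):
    """CGDP pruning sketch: from each class-specific data pool, keep the m
    images with the lowest distractor likelihood Q to form D_cln.
    `pools`: dict class -> list of image ids;
    `query_scores`: dict image id -> Q(image)."""
    clean = {}
    for cls, imgs in pools.items():
        ranked = sorted(imgs, key=lambda i: query_scores[i])  # ascending Q
        clean[cls] = ranked[:m]
    return clean
```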
Overall, the proposed CGDP pipeline can be formulated as:

