[FGD] Focal and Global Knowledge Distillation for Detectors (CVPR 2022)

1. Motivation
The authors point out that in object detection, the teacher (tea) and student (stu) features differ considerably across regions, for example between foreground and background.
- In this paper, we point out that in object detection, the features of the teacher and student vary greatly in different areas, especially in the foreground and background.
If every region is distilled in the same way, these uneven differences across the feature map make distillation less effective.
- If we distill them equally, the uneven differences between feature maps will negatively affect the distillation.
The paper therefore proposes FGD, which is split into focal distillation and global distillation.
- Thus, we propose Focal and Global Distillation (FGD).
- Focal distillation separates the foreground and background, forcing the student to focus on the teacher’s critical pixels and channels.
- Global distillation rebuilds the relation between different pixels and transfers it from teachers to students, compensating for missing global information in focal distillation.
Figure 1 shows that the student's attention map responds more strongly on the foreground than on the background. This indicates that distillation is also affected by the foreground/background imbalance.

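Table 1 reports the authors' experiment that decouples foreground (fg) and background (bg) features; the worst distillation result there is 38.9. This motivates focal distillation, which picks out the critical pixels and channels, combined with a GcBlock that extracts global features.

The paper uses GcBlock for global feature extraction.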

2. Contribution
We present that the pixels and channels that teacher and student pay attention to are quite different. If we distill the pixels and channels without distinguishing them, it will result in a trivial improvement.
We propose focal and global distillation, which enables the student not only to focus on the teacher’s critical pixels and channels, but also to learn the relation between pixels.
We verify the effectiveness of our method on various detectors via extensive experiments on the COCO [21], including one-stage, two-stage, anchor-free methods, achieving state-of-the-art performance.
3. Method

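The authors start from an example: the conventional feature distillation loss, which penalizes the difference between the teacher feature F^T and the student feature F^S.

Here f is an adaptation layer that reshapes F^S to the same dimensions as F^T.
However, this approach distills every part of the feature map equally and does not capture the global relations between pixels.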
- However, such methods treat all the parts equally and lack the distillation of the global relations between different pixels.
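
The formula image for this baseline loss is missing from these notes. A reconstruction, assuming the paper's notation (F^T and F^S are teacher and student feature maps with C channels, height H and width W; f is the adaptation layer described above):

```latex
L_{fea} = \frac{1}{CHW} \sum_{k=1}^{C} \sum_{i=1}^{H} \sum_{j=1}^{W}
          \left( F^{T}_{k,i,j} - f\left(F^{S}\right)_{k,i,j} \right)^{2}
```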
3.1. Focal Distillation
First, a binary mask M separates foreground from background: 1 for foreground pixels, 0 for background pixels.
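
A minimal NumPy sketch of such a binary mask. The function name `binary_mask` and the box format `(x1, y1, x2, y2)` in feature-map coordinates (end-exclusive) are assumptions for illustration, not the authors' code:

```python
import numpy as np

def binary_mask(boxes, height, width):
    """Return M with M[i, j] = 1 inside any ground-truth box and 0 elsewhere.

    boxes: iterable of (x1, y1, x2, y2) integer boxes, assumed to be already
    mapped to feature-map coordinates, with x2 and y2 exclusive.
    """
    mask = np.zeros((height, width), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = 1.0  # rows index y, columns index x
    return mask

# One box covering 3 rows and 2 columns of a 5 x 6 feature map.
M = binary_mask([(1, 1, 3, 4)], height=5, width=6)
```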

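Next, to treat small and large ground-truth objects equally regardless of their area, and to balance the foreground/background ratio, the authors introduce a scale mask S.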

- If a pixel belongs to different targets, we choose the smallest box to calculate the S (an additional constraint)
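
The scale mask can be sketched as follows, assuming (as I read the paper) that S is 1/(H_r * W_r) inside a box r of height H_r and width W_r, and 1/N_bg on background pixels, where N_bg is the number of background pixels. The function name and box format are hypothetical:

```python
import numpy as np

def scale_mask(boxes, height, width):
    """S[i, j] = 1 / (box area) inside a box, 1 / N_bg on the background.

    Where boxes overlap, the smallest box wins, i.e. the largest 1 / area.
    boxes: iterable of (x1, y1, x2, y2) integer boxes in feature-map
    coordinates, end-exclusive (an assumption of this sketch).
    """
    scale = np.zeros((height, width), dtype=np.float64)
    foreground = np.zeros((height, width), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        area = (y2 - y1) * (x2 - x1)
        if area <= 0:
            continue  # skip degenerate boxes
        region = scale[y1:y2, x1:x2]  # a view into `scale`
        np.maximum(region, 1.0 / area, out=region)  # keep the smallest box
        foreground[y1:y2, x1:x2] = True
    n_bg = np.count_nonzero(~foreground)
    if n_bg > 0:
        scale[~foreground] = 1.0 / n_bg
    return scale

# A 4 x 4 box with a 2 x 2 box nested inside it, on a 4 x 6 feature map.
S = scale_mask([(0, 0, 4, 4), (1, 1, 3, 3)], height=4, width=6)
```

With this normalization, each non-overlapping object and the whole background contribute a total weight of 1, which is how the mask evens out object sizes.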

Spatial and channel attention are then computed from the features:
G^S can be understood as an H x W x 1 attention map, and G^C as a 1 x 1 x C attention map.
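
A small sketch of the two maps, assuming they are the mean absolute activation over channels (G^S) and over spatial positions (G^C), which is how I read the paper. For simplicity the singleton axes are dropped, so the outputs have shapes (H, W) and (C,):

```python
import numpy as np

def spatial_attention_map(feat):
    """G^S: mean of |F| over the channel axis. feat is (C, H, W) -> (H, W)."""
    return np.abs(feat).mean(axis=0)

def channel_attention_map(feat):
    """G^C: mean of |F| over the spatial axes. feat is (C, H, W) -> (C,)."""
    return np.abs(feat).mean(axis=(1, 2))

# Toy feature map with 2 channels on a 3 x 4 grid.
feat = np.arange(-12, 12, dtype=np.float64).reshape(2, 3, 4)
G_s = spatial_attention_map(feat)
G_c = channel_attention_map(feat)
```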


The attention masks A^S and A^C are then defined from G^S and G^C.
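
The formula image is missing here. As I recall the paper, each mask is a temperature softmax of the corresponding map, rescaled by the number of elements: A^S = H * W * softmax(G^S / T) and A^C = C * softmax(G^C / T). A sketch under that assumption; the default `temperature=0.5` is only an illustrative value:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_masks(feat, temperature=0.5):
    """Return (A^S, A^C) for a (C, H, W) feature map.

    A^S has shape (H, W) and sums to H * W; A^C has shape (C,) and sums to C.
    """
    c, h, w = feat.shape
    g_s = np.abs(feat).mean(axis=0)        # spatial map G^S, (H, W)
    g_c = np.abs(feat).mean(axis=(1, 2))   # channel map G^C, (C,)
    a_s = h * w * softmax(g_s.ravel() / temperature).reshape(h, w)
    a_c = c * softmax(g_c / temperature)
    return a_s, a_c

rng = np.random.default_rng(0)
a_s, a_c = attention_masks(rng.normal(size=(3, 4, 5)))
```

A smaller T sharpens the masks toward the highest-response pixels and channels; a larger T flattens them toward uniform weights.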


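The feature loss has two terms, computed over the background (bg) and the foreground (fg) respectively and balanced by two hyperparameter coefficients. During training, both A^S and A^C are taken from the teacher model.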
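
The formula image is missing from these notes. A reconstruction of the focal feature loss, assuming the paper's notation: M is the binary mask, S the scale mask, A^S and A^C the teacher's spatial and channel attention masks, f the adaptation layer, and α and β the two balancing hyperparameters (foreground term first, background term second):

```latex
\begin{aligned}
L_{fea} ={}& \alpha \sum_{k=1}^{C} \sum_{i=1}^{H} \sum_{j=1}^{W}
             M_{i,j}\, S_{i,j}\, A^{S}_{i,j}\, A^{C}_{k}
             \left( F^{T}_{k,i,j} - f\left(F^{S}\right)_{k,i,j} \right)^{2} \\
         &+ \beta \sum_{k=1}^{C} \sum_{i=1}^{H} \sum_{j=1}^{W}
             \left(1 - M_{i,j}\right) S_{i,j}\, A^{S}_{i,j}\, A^{C}_{k}
             \left( F^{T}_{k,i,j} - f\left(F^{S}\right)_{k,i,j} \right)^{2}
\end{aligned}
```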

- Attention loss:
- Besides, we use attention loss L_at to force the student detector to mimic the spatial and channel attention mask of the teacher detector (L1 loss)
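
The formula image is missing here as well. A reconstruction, assuming the paper's notation: the subscripts T and S mark the teacher's and the student's masks, l is the L1 loss, and γ is a balancing hyperparameter. The focal loss is then the sum of the feature loss and the attention loss:

```latex
L_{at} = \gamma \left( l\left(A^{S}_{T}, A^{S}_{S}\right)
                     + l\left(A^{C}_{T}, A^{C}_{S}\right) \right),
\qquad
L_{focal} = L_{fea} + L_{at}
```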


3.2 Global loss

- As shown in Fig. 4, we utilize GcBlock [2] to capture the global relation information in a single image and force the student detector to learn the relation from the teacher detector.
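
The global loss formula is missing from these notes. A reconstruction as I recall it from the paper, to be checked against the original: R is the GcBlock transform, W_k, W_{v1} and W_{v2} are convolution layers, LN is layer normalization, N_p is the number of pixels in the feature map, and λ is a balancing hyperparameter:

```latex
L_{global} = \lambda \sum \left( \mathcal{R}\left(F^{T}\right)
                               - \mathcal{R}\left(F^{S}\right) \right)^{2}

\mathcal{R}(F) = F + W_{v2}\left( \mathrm{ReLU}\left( \mathrm{LN}\left(
                 W_{v1} \sum_{j=1}^{N_p}
                 \frac{e^{W_k F_j}}{\sum_{m=1}^{N_p} e^{W_k F_m}}\, F_j
                 \right) \right) \right)
```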

The total loss of the student model combines the original detection loss with the focal and global distillation losses.
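
The formula image is missing. Written out, assuming L_original denotes the detector's own training loss and the balancing hyperparameters are already folded into the focal and global terms:

```latex
L = L_{original} + L_{focal} + L_{global}
```

Because the distillation terms are computed only on feature maps, the same total loss applies to different detector types.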

4. Experiments
The paper adopts the inherit strategy from "General Instance Distillation for Object Detection" (CVPR 2021): when the student and teacher share the same head structure, the student model is initialized with the teacher's weights.
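
A rough sketch of what such an initialization could look like, using plain dictionaries of NumPy arrays in place of real model state. The function name, the parameter-name prefixes `neck.` and `head.`, and the choice of which parts are inherited are all assumptions of this sketch, not the authors' code:

```python
import numpy as np

def inherit_from_teacher(student_params, teacher_params,
                         prefixes=("neck.", "head.")):
    """Copy teacher parameters into the student where name and shape match.

    Only parameters whose names start with one of `prefixes` are considered.
    Returns the list of inherited parameter names.
    """
    inherited = []
    for name, value in teacher_params.items():
        if not name.startswith(prefixes):
            continue  # e.g. the backbone differs between teacher and student
        target = student_params.get(name)
        if target is not None and target.shape == value.shape:
            student_params[name] = value.copy()
            inherited.append(name)
    return inherited

teacher = {"backbone.w": np.ones((8, 8)), "neck.w": np.ones((4, 4)),
           "head.w": np.ones(2)}
student = {"backbone.w": np.zeros((4, 4)), "neck.w": np.zeros((4, 4)),
           "head.w": np.zeros(3)}
inherited = inherit_from_teacher(student, teacher)
```

Parameters with mismatched shapes are left at the student's own initialization, so the sketch degrades gracefully when the structures only partly agree.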
4.1 Main results


4.2 Ablation
4.2.1 Sensitivity study of different losses

4.2.2 Sensitivity study of focal distillation

4.2.3 Sensitivity study of global distillation

4.2.4 Sensitivity study of T

4.2.5 Sensitivity study of hyper-parameters

