Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training
Abstract. Although two-stage object detectors have continuously advanced the state-of-the-art performance in recent years, the training process itself is far from crystal clear. In this work, we first point out the inconsistency problem between the fixed network settings and the dynamic training procedure, which greatly affects the performance. For example, the fixed label assignment strategy and regression loss function cannot fit the distribution change of proposals and thus are harmful to training high quality detectors. Consequently, we propose Dynamic R-CNN to adjust the label assignment criteria (IoU threshold) and the shape of the regression loss function (parameters of SmoothL1 Loss) automatically based on the statistics of proposals during training. This dynamic design makes better use of the training samples and pushes the detector to fit more high quality samples. Specifically, our method improves upon the ResNet-50-FPN baseline by 1.9% AP and 5.5% AP90 on the MS COCO dataset with no extra overhead. Codes and models are available at https://github.com/hkzhang95/DynamicRCNN.
1 Introduction
Benefiting from the advances in deep convolutional neural networks (CNNs) [22,40,16,14], object detection has made remarkable progress in recent years. Modern detection frameworks can be divided into two major categories: one-stage detectors [37,32,29] and two-stage detectors [12,11,38]. And various improvements have been made in recent studies [43,26,47,48,25,24,31,5,20]. In the training procedure of both kinds of pipelines, a classifier and a regressor are adopted to solve the recognition and localization tasks, respectively. Therefore, an effective training process plays a crucial role in achieving high quality object detection.
Different from the image classification task, the annotations for the classification task in object detection are the ground-truth boxes in the image. So it is not clear how to assign positive and negative labels to the proposals in classifier training, since their separation may be ambiguous. The most widely used strategy is to set a threshold on the IoU between the proposal and the corresponding ground-truth. As mentioned in Cascade R-CNN [3], training with a certain IoU threshold will lead to a classifier that degrades the performance at other IoUs. However, we cannot directly set a high IoU threshold from the beginning of training due to the scarcity of positive samples. The solution that Cascade R-CNN provides is to gradually refine the proposals through several stages, which is effective yet time-consuming. As for the regressor, the problem is similar: during training the quality of proposals improves, while the parameter in SmoothL1 Loss is fixed, which leads to insufficient training of the high quality proposals.
To solve this issue, we first examine an overlooked fact: the quality of proposals is indeed improved during training, as shown in Figure 1. We can find that even under different IoU thresholds, the number of positives still increases significantly. Inspired by these illuminating observations, we propose Dynamic R-CNN, a simple yet effective method to better exploit the dynamic quality of proposals for object detection. It consists of two components: Dynamic Label Assignment and Dynamic SmoothL1 Loss, which are designed for the classification and regression branches, respectively. First, to train a better classifier that is discriminative for high IoU proposals, we gradually adjust the IoU threshold for positive/negative samples based on the proposal distribution during training. Specifically, we set the threshold as the IoU of the proposal at a certain percentile, since it can reflect the quality of the overall distribution. For regression, we change the shape of the regression loss function to adaptively fit the distribution change of the regression labels and ensure the contribution of high quality samples to training. In particular, we adjust the β in SmoothL1 Loss based on the regression label distribution, since β actually controls the magnitude of the gradient of small errors (shown in Figure 4).

Figure 1. (a) The number of positive proposals under different IoU thresholds during training. The curves show that the number of positives changes significantly during training, and the corresponding regression label distribution changes as well (σx and σw denote the standard deviations of x and w, respectively). (b) The IoU and regression label of the 75th and 10th most accurate proposals during training. These curves further illustrate the improvement of proposal quality.

By this dynamic scheme, we can not only alleviate the data scarcity issue at the beginning of training, but also harvest the benefit of high IoU training. These two modules explore different parts of the detector, and thus can work collaboratively towards high quality object detection. Furthermore, despite its simplicity, Dynamic R-CNN brings consistent performance gains on MS COCO [30] with almost no extra computational complexity in training. And during the inference phase, our method does not introduce any additional overhead. Moreover, extensive experiments verify that the proposed method can generalize to other baselines with stronger performance.
2 Related Work
Region-based object detectors. The general practice of region-based object detectors is to convert the object detection task into a bounding box classification problem and a regression problem. In recent years, region-based approaches have been the leading paradigm with top performance. For example, R-CNN [12], Fast R-CNN [11] and Faster R-CNN [38] first generate candidate region proposals, then randomly sample a small batch with a certain foreground-background ratio from all the proposals. These proposals are fed into a second stage to classify the categories and refine the locations at the same time. Later, some works extended Faster R-CNN to address different problems. R-FCN [7] makes the whole network fully convolutional to improve the speed, and FPN [28] proposes a top-down pathway to combine multi-scale features. Besides, various improvements have been witnessed in recent studies [18,44,26,27,52].
Classification in object detection. Recent studies focus on improving the object classifier from various perspectives [29,19,34,42,25,49,6,17]. The classification scores in detection not only determine the semantic category of each proposal, but also imply the localization accuracy, since Non-Maximum Suppression (NMS) suppresses less confident boxes using more reliable ones, ranking the resultant boxes by classification score. However, as mentioned in IoU-Net [19], the classification score has low correlation with localization accuracy, which leads to noisy ranking and limited performance. Therefore, IoU-Net [19] adopts an extra branch for predicting IoU scores and refining the classification confidence. Softer NMS [17] devises a KL loss to model the variance of bounding box regression directly, and uses it for voting in NMS. Another direction of improvement is raising the IoU threshold for training high quality classifiers, since training with different IoU thresholds leads to classifiers of corresponding quality. However, as mentioned in Cascade R-CNN [3], directly raising the IoU threshold is impractical due to the vanishing positive samples. Therefore, to produce high quality training samples, some approaches [3,48] adopt sequential stages, which are effective yet time-consuming. Essentially, these methods ignore the inherent dynamic property of the training procedure, which is useful for training high quality classifiers.
Bounding box regression. It has been proved that the performance of models depends on the relative weights between the losses in multi-task learning [21]. Cascade R-CNN [3] also adopts different regression normalization factors to adjust the magnitude of the regression term in different stages. Besides, Libra R-CNN [34] proposes to promote the regression gradients from the accurate samples, and SABL [45] localizes each side of the bounding box with a lightweight two-step bucketing scheme for precise localization. However, these works mainly focus on a fixed scheme, ignoring the dynamic distribution of the learning targets during training.
Dynamic training. Various studies follow the idea of dynamic training. A widely used example is adjusting the learning rate based on the training iteration [33]. Besides, Curriculum Learning [1] and Self-paced Learning [23] focus on improving the training order of the examples. Moreover, for object detection, hard mining methods [39,29,34] can also be regarded as a dynamic approach. However, they do not handle the core issues in object detection such as the fixed label assignment strategy. Our method is complementary to theirs.
3 Dynamic Quality in the Training Procedure
Generally speaking, object detection is complex since it needs to solve two main tasks: recognition and localization. The recognition task needs to distinguish foreground objects from backgrounds and determine their semantic categories. Meanwhile, the localization task needs to find accurate bounding boxes for different objects. To achieve high quality object detection, we need to further explore the training processes of both tasks as follows.
3.1 Proposal Classification
How to assign labels is an interesting question for the classifier in object detection. It differs from other classification problems since the annotations are the ground-truth boxes in the image. Obviously, a proposal should be negative if it does not overlap with any ground-truth, and a proposal should be positive if its overlap with a ground-truth is 100%. However, it is a dilemma whether a proposal with an IoU of 0.5 should be labeled as positive or negative.
In Faster R-CNN [38], labels are assigned by comparing a box's highest IoU with the ground-truths against a pre-defined IoU threshold. Formally, the paradigm can be formulated as follows (we take a binary classification loss for simplicity):
$$
\text{label} =
\begin{cases}
1, & \text{if } \max \mathrm{IoU}(b, G) \ge T_{+} \\
0, & \text{if } \max \mathrm{IoU}(b, G) < T_{-} \\
-1, & \text{otherwise}
\end{cases}
\tag{1}
$$
Here, b denotes a bounding box, G the set of ground-truths, and T+ and T− the positive and negative IoU thresholds, respectively. The labels 1, 0 and −1 stand for positives, negatives and ignored samples. For the second stage of Faster R-CNN, both T+ and T− are set to 0.5 by default. Thus the definitions of positives and negatives are essentially hand-crafted.
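To make Equation (1) concrete, here is a minimal PyTorch sketch of this fixed-threshold assignment (the tensor layout and function name are illustrative assumptions, not the authors' released code):

```python
import torch

def assign_labels(ious, t_pos=0.5, t_neg=0.5):
    # ious: (num_proposals, num_gts) IoU matrix between proposals and
    # ground-truth boxes. Returns 1 / 0 / -1 per proposal as in Equation (1).
    max_iou, _ = ious.max(dim=1)             # highest IoU with any ground-truth
    labels = torch.full_like(max_iou, -1.0)  # ignored by default
    labels[max_iou >= t_pos] = 1.0           # positives
    labels[max_iou < t_neg] = 0.0            # negatives
    return labels
```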
Since the goal of the classifier is to distinguish the positives and negatives, training with different IoU thresholds leads to classifiers of corresponding quality [3]. Therefore, to achieve high quality object detection, we need to train the classifier with a high IoU threshold. However, as mentioned in Cascade R-CNN, directly raising the IoU threshold is impractical due to the vanishing positive samples. Cascade R-CNN uses several sequential stages to lift the IoU of the proposals, which is effective yet time-consuming.
So is there a way to get the best of both worlds? As mentioned above, the quality of proposals actually improves along with training. This observation inspires us to take a progressive approach in training: at the beginning, the proposal network is not capable of producing enough high quality proposals, so we use a lower IoU threshold to better accommodate these imperfect proposals in second-stage training. As training goes on, the quality of proposals improves and we gradually obtain enough high quality proposals. As a result, we may increase the threshold to better utilize them to train a high quality detector that is more discriminative at higher IoUs. We will formulate this process in the following section.
3.2 Bounding Box Regression
The task of bounding box regression is to regress the positive candidate bounding box b to a target ground-truth g. This is learned under the supervision of the regression loss function Lreg. To make the regression labels invariant to scale and location, Lreg operates on the offset ∆ = (δx, δy, δw, δh) defined by
$$
\delta_x = \frac{g_x - b_x}{b_w}, \quad
\delta_y = \frac{g_y - b_y}{b_h}, \quad
\delta_w = \log\frac{g_w}{b_w}, \quad
\delta_h = \log\frac{g_h}{b_h}
\tag{2}
$$
Since bounding box regression is performed on these offsets, the absolute values in Equation (2) can be very small. To balance the different terms in multi-task learning, ∆ is usually normalized by pre-defined mean and standard deviation, as widely adopted in many works [38,28,15].
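A minimal sketch of this target encoding, assuming boxes in (cx, cy, w, h) form; the mean/stdev values below are illustrative defaults found in many public codebases, not necessarily the paper's:

```python
import torch

def encode_offsets(b, g, means=(0.0, 0.0, 0.0, 0.0), stds=(0.1, 0.1, 0.2, 0.2)):
    # b, g: (N, 4) proposal and ground-truth boxes as (cx, cy, w, h).
    dx = (g[:, 0] - b[:, 0]) / b[:, 2]
    dy = (g[:, 1] - b[:, 1]) / b[:, 3]
    dw = torch.log(g[:, 2] / b[:, 2])
    dh = torch.log(g[:, 3] / b[:, 3])
    deltas = torch.stack([dx, dy, dw, dh], dim=1)  # Equation (2)
    # Normalize by the pre-defined mean and stdev to balance the loss terms.
    return (deltas - deltas.new_tensor(means)) / deltas.new_tensor(stds)
```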
However, we discover that the distribution of regression labels shifts during training. As shown in Figure 2, we calculate the statistics of the regression labels under different iterations and IoU thresholds. First, from the first two columns, we find that under the same IoU threshold for positives, the mean and stdev are decreasing as training goes on due to the improved quality of proposals. With the same normalization factors, the contributions of those high quality samples are reduced based on the definition of the SmoothL1 Loss function, which is harmful to the training of high quality regressors. Moreover, with a higher IoU threshold, the quality of positive samples is further enhanced, so their contributions are reduced even more, which greatly limits the overall performance. Therefore, to achieve high quality object detection, we need to fit the distribution change and adjust the shape of the regression loss function to compensate for the increase of high quality proposals.

Figure 2. The ∆ distribution at different iterations and IoU thresholds (points are randomly sampled for clarity). Columns 1 and 2: under the same IoU threshold, the regression labels become more concentrated as training goes on. Columns 2 and 3: at the same iteration, raising the IoU threshold changes the distribution significantly.
4 Dynamic R-CNN
To better exploit the dynamic property of the training procedure, we propose Dynamic R-CNN which is shown in Figure 3. Our key insight is adjusting the second stage classifier and regressor to fit the distribution change of proposals. The two components designed for the classification and localization branch will be elaborated in the following sections.

Figure 3. The overall pipeline of the proposed Dynamic R-CNN. Considering the dynamics of the training procedure, Dynamic R-CNN consists of two main components: (a) the Dynamic Label Assignment (DLA) process and (b) the Dynamic SmoothL1 Loss (DSL), which work from different perspectives. The left part of (a) shows that the number of high quality proposals increases as training goes on. As proposal quality improves, DLA automatically raises the IoU threshold based on the proposal distribution, and then assigns positive (green) and negative (red) labels to the proposals, as shown in the right part of (a). Meanwhile, in (b), the shape of the regression loss function is adjusted accordingly to fit the distribution change and compensate for the increase of high quality proposals. Best viewed in color.
4.1 Dynamic Label Assignment
The Dynamic Label Assignment (DLA) process is illustrated in Figure 3 (a). Based on the common practice of label assignment in Equation (1) in object detection, the DLA module can be formulated as follows:
$$
\text{label} =
\begin{cases}
1, & \text{if } \max \mathrm{IoU}(b, G) \ge T_{now} \\
0, & \text{if } \max \mathrm{IoU}(b, G) < T_{now}
\end{cases}
\tag{3}
$$
where Tnow stands for the current IoU threshold. Considering the dynamic property of training, the distribution of proposals changes over time. Our DLA updates Tnow automatically based on the statistics of proposals to fit this distribution change. Specifically, we first calculate the IoUs I between the proposals and their target ground-truths, and then select the KI-th largest value from I as the threshold Tnow. As training goes on, Tnow increases gradually, which reflects the improved quality of proposals. In practice, we first calculate the KI-th largest value in each batch, and then update Tnow every C iterations using the mean of these values to enhance the robustness of training. It should be noted that the calculation of IoUs is already done by the original method, so our method adds almost no extra complexity. The resulting IoU thresholds used in training are illustrated in Figure 3 (a).
4.2 Dynamic SmoothL1 Loss
The localization task for object detection is supervised by the commonly used SmoothL1 Loss, which can be formulated as follows:
$$
\mathrm{SmoothL1}(x, \beta) =
\begin{cases}
0.5\,|x|^2 / \beta, & \text{if } |x| < \beta \\
|x| - 0.5\,\beta, & \text{otherwise}
\end{cases}
\tag{4}
$$
Here x stands for the regression label. β is a hyperparameter controlling the range in which we should use a softer loss function, like the ℓ1 loss, instead of the original ℓ2 loss. Considering the robustness of training, β is set to 1.0 by default to prevent loss explosion caused by the poorly trained network in the early iterations. We also illustrate the influence of β in Figure 4, where changing β leads to different curves of loss and gradient. It is easy to find that a smaller β actually accelerates the saturation of the gradient magnitude, so that more accurate samples contribute more to network training.
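For reference, Equation (4) translates directly into code; this is a plain sketch of the piecewise definition above:

```python
import torch

def smooth_l1(x, beta=1.0):
    # x: regression error. Quadratic inside |x| < beta, linear outside, so a
    # smaller beta saturates the gradient of small errors earlier.
    abs_x = x.abs()
    return torch.where(abs_x < beta,
                       0.5 * abs_x ** 2 / beta,
                       abs_x - 0.5 * beta)
```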

Figure 4. We show the curves of (a) loss and (b) gradient of SmoothL1 Loss with different β. β is set to 1.0 by default in the R-CNN part.
As analyzed in Section 3.2, we need to fit the distribution change and adjust the regression loss function to compensate for the high quality samples. So we propose Dynamic SmoothL1 Loss (DSL) to change the shape of the loss function to gradually focus on high quality samples as follows:
$$
\mathrm{DSL}(x, \beta_{now}) =
\begin{cases}
0.5\,|x|^2 / \beta_{now}, & \text{if } |x| < \beta_{now} \\
|x| - 0.5\,\beta_{now}, & \text{otherwise}
\end{cases}
\tag{5}
$$
Similar to DLA, DSL changes the value of βnow according to the statistics of the regression labels, which reflect the localization accuracy. To be more specific, we first obtain the regression labels E between the proposals and their target ground-truths, then select the Kβ-th smallest value from E to update the βnow in Equation (5). Similarly, we update βnow every C iterations using the median of the Kβ-th smallest label in each batch. We choose the median instead of the mean used in classification because we find more outliers in the regression labels. Through this dynamic way, an appropriate βnow is adopted automatically as shown in Figure 3 (b), which better exploits the training samples and leads to a high quality regressor.
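Analogously to the DLA sketch above, the βnow bookkeeping might look as follows (taking magnitudes of the regression labels is an assumption of this sketch, since the offsets can be negative):

```python
import torch

class DSLBeta:
    """Sketch of the DSL update: record the Kβ-th smallest regression label
    per batch and set βnow to the median of the records every C iterations."""
    def __init__(self, k_beta=10, c=100, beta_init=1.0):
        self.k_beta, self.c, self.beta_now = k_beta, c, beta_init
        self.record = []

    def step(self, reg_labels, iteration):
        # reg_labels: 1-D tensor of regression labels E for the batch.
        errors = reg_labels.abs()
        if errors.numel() > 0:
            k = min(self.k_beta, errors.numel())
            self.record.append(torch.kthvalue(errors, k).values)  # Kβ-th smallest
        if iteration % self.c == 0 and self.record:
            # Median rather than mean: the regression labels have more outliers.
            self.beta_now = torch.stack(self.record).median().item()
            self.record = []
        return self.beta_now
```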

To summarize the whole method, we describe the proposed Dynamic R-CNN in Algorithm 1. Besides the proposals P and ground-truths G, Dynamic R-CNN has three hyperparameters: the IoU threshold top-k KI, the β top-k Kβ, and the update iteration count C. Note that compared with the baseline, we only introduce one additional hyperparameter, and we will show that the results are actually quite robust to the choice of these hyperparameters.

Algorithm 1 Dynamic R-CNN
Input: proposal set P, ground-truth set G;
IoU threshold top-k KI; β top-k Kβ; update iteration count C.
Output: trained object detector D.
1: Initialize the IoU threshold and SmoothL1 β as Tnow, βnow.
2: Build two empty sets SI, SE for recording the IoUs and the regression labels.
3: for i = 0 to max_iter do
4:   Obtain the matched IoUs I and regression labels E between proposals P and ground-truths G.
5:   Select the thresholds Ik, Ek based on KI, Kβ.
6:   Record the corresponding values: add Ik to SI and Ek to SE.
7:   if i mod C == 0 then
8:     Update the IoU threshold: Tnow = Mean(SI).
9:     Update the SmoothL1 β: βnow = Median(SE).
10:    Empty SI and SE.
11:  Train the network with Tnow and βnow.
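Putting Algorithm 1 together, a condensed training-loop skeleton might look like the following; it reuses the DLAThreshold and DSLBeta sketches from Sections 4.1 and 4.2, and `detector.match` / `detector.loss` are hypothetical hooks standing in for the second-stage matching and loss computation:

```python
def train_dynamic_rcnn(detector, loader, optimizer, max_iter,
                       k_i=75, k_beta=10, c=100):
    dla = DLAThreshold(k_i=k_i, c=c)
    dsl = DSLBeta(k_beta=k_beta, c=c)
    for i, (images, targets) in zip(range(max_iter), loader):
        # IoUs I and regression labels E of this batch; both are computed
        # by the detector anyway, so the bookkeeping adds almost no cost.
        ious, reg_labels = detector.match(images, targets)   # lines 4-6
        t_now = dla.step(ious, i)                            # lines 7-8, 10
        beta_now = dsl.step(reg_labels, i)                   # lines 7, 9-10
        # Line 11: train with the updated threshold and loss shape.
        loss = detector.loss(images, targets, iou_thr=t_now, beta=beta_now)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return detector
```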
5 Experiments
5.1 Dataset and Evaluation Metrics
Experimental results are mainly evaluated on the bounding box detection track of the challenging MS COCO [30] 2017 dataset. Following the common practice [29,15], we use the COCO train split (∼118k images) for training and report the ablation studies on the val split (5k images). We also submit our main results to the evaluation server for the final performance on the test-dev split, whose labels are not disclosed. The COCO-style Average Precision (AP) is chosen as the main evaluation metric, which averages AP across IoU thresholds from 0.5 to 0.95 with an interval of 0.05. We also include other metrics to better understand the behavior of the proposed method.
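As a quick reminder of how the headline metric is formed, COCO-style AP is the mean of the per-threshold APs over ten IoU cutoffs; the AP values below are placeholders purely for illustration:

```python
import numpy as np

iou_thresholds = np.arange(0.50, 1.00, 0.05)  # 0.50, 0.55, ..., 0.95
# Placeholder per-threshold AP values, for illustration only.
ap_per_threshold = np.linspace(0.60, 0.20, num=len(iou_thresholds))
coco_ap = ap_per_threshold.mean()             # the reported COCO-style AP
```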
5.2 Implementation Details
For fair comparisons, all experiments are implemented on PyTorch [35] and follow the settings in maskrcnn-benchmark and SimpleDet [4]. We adopt FPN-based Faster R-CNN [38,28] with a ResNet-50 [16] model pre-trained on ImageNet [9] as our baseline. All models are trained on the COCO 2017 train set and tested on the val set with the image short side at 800 pixels unless noted. Due to the scarcity of positives in the training procedure, we set the NMS threshold of the RPN to 0.85 instead of 0.7 for all the experiments.
5.3 Main Results
We compare Dynamic R-CNN with the corresponding baselines on the COCO test-dev set in Table 1. For fair comparisons, we report our re-implemented results.
First, we prove that our method can work on different backbones. Dynamic R-CNN achieves 39.1% AP with ResNet-50 [16], which is 1.8 points higher than the FPN-based Faster R-CNN baseline. With a stronger backbone like ResNet-101, Dynamic R-CNN also achieves consistent gains (+1.9 points).
Then, our dynamic design is also compatible with other training and testing skills. The results are consistently improved by progressively adding a 2× longer training schedule, multi-scale training (with an extra 1.5× longer training schedule), multi-scale testing and deformable convolution [52]. With the best combination, our Dynamic R-CNN achieves 49.2% AP, which is still 2.3 points higher than the Faster R-CNN baseline.
These results show the effectiveness and robustness of our method since it can work together with different backbones and multiple training and testing skills. It should also be noted that the performance gains are almost free.

Table 1. Comparisons with different baselines (our re-implementations) on the COCO test-dev set. "MST" and "*" denote multi-scale training and multi-scale testing, respectively. "2×" and "3×" are training schedules that extend the number of iterations by 2× or 3×.
5.4 Ablation Experiments
To show the effectiveness of each proposed component, we report the overall ablation studies in Table 2.

- Dynamic Label Assignment (DLA). DLA brings 1.2 points higher box AP than the ResNet-50-FPN baseline. To be more specific, the results on higher IoU metrics are consistently improved, especially the 2.9-point gain in AP90. This proves the effectiveness of our method in pushing the classifier to be more discriminative at higher IoU thresholds.
- Dynamic SmoothL1 Loss (DSL). DSL improves the box AP from 37.0 to 38.0. The results on higher IoU metrics like AP80 and AP90 are greatly improved, which validates the effectiveness of changing the loss function to compensate for the high quality samples during training. Moreover, as analyzed in Section 3.2, with DLA the quality of positives is further improved, so their contributions are reduced even more. Applying DSL on top of DLA therefore also brings reasonable gains, especially on the high quality metrics. To sum up, Dynamic R-CNN improves the baseline by 1.9 points AP and 5.5 points AP90.
- Illustration of dynamic training. To further illustrate the dynamics of the training procedure, we show the trends of the IoU threshold and the SmoothL1 β under different settings of our method in Figure 5. Here we clip the values of the IoU threshold and β to 0.4 and 1.0, respectively, at the beginning of training. Regardless of the specific values of KI and Kβ, the overall trend of the IoU threshold is increasing while that of the SmoothL1 β is decreasing during training. These results again verify that the proposed method works as expected.

Figure 5. Trends of (a) the IoU threshold and (b) the SmoothL1 β under different settings of our method. Clearly, the distributions change greatly during training.
5.5 Studies on the Effect of Hyperparameters
Ablation study on KI in DLA. Experimental results with different KI are shown in Table 3. Compared to the baseline, DLA can achieve consistent gains in AP regardless of the choice of KI. These results prove the universality of KI. Moreover, the performance on the various metrics changes under different KI. Choosing KI as 64/75/100 means that nearly 12.5%/15%/20% of the whole batch are selected as positives. Generally speaking, setting a smaller KI increases the quality of the selected samples, which leads to better accuracy on higher metrics like AP90. On the contrary, adopting a larger KI is more helpful for the metrics at lower IoUs. Finally, we find that setting KI to 75 achieves the best trade-off, and we use it as the default value for further experiments. All these ablations prove the effectiveness and robustness of the DLA part.

Ablation study on Kβ in DSL. As shown in Table 5, we first try different β on Faster R-CNN and empirically find that a smaller β leads to better performance. Then, experiments under different Kβ are provided to show the effects of Kβ . Regardless of the certain value of Kβ , DSL can achieve consistent improvements compared with various fine-tuned baselines. Specifically, with our best setting, DSL can bring 1.0 point higher AP than the baseline, and the improvement mainly lies in the high quality metrics like AP90 (+3.5 points). These experimental results prove that our DSL is effective in compensating for high quality samples and can lead to a better regressor due to the advanced dynamic design.

Ablation study on the iteration count C. For robustness, we update Tnow and βnow every C iterations using the statistics of the last interval. To show the effects of different iteration counts C, we try different values of C with the proposed method. As shown in Table 4, setting C to 20, 100 and 500 leads to very similar results, which proves the robustness to this hyperparameter.

Complexity and speed. As shown in Algorithm 1, the main computational complexity of our method lies in the calculations of IoUs and regression labels, which are already done by the original method. Thus the additional overhead only lies in calculating the mean or median of a short vector, which basically does not increase the training time. Moreover, since our method only changes the training procedure, obviously the inference speed will not be slowed down.
Our advantage over other high quality detectors like Cascade R-CNN is efficiency. Cascade R-CNN increases the training time and slows down the inference speed, while our method does not. Specifically, as shown in Table 6, Dynamic R-CNN achieves 13.9 FPS, which is ∼1.25 times faster than Cascade R-CNN (11.2 FPS) with a ResNet-50-FPN backbone. Moreover, with larger heads, the cascade manner further slows down the speed. Dynamic Mask R-CNN runs ∼1.5 times faster than Cascade Mask R-CNN. Note that the difference becomes more apparent as the backbone gets smaller (∼1.74 times faster, 13.6 FPS vs 7.8 FPS, with a ResNet-18 backbone and mask head), since the main overhead of Cascade R-CNN is the two additional heads.
5.6 Universality
Since the viewpoint of dynamic training is a general concept, we believe it can be adopted in different methods. To validate the universality, we further apply the dynamic design to Mask R-CNN with different backbones. As shown in Table 7, adopting the dynamic design not only brings ∼2.0 points higher box AP but also improves the instance segmentation results regardless of the backbone. Note that we only adopt the DLA and DSL, which are designed for object detection, so these results further demonstrate the universality and effectiveness of our dynamic design in improving the training procedure of current detectors.

5.7 Comparison with State-of-the-Arts
We compare Dynamic R-CNN with the state-of-the-art object detectors on the COCO test-dev set in Table 8. Considering that various backbones and training/testing settings are adopted by different detectors (including deformable convolutions [8,52], image pyramid schemes [41], large-batch Batch Normalization [36] and Soft-NMS [2]), we report the results of our method in two settings.
Dynamic R-CNN applies our method on FPN-based Faster R-CNN with ResNet-101 as backbone, and it can achieve 42.0% AP without bells and whistles. Dynamic R-CNN* adopts image pyramid scheme (multi-scale training and testing), deformable convolutions and Soft-NMS. It further improves the results to 50.1% AP, outperforming all the previous detectors.
