[TSP-FCOS] Rethinking Transformer-based Set Prediction for Object Detection
This paper examines why DETR, the Transformer-based object detection framework, converges slowly during training, and proposes two improved methods to speed up its convergence. The authors first attribute the slow convergence to the instability of the bipartite matching underlying the Hungarian loss and to the cross-attention mechanism. Their experiments show that the matching instability accounts for only part of the slow convergence, mainly in the early training stage, while the sparsity of the cross-attention maps keeps increasing during training and still has not stabilized even after 100 epochs. They also find that removing the cross-attention module significantly speeds up DETR's training without degrading its detection performance.
To mitigate these issues, the authors propose two improved methods: TSP-FCOS and TSP-RCNN. TSP-FCOS combines an FCOS-style feature selector with a Transformer encoder, selecting important multi-scale feature points and accelerating training with a new bipartite matching scheme; TSP-RCNN instead uses a region proposal network (RPN) with RoIAlign to further improve detection accuracy and refines the positional encoding scheme. Experiments show that both methods converge much faster than DETR without sacrificing detection accuracy.

Table of Contents
- 1. Motivation
- 2. Contribution
- 3. What Causes the Slow Convergence of DETR?
  - 3.1 Does Instability of the Bipartite Matching Affect Convergence?
  - 3.2 Are the Attention Modules the Main Cause?
  - 3.3 Does DETR Really Need Cross-attention?
- 4. The Proposed Methods
  - 4.1 TSP-FCOS
  - 4.2 TSP-RCNN
- 5. Experiments
  - Ablation study
1. Motivation
DETR is a recent Transformer-based approach that frames object detection as a set prediction task and achieves state-of-the-art performance; however, it requires a very long training schedule before it converges.
Building on this observation, the authors investigate the optimization difficulties in training DETR: Faster RCNN reaches its expected accuracy within about 30 training epochs, whereas DETR requires a very long schedule (500 epochs). How, then, can the training efficiency of DETR-like Transformer-based detectors be improved to achieve faster convergence? This is the central research question and the primary focus of this study.
The authors hypothesize that the cross-attention module in the Transformer decoder and the instability of the bipartite matching in the Hungarian loss are the key contributors to DETR's slow convergence.
2. Contribution
To address the issues in the Hungarian loss and in the Transformer cross-attention mechanism, the authors propose two models: TSP-FCOS (Transformer-based Set Prediction with FCOS) and TSP-RCNN (Transformer-based Set Prediction with RCNN).
A new Feature of Interest (FoI) selection module is integrated into TSP-FCOS to help the Transformer encoder handle multi-level features.
To address the instability of the bipartite matching in the Hungarian loss, a new bipartite matching scheme is developed for both models, aimed at accelerating convergence during training.
3. What Causes the Slow Convergence of DETR?
3.1 Does Instability of the Bipartite Matching Affect Convergence?
The authors attribute the instability of the bipartite matching in the Hungarian loss to two factors:
- the initialization of the bipartite matching is inherently random;
- the matching is noisy and can change across training epochs.
To assess the effect of this factor, the authors use a matching distillation technique: a pre-trained DETR serves as the teacher model, and its bipartite matching results are used as the ground-truth label assignment for the student model. To keep the teacher's matching deterministic, its random components such as dropout and batch normalization are disabled.
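A minimal sketch of how such matching distillation could be set up in PyTorch, assuming a DETR-style model; `hungarian_cost` and `set_prediction_loss` are hypothetical helpers, not the authors' code:

```python
import torch
from scipy.optimize import linear_sum_assignment

@torch.no_grad()
def distilled_assignment(teacher, images, targets):
    """Use a pre-trained teacher's bipartite matching as a fixed label assignment."""
    teacher.eval()                          # disable dropout etc. so the matching is deterministic
    preds = teacher(images)                 # e.g. {'logits': [B, Q, C], 'boxes': [B, Q, 4]}
    assignments = []
    for b, tgt in enumerate(targets):
        # hungarian_cost is a hypothetical helper building the [Q, num_gt] matching cost
        cost = hungarian_cost(preds['logits'][b], preds['boxes'][b], tgt)
        q_idx, gt_idx = linear_sum_assignment(cost.cpu().numpy())
        assignments.append((q_idx, gt_idx))
    return assignments

# The student is then trained against this fixed assignment instead of re-matching
# its own (noisy) predictions every iteration:
#   loss = set_prediction_loss(student(images), targets, assignments)
```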
Figure 1 compares the original DETR and the matching-distilled DETR over the first 25 epochs. Matching distillation accelerates convergence in the first few epochs, but after about 15 epochs its advantage is no longer significant.

This suggests that the instability of the bipartite matching accounts for only part of DETR's slow convergence, mainly in the early stage of training.
3.2 Are the Attention Modules the Main Cause?
Transformer attention maps are close to uniform at initialization and gradually become sparse as training proceeds. In BERT, replacing part of the attention heads with sparsity-oriented modules based on convolutions has been shown to accelerate training effectively.
The authors therefore focus on the sparsification dynamics of the cross-attention module, a crucial component of the decoder: in the decoder, it is through cross-attention that the object queries extract object information from the encoder output.
If the cross-attention does not work well, the object queries cannot extract accurate context information from the image, which can lead to poor localization.
Since attention maps can be interpreted as probability distributions, the authors use negative entropy to quantify their sparsity. Concretely, for an n \times m attention map a, where a_{i,j} is the attention weight from source position i to target position j, the sparsity of source position i \in [n] is computed as \frac{1}{m} \sum_{j=1}^{m} P(a_{i,j}) \log P(a_{i,j}), treating a_{i,:} as a probability distribution over the target positions. These per-position sparsity values are then averaged over all attention heads and all source positions of each layer, which yields a single sparsity value per layer and makes it easier to compare attention patterns across layers.
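As an illustration, a small PyTorch sketch of this negative-entropy sparsity measure (shapes and names are assumptions, not the paper's code):

```python
import torch

def attention_sparsity(attn: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """attn: [heads, n, m] attention maps whose rows sum to 1 over the m target positions."""
    p = attn.clamp_min(eps)
    # negative entropy of source position i:  (1/m) * sum_j p_ij * log p_ij
    per_source = (p * p.log()).sum(dim=-1) / p.shape[-1]   # [heads, n]
    # average over all heads and all source positions of the layer
    return per_source.mean()
```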
Figure 2 shows that the sparsity of the cross-attention keeps increasing throughout training and, notably, has still not plateaued even after 100 epochs. This suggests that the slowly sparsifying cross-attention module is a key factor in DETR's slow convergence.

3.3 Does DETR Really Need Cross-attention?
The next question is: can the cross-attention module be removed from DETR to obtain faster convergence while preserving its prediction capability in object detection?
To answer this, the authors build an encoder-only version of DETR and analyze its convergence behavior against the original DETR.
In the encoder-only version, the encoder output is used directly for object detection: each encoded feature is fed to a detection head to predict a result. Figure 3 compares the original DETR, encoder-only DETR, TSP-FCOS, and TSP-RCNN.

Figure 4 shows the AP curves of the original DETR and the encoder-only DETR. The first plot shows the overall AP of the two models, which is almost identical; this is encouraging because it suggests that the cross-attention part can be removed. Moreover, the encoder-only DETR performs better on small objects and some medium objects, while scoring slightly lower on large objects.
A possible explanation is that a large object contains many potentially matchable feature points, which is difficult to handle for the encoder-only DETR, where each feature point makes its own prediction.

4. The Proposed Methods

4.1. TSP-FCOS
TSP-FCOS combines the strengths of FCOS and DETR. On top of the FCOS design, it adds a key new component, the Feature of Interest (FoI) selector, which guides the Transformer encoder over multi-level features, and it introduces a new bipartite matching scheme to improve overall performance. The upper half of Figure 5 shows the network architecture.
- Backbone and FPN
- Feature extraction subnets
Two subnets (heads) follow the FPN: an auxiliary subnet and a classification subnet. Their outputs are concatenated and then filtered by the FoI classifier.
Subnet structure:
Both the classification subnet and the auxiliary subnet consist of four consecutive 3×3 convolutional layers with 256 channels and group normalization [37].
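For illustration, a minimal PyTorch sketch of such a subnet (the number of groups for group normalization and the activation are assumptions, not values from the paper):

```python
import torch.nn as nn

def make_subnet(channels: int = 256, num_convs: int = 4) -> nn.Sequential:
    """Four 3x3 conv layers with 256 channels and group normalization."""
    layers = []
    for _ in range(num_convs):
        layers += [
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GroupNorm(32, channels),   # group normalization [37]
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)
```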
- Feature of Interest (FoI) classifier
To keep the self-attention computation effective and efficient, the authors design a binary classifier that selects a limited subset of features to focus on. This binary FoI classifier is trained with the FCOS ground-truth label assignment rule.
After FoI classification, the top 700 highest-scoring feature positions are selected as FoIs and fed into the Transformer encoder.
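A sketch of this selection step, assuming the multi-level feature maps have been flattened into a sequence of N positions with d-dimensional features (names are illustrative):

```python
import torch

def select_fois(features: torch.Tensor, foi_scores: torch.Tensor, k: int = 700):
    """features: [N, d]; foi_scores: [N] foreground scores from the FoI classifier."""
    k = min(k, foi_scores.numel())
    top = torch.topk(foi_scores, k).indices   # indices of the k highest-scoring positions
    return features[top], top                 # FoIs (and their positions) for the encoder
```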
- Transformer encoder
Input:
After FoI selection, the input to the Transformer encoder is the set of FoIs together with their positional encodings.
Output:
A shared feed-forward network is applied to the encoder output and predicts a category label (including "no object") and a bounding box for each FoI.
- Positional encoding
The positional encoding of an FoI located at (x, y) is defined as the concatenation [PE(x) : PE(y)], where ":" denotes vector concatenation. PE follows the standard sinusoidal encoding of the Transformer (Eq. 3):
PE(x)_{2i} = \sin\left(x / 10000^{2i / d_{\text{model}}}\right), \quad PE(x)_{2i+1} = \cos\left(x / 10000^{2i / d_{\text{model}}}\right),
where d_{\text{model}} is the feature dimension of the FoIs (Features of Interest).
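A sketch of this concatenated sinusoidal encoding, assuming normalized coordinates and an even per-coordinate dimension (not necessarily the authors' exact implementation):

```python
import torch

def sine_encoding(v: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard Transformer sinusoidal encoding of a (normalized) scalar coordinate."""
    i = torch.arange(dim // 2, dtype=torch.float32)
    angles = v.unsqueeze(-1) / (10000 ** (2 * i / dim))
    return torch.cat([angles.sin(), angles.cos()], dim=-1)      # [..., dim]

def foi_positional_encoding(x: torch.Tensor, y: torch.Tensor, d_model: int) -> torch.Tensor:
    """[PE(x) : PE(y)] with d_model // 2 dimensions per coordinate."""
    return torch.cat([sine_encoding(x, d_model // 2),
                      sine_encoding(y, d_model // 2)], dim=-1)  # [..., d_model]
```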

- Faster set prediction training
Following the FCOS design, a feature point can be assigned to a ground-truth object only if the point lies within that object's bounding box and at an appropriate level of the feature pyramid. The cost-based bipartite matching (Eq. 2) between the detected results and the ground-truth objects is then restricted to these assignable pairs, and this restricted matching is used as the basis for computing the Hungarian loss (Eq. 1).
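A sketch of this assignability check (the per-level scale ranges are an assumption patterned after FCOS, not values from the paper):

```python
def assignable(x, y, level, gt_box, level_ranges):
    """A feature point (x, y) on FPN level `level` may be matched to `gt_box`
    only if it lies inside the box and the box scale fits that level's range."""
    x1, y1, x2, y2 = gt_box
    l, t, r, b = x - x1, y - y1, x2 - x, y2 - y
    inside = min(l, t, r, b) > 0
    lo, hi = level_ranges[level]        # e.g. {3: (0, 64), 4: (64, 128), ...}
    return inside and lo <= max(l, t, r, b) <= hi
```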

4.2 TSP-RCNN
Compared with TSP-FCOS, TSP-RCNN demands more computational resources but can detect objects with higher accuracy.
- Region proposal network
We imitate the design of Faster RCNN [30] and employ a Region Proposal Network (RPN) to generate a set of Regions of Interest (RoIs), which are then further refined.
Unlike the FoIs in TSP-FCOS, each RoI produced by the RPN comes with not only an objectness score but also a predicted bounding box. RoIAlign is used to extract the RoI features from the multi-level feature maps; these features are then flattened and passed through a fully connected layer before being fed into the Transformer encoder.
- Positional encoding
Each RoI proposal is represented by (cx, cy, w, h), where (cx, cy) \in [0, 1]^2 is the normalized center and (w, h) \in [0, 1]^2 are the normalized width and height. The positional encoding of each RoI is defined as [PE(cx) : PE(cy) : PE(w) : PE(h)].
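Reusing the hypothetical sine_encoding helper from the TSP-FCOS sketch above, the RoI encoding could be formed as follows (assuming d_model is divisible by four):

```python
# [PE(cx) : PE(cy) : PE(w) : PE(h)], d_model // 4 dimensions per component
roi_pe = torch.cat([sine_encoding(c, d_model // 4) for c in (cx, cy, w, h)], dim=-1)
```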
- Faster set prediction training
Unlike TSP-FCOS, the authors adopt the ground-truth label assignment rule of Faster RCNN for the faster set prediction training of TSP-RCNN.
Specifically, a proposal can be assigned to a ground-truth object only if the intersection-over-union (IoU) between their bounding boxes exceeds 0.5.
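A small sketch of this IoU-based rule, assuming boxes in (x1, y1, x2, y2) format:

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def assignable(proposal, gt_box, thresh=0.5):
    """A proposal may be matched to a ground-truth object only if IoU > 0.5."""
    return iou(proposal, gt_box) > thresh
```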
5. Experiments

Ablation study
- Compatibility with deformable convolutions

- Analysis of convergence

- Comparison with State-of-the-Arts

