
YOLO9000: Better, Faster, Stronger (Paper Translation)


Abstract

We introduce YOLO9000, a state-of-the-art, real-time object detection system that can detect over 9000 object categories. First we propose various improvements to the YOLO detection method, both novel and drawn from prior work. The improved model, YOLOv2, is state-of-the-art on standard detection tasks like PASCAL VOC and COCO. Using a novel, multi-scale training method, the same YOLOv2 model can run at varying sizes, offering an easy tradeoff between speed and accuracy. At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007. At 40 FPS, YOLOv2 gets 78.6 mAP, outperforming state-of-the-art methods like Faster R-CNN with ResNet and SSD while still running significantly faster. Finally, we propose a method to jointly train on object detection and classification. Using this method we train YOLO9000 simultaneously on the COCO detection dataset and the ImageNet classification dataset. Our joint training allows YOLO9000 to predict detections for object classes that don't have labelled detection data. We validate our approach on the ImageNet detection task. YOLO9000 gets 19.7 mAP on the ImageNet detection validation set despite only having detection data for 44 of the 200 classes. On the 156 classes not in COCO, YOLO9000 gets 16.0 mAP. But YOLO can detect more than just 200 classes; it predicts detections for more than 9000 different object categories, and it still runs in real-time.

1 Introduction

General-purpose object detection should be fast and accurate, and it should be able to recognize a wide variety of objects. Since the introduction of neural networks, detection frameworks have become increasingly fast and accurate. However, most detection methods are still constrained to a small set of object categories.

Current object detection datasets are limited compared to datasets for classification and tagging. The most common detection datasets contain thousands to hundreds of thousands of images with dozens to hundreds of tags [3][10][2]. Classification datasets, on the other hand, have millions of images with tens or hundreds of thousands of categories [20][2].

We would like detection to scale to the level of object classification. However, labelling images for detection is far more expensive than labelling them for classification or tagging (tags are often supplied by users for free). As a result, we are unlikely to see detection datasets reach the same scale as classification datasets in the near future.

We propose a new method that harnesses the large amount of classification data we already have and uses it to expand the scope of current detection systems. Our method uses a hierarchical view of object classification that allows us to combine distinct datasets together. We also propose a joint training algorithm that lets us train object detectors on both detection and classification data: labelled detection images teach the model to precisely localize objects, while classification images increase its vocabulary and robustness. Using this method we train YOLO9000, a real-time object detector that can detect over 9,000 different object categories. This is built on two key advances: first, we improve the base YOLO system into YOLOv2, a state-of-the-art, real-time detector; second, we use our dataset combination method and joint training algorithm to train a model on more than 9,000 classes from ImageNet as well as detection data from COCO. All code and pre-trained models are available online at http://pjreddie.com/yolo9k/.

2 Better

YOLO suffers from a variety of shortcomings relative to state-of-the-art detection systems. Error analysis of YOLO compared to Fast R-CNN shows that YOLO makes a significant number of localization errors. Furthermore, YOLO has relatively low recall compared to region-proposal-based methods. Thus we focus mainly on improving recall and localization while maintaining classification accuracy. Computer vision generally trends towards larger, deeper networks [6][18][17]; better performance often comes from training larger networks or ensembling models, at the cost of much more computation. With YOLOv2, however, we want a detector that is more accurate while still being fast. Instead of scaling up the network, we simplify it and make the representation easier to learn, pooling ideas from past work with our own novel concepts. A summary of the results can be found in Table 2.

Batch Normalization. Batch normalization leads to significant improvements in convergence while eliminating the need for other forms of regularization [7]. By adding batch normalization on all of the convolutional layers in YOLO we get an improvement of more than 2% in mAP. Batch normalization also helps regularize the model; with it, dropout can be removed without overfitting.
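
As a rough illustration of the conv + batch-norm pattern described above, here is a minimal PyTorch-style sketch; the helper name and channel counts are illustrative, not the paper's Darknet implementation:

```python
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, kernel_size):
    """Convolution followed by batch norm and leaky ReLU.

    With batch norm on every conv layer the conv bias is redundant and,
    as noted above, dropout can be removed without overfitting."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size,
                  padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

# Illustrative first stages of a Darknet-style backbone.
stem = nn.Sequential(
    conv_bn_leaky(3, 32, 3),
    nn.MaxPool2d(2, 2),
    conv_bn_leaky(32, 64, 3),
    nn.MaxPool2d(2, 2),
)
```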

High Resolution Classifier. All state-of-the-art detection methods use classifiers pre-trained on ImageNet [16]. Starting with AlexNet, most classifiers operate on input images smaller than 256 × 256 [8]. The original YOLO trains the classifier network at 224 × 224 and then increases the resolution to 448 for detection. This means the network has to simultaneously switch to learning object detection and adjust to the new input resolution.

To train YOLOv2, we initially fine-tuned a classification network at the full 448×448 resolution for 10 epochs on ImageNet. This allows the network sufficient time to adjust its filters, enhancing their performance on higher resolution inputs. Subsequently, we applied fine-tuning to this optimized network for detection tasks. The high-resolution classification network achieved a notable improvement, increasing our mean average precision (mAP) by nearly 4%.

Convolutional With Anchor Boxes. YOLO predicts the coordinates of bounding boxes directly, using fully connected layers on top of the convolutional feature extractor. Faster R-CNN, in contrast, predicts bounding boxes using hand-picked priors rather than predicting coordinates directly. Using only convolutional layers, the region proposal network (RPN) in Faster R-CNN predicts offsets and confidences for anchor boxes. Because the prediction layer is convolutional, the RPN predicts these offsets at every location in the feature map. Predicting offsets instead of coordinates simplifies the problem and makes it easier for the network to learn.

We remove the fully connected layers from YOLO and use anchor boxes to predict bounding boxes. First we eliminate one pooling layer to make the output of the network's convolutional layers higher resolution. We also shrink the network to operate on 416 × 416 input images instead of 448 × 448. We do this because we want an odd number of locations in the feature map so there is a single center cell; objects, especially large ones, tend to occupy the center of an image, so it helps to have one location right at the center to predict them rather than four nearby locations. Since YOLO's convolutional layers downsample the image by a factor of 32, an input of 416 yields a 13 × 13 output feature map.

When we move to anchor boxes we also decouple the class prediction mechanism from the spatial location and instead predict class and objectness for every anchor box. Following the original YOLO, the objectness prediction still estimates the IOU of the ground truth and the proposed box, while the class predictions give the conditional probability of each class given that there is an object. Using anchor boxes we see a small decrease in accuracy: YOLO only predicts 98 boxes per image, but with anchor boxes our model predicts more than a thousand. Without anchor boxes our intermediate model gets 69.5 mAP with a recall of 81%; with anchor boxes it gets 69.2 mAP with a recall of 88%. Even though the mAP decreases slightly, the increase in recall means our model has more room to improve.

Dimension Clusters. We encounter two issues when using anchor boxes with YOLO. The first is that the box dimensions are hand picked. The network can learn to adjust the boxes appropriately, but if we pick better priors for the network to start with, we make it easier for the network to learn to predict good detections.

Instead of choosing priors by hand, we run k-means clustering on the training set bounding boxes to automatically find good priors. If we use standard k-means with Euclidean distance, larger boxes generate more error than smaller boxes. However, what we really want are priors that lead to good IOU scores, which is independent of the size of the box. Thus, for our distance metric we use:

d(box, centroid) = 1 - IOU(box, centroid)
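
A minimal sketch of this clustering step, assuming ground-truth boxes are given as (width, height) pairs and compared as if they shared the same center; `kmeans_anchors` and its helper are illustrative names, not the authors' code:

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (w, h) boxes and (w, h) centroids, assuming shared centers."""
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    """k-means over box dimensions with d(box, centroid) = 1 - IOU."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)  # min d == max IOU
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

# boxes: an N x 2 array of ground-truth (width, height) pairs collected from
# the training set, e.g. normalized by image size.
```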

Figure 2: Clustering box dimensions on VOC and COCO. We run k-means clustering on the dimensions of bounding boxes to get good priors for our model. The left plot shows the average IOU obtained with various choices of k; we find that k = 5 gives a good tradeoff between recall and complexity of the model. The right plot shows the relative centroids for VOC and COCO. Both sets of priors favour thinner, taller boxes, while COCO has greater variation in size than VOC.

We run k-means for various values of k and plot the average IOU with the closest centroid, as shown in Figure 2. We choose k = 5 as a good tradeoff between model complexity and high recall. The cluster centroids are significantly different from hand-picked anchor boxes: there are fewer short, wide boxes and more tall, thin boxes. We compare the average IOU to the closest prior of our clustering strategy and of the hand-picked anchor boxes in Table 1. At only 5 priors the centroids perform similarly to 9 anchor boxes, with an average IOU of 61.0 compared to 60.9. If we use 9 centroids we see a much higher average IOU. This indicates that using k-means to generate our bounding box priors starts the model off with a better representation and makes the task easier to learn.

Table 1: Average IOU of boxes to their closest priors on VOC 2007. The average IOU of objects on VOC 2007 to their closest, unmodified prior is computed for different generation methods. Clustering gives much better results than using hand-picked priors.

Direct location prediction. The second issue we encounter when using anchor boxes with YOLO is model instability, especially during early iterations. Most of the instability comes from predicting the (x, y) locations of the box. In region proposal networks the network predicts values tx and ty, and the (x, y) center coordinates are calculated as:

x = (tx * wa) + xa
y = (ty * ha) + ya

For example, a prediction of tx = 1 shifts the box to the right by the width of the anchor box, while a prediction of tx = -1 shifts it to the left by the same amount. This formulation is unconstrained, so any anchor box can end up at any point in the image, regardless of which location predicted the box. With random initialization the model takes a long time to stabilize to predicting sensible offsets. Instead of predicting offsets, we follow the approach of YOLO and predict location coordinates relative to the location of the grid cell. This bounds the ground truth to fall between 0 and 1, and we use a logistic activation to constrain the network's predictions to the same range.

The network predicts 5 bounding boxes at each cell in the output feature map. For each bounding box the network predicts 5 values: tx, ty, tw, th, and to. If the cell is offset from the top left corner of the image by (cx, cy) and the bounding box prior has width pw and height ph, then the predictions correspond to:

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw * e^tw
bh = ph * e^th
Pr(object) * IOU(b, object) = σ(to)
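
A small sketch of how these equations turn raw network outputs into a box, assuming coordinates are expressed in grid-cell units; the function name and sample numbers are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, to, cx, cy, pw, ph):
    """Decode one prediction for the cell at offset (cx, cy) from the image
    top-left corner (in grid units), with prior dimensions (pw, ph)."""
    bx = sigmoid(tx) + cx        # center x stays inside the responsible cell
    by = sigmoid(ty) + cy        # center y
    bw = pw * np.exp(tw)         # width as a multiple of the prior width
    bh = ph * np.exp(th)         # height as a multiple of the prior height
    conf = sigmoid(to)           # Pr(object) * IOU(b, object)
    return bx, by, bw, bh, conf

# Example: cell (4, 7) of the 13 x 13 grid with a 1.5 x 2.3 (grid units) prior.
print(decode_box(0.2, -0.1, 0.4, 0.1, 1.3, cx=4, cy=7, pw=1.5, ph=2.3))
```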

Since we constrain the location prediction, the parametrization is easier to learn, which makes the network more stable. Using dimension clusters along with directly predicting the bounding box center location improves YOLO by almost 5% over the version with anchor boxes.

Fine-Grained Features. The modified YOLO predicts detections on a 13 × 13 feature map. While this is sufficient for large objects, finer-grained features can help when localizing smaller objects. Faster R-CNN and SSD both run their proposal networks on feature maps at several scales to cover a range of resolutions. We take a different approach, simply adding a passthrough layer that brings in features from an earlier layer at 26 × 26 resolution.

The passthrough layer concatenates the higher resolution features with the low resolution features by stacking adjacent features into different channels instead of spatial locations, similar to the identity mappings in ResNet. This turns the 26 × 26 × 512 feature map into a 13 × 13 × 2048 feature map, which can then be concatenated with the original features. Our detector runs on top of this expanded feature map so that it has access to fine-grained features. This gives a modest 1% performance increase.
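
A possible implementation of the passthrough reorganization in PyTorch, under the assumption that it behaves like a space-to-depth rearrangement with stride 2; the paper does not spell out the exact memory layout, so this is only a sketch:

```python
import torch

def passthrough(x, stride=2):
    """Stack each 2x2 block of spatial neighbours into the channel dimension:
    (N, C, H, W) -> (N, C * stride**2, H // stride, W // stride)."""
    n, c, h, w = x.shape
    x = x.view(n, c, h // stride, stride, w // stride, stride)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(n, c * stride * stride, h // stride, w // stride)

fine = torch.randn(1, 512, 26, 26)      # earlier, higher-resolution features
coarse = torch.randn(1, 1024, 13, 13)   # final 13 x 13 features
merged = torch.cat([passthrough(fine), coarse], dim=1)  # 1 x 3072 x 13 x 13
```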

Multi-Scale Training. The original YOLO uses an input resolution of 448 × 448; with the addition of anchor boxes we changed this to 416 × 416. However, since our model only uses convolutional and pooling layers, it can be resized on the fly. We want YOLOv2 to be robust to running on images of different sizes, so we train this into the model. Instead of fixing the input image size, we change the network every few iterations: every 10 batches the network randomly chooses a new image dimension. Since the model downsamples by a factor of 32, we pull from multiples of 32: {320, 352, ..., 608}, so the smallest option is 320 × 320 and the largest is 608 × 608. We resize the network to that dimension and continue training.
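
A hedged sketch of this resizing schedule, assuming image tensors are simply interpolated to the newly chosen resolution every 10 batches; the original Darknet code may handle resizing differently:

```python
import random
import torch
import torch.nn.functional as F

SIZES = list(range(320, 609, 32))   # {320, 352, ..., 608}, multiples of 32

def maybe_resize(images, batch_idx, current_size):
    """Pick a new square input resolution every 10 batches and resize the
    batch to it; between picks the previous resolution is kept."""
    if batch_idx % 10 == 0:
        current_size = random.choice(SIZES)
    images = F.interpolate(images, size=(current_size, current_size),
                           mode="bilinear", align_corners=False)
    return images, current_size

batch, size = torch.randn(2, 3, 416, 416), 416
batch, size = maybe_resize(batch, batch_idx=0, current_size=size)
```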

This regime forces the network to learn to predict well across a variety of input dimensions, which means the same network can predict detections at different resolutions. The network runs faster at smaller sizes, so YOLOv2 offers an easy tradeoff between speed and accuracy. At low resolutions YOLOv2 operates as a cheap, fairly accurate detector: at 288 × 288 it runs at more than 90 FPS with mAP almost as good as Fast R-CNN, making it ideal for smaller GPUs, high framerate video, or multiple video streams. At high resolution YOLOv2 is a state-of-the-art detector with 78.6 mAP on VOC 2007 while still operating above real-time speeds. See Table 3 for a comparison of YOLOv2 with other frameworks on VOC 2007.

Table 3: Detection frameworks on PASCAL VOC 2007. YOLOv2 is faster and more accurate than prior detection methods, and it can run at different resolutions for an easy tradeoff between speed and accuracy. Each YOLOv2 entry is actually the same trained model with the same weights, just evaluated at a different size. All timing information is measured on a GeForce GTX Titan X (the original model, not the Pascal model).

Further Experiments. We train YOLOv2 for detection on VOC 2012. Table 4 shows the comparative performance of YOLOv2 versus other state-of-the-art detection systems: YOLOv2 achieves 73.4 mAP while running far faster than competing methods. We also train on COCO and compare against other methods in Table 5. On the VOC metric (IOU = .5), YOLOv2 gets 44.0 mAP, comparable to SSD and Faster R-CNN.

3 Faster

We want detection to be accurate, but we also want it to be fast. Most applications of detection, such as robotics or self-driving cars, rely on low-latency predictions. To maximize performance we design YOLOv2 to be fast from the ground up. Most detection frameworks rely on VGG-16 as the base feature extractor [17]. VGG-16 is a powerful, accurate classification network, but it is needlessly complex: its convolutional layers require 30.69 billion floating point operations for a single pass over one 224 × 224 image.

The YOLO framework uses a custom network based on the GoogLeNet architecture [19]. This network is faster than VGG-16, needing only 8.52 billion operations for a forward pass. However, its accuracy is slightly worse than VGG-16: for single-crop, top-5 accuracy at 224 × 224, YOLO's custom model scores 88.0% on ImageNet compared to 90.0% for VGG-16.

Darknet-19. We propose a new classification model to serve as the base of YOLOv2. Our model builds on prior work on network design as well as common knowledge in the field. Similar to the VGG models, we use mostly 3 × 3 filters and double the number of channels after every pooling step [17]. Following the work on Network in Network (NIN), we use global average pooling to make predictions as well as 1 × 1 filters to compress the feature representation between the 3 × 3 convolutions [9]. We use batch normalization to stabilize training, speed up convergence, and regularize the model [7]. The final model, called Darknet-19, has 19 convolutional layers and 5 maxpooling layers; see Table 6 for a full description. Darknet-19 requires only 5.58 billion operations to process an image, yet achieves 72.9% top-1 and 91.2% top-5 accuracy on ImageNet.

Training for classification. We train the network on the standard ImageNet 1000-class classification dataset for 160 epochs using stochastic gradient descent with a starting learning rate of 0.1, polynomial rate decay with a power of 4, weight decay of 0.0005, and momentum of 0.9, all within the Darknet neural network framework [13]. During training we use standard data augmentation tricks, including random crops, rotations, and hue, saturation, and exposure shifts. As discussed above, after the initial training on 224 × 224 images we fine-tune the network at the larger size of 448 for only 10 epochs, starting at a learning rate of 10^-3 with the other parameters unchanged. At this higher resolution the network achieves a top-1 accuracy of 76.5% and a top-5 accuracy of 93.3%.
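
Assuming the "polynomial rate decay with a power of 4" follows Darknet's usual poly policy, the schedule would look roughly like this (a sketch, not the framework's actual code):

```python
def poly_lr(base_lr, step, max_steps, power=4):
    """Polynomial decay: lr = base_lr * (1 - step / max_steps) ** power."""
    return base_lr * (1.0 - step / max_steps) ** power

# Starting learning rate 0.1, decayed over the whole run (step counts are
# illustrative; the real number of iterations depends on the batch size).
for step in (0, 25_000, 50_000, 100_000):
    print(step, round(poly_lr(0.1, step, max_steps=100_000), 6))
```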

Training for detection. We modify this network for detection by removing the last convolutional layer and instead adding three 3 × 3 convolutional layers with 1024 filters each, followed by a final 1 × 1 convolutional layer with the number of outputs we need for detection. For VOC we predict 5 boxes with 5 coordinates each and 20 classes per box, so we use 125 filters. We also add a passthrough layer from the final 3 × 3 × 512 layer to the second-to-last convolutional layer so that the model can use fine-grained features. We train the network for 160 epochs with a starting learning rate of 10^-3, dividing it by 10 at 60 and 90 epochs.
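
The 125-filter figure follows directly from the output layout; a tiny helper makes the arithmetic explicit (illustrative code, not part of the original implementation):

```python
def detection_filters(num_anchors, num_classes, num_coords=5):
    """Output filters of the final 1x1 layer: each anchor predicts the five
    values (tx, ty, tw, th, to) plus one score per class."""
    return num_anchors * (num_coords + num_classes)

print(detection_filters(5, 20))   # VOC:  5 * (5 + 20) = 125
print(detection_filters(5, 80))   # COCO: 5 * (5 + 80) = 425
```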

We use a weight decay of 0.0005 and momentum of 0.9. We use data augmentation similar to YOLO and SSD, with random crops, color shifting, and so on. We use the same training strategy on COCO and VOC.

4 Stronger

We propose a mechanism for jointly training on classification and detection data. Our method uses images labelled for detection to learn detection-specific information such as bounding box coordinate prediction and objectness, as well as how to classify common objects. It uses images with only class labels to expand the number of categories it can detect. During training we mix images from both the detection and classification datasets. When the network sees an image labelled for detection, we backpropagate the full YOLOv2 loss function. When it sees a classification image, we only backpropagate loss from the classification-specific parts of the architecture.

This approach presents a few challenges. Detection datasets have only common objects and general labels, like "dog" or "boat". Classification datasets have a much wider and deeper range of labels; ImageNet, for example, has more than a hundred breeds of dog, including "Norfolk terrier", "Yorkshire terrier", and "Bedlington terrier". If we want to train on both datasets we need a coherent way to merge these labels. Most approaches to classification use a softmax layer across all possible categories to compute the final probability distribution, which assumes the classes are mutually exclusive. This presents problems for combining datasets: for example, you would not want to combine ImageNet and COCO this way, because the classes "Norfolk terrier" and "dog" are not mutually exclusive. We could instead use a multi-label model, which does not assume mutual exclusion, but that ignores all the structure we do know about the data, for instance that all of the COCO classes are mutually exclusive.

Hierarchical classification. ImageNet labels are pulled from WordNet, a language database that structures concepts and how they relate [12]. In WordNet, "Norfolk terrier" and "Yorkshire terrier" are both hyponyms of "terrier", which is a type of "hunting dog", which is a type of "dog", which is a "canine", and so on. Most approaches to classification assume a flat label structure, but for combining datasets this hierarchy is exactly what we need. WordNet is structured as a directed graph rather than a tree because language is complex; for example, a "dog" is both a type of "canine" and a type of "domestic animal", which are both synsets in WordNet. Instead of using the full graph structure, we simplify the problem by building a hierarchical tree from the concepts in ImageNet. To build this tree we examine the visual nouns in ImageNet and look at their paths through the WordNet graph to the root node, in this case "physical object". Many synsets only have one path through the graph, so first we add all of those paths to our tree. Then we iteratively examine the concepts we have left and add the paths that grow the tree by as little as possible: if a concept has multiple paths to the root, we choose the one that adds the fewest edges.

The final result is WordTree, a hierarchical model of visual concepts. To perform classification with WordTree we predict conditional probabilities at every node, giving the probability of each hyponym of a synset given that synset. For example, at the "terrier" node we predict:

Pr(Norfolk terrier | terrier)
Pr(Yorkshire terrier | terrier)
Pr(Bedlington terrier | terrier)
...

If we want to compute the absolute probability for a particular node, we simply follow the path from that node up to the root of the tree and multiply the conditional probabilities along the way. So, to know whether a picture contains a Norfolk terrier, we compute:

Pr(Norfolk terrier) = Pr(Norfolk terrier | terrier) * Pr(terrier | hunting dog) * ... * Pr(mammal | animal) * Pr(animal | physical object)
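
A minimal sketch of that computation over a toy fragment of the tree; the probability values and dictionary layout are made up for illustration, and for detection the result would additionally be scaled by the predicted Pr(physical object):

```python
def absolute_prob(node, cond, parent, root="physical object"):
    """Multiply conditional probabilities from `node` up to the root."""
    p = 1.0
    while node != root:
        p *= cond[node]          # Pr(node | parent(node))
        node = parent[node]
    return p

# Toy fragment of WordTree with made-up conditional probabilities.
parent = {"Norfolk terrier": "terrier", "terrier": "hunting dog",
          "hunting dog": "dog", "dog": "canine", "canine": "mammal",
          "mammal": "animal", "animal": "physical object"}
cond = {"Norfolk terrier": 0.9, "terrier": 0.8, "hunting dog": 0.95,
        "dog": 0.9, "canine": 0.99, "mammal": 0.9, "animal": 0.95}
print(absolute_prob("Norfolk terrier", cond, parent))
```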

For classification purposes we assume that the image contains an object, i.e. Pr(physical object) = 1. To validate this approach we train the Darknet-19 model on WordTree1k, a version of WordTree built from the 1000-class ImageNet labels. To build WordTree1k we add in all of the intermediate nodes, which expands the label space from 1000 to 1369. During training we propagate ground truth labels up the tree, so that if an image is labelled as a "Norfolk terrier" it is also labelled as a "dog", a "mammal", and so on. To compute the conditional probabilities the model predicts a vector of 1369 values, and we compute the softmax over all synsets that are hyponyms of the same concept; see Figure 5.
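
A sketch of this grouped softmax, assuming the 1369-dimensional prediction is partitioned into one index group per parent synset; the grouping data structure is an assumption, not the authors' layout:

```python
import numpy as np

def wordtree_softmax(logits, groups):
    """Apply an independent softmax to each group of co-hyponyms.

    `logits` is the raw prediction vector; `groups` is a list of index
    arrays, one per parent synset, covering that synset's children."""
    probs = np.empty_like(logits)
    for idx in groups:
        z = logits[idx] - logits[idx].max()   # subtract max for stability
        e = np.exp(z)
        probs[idx] = e / e.sum()
    return probs

# Toy example: two sibling groups inside a 5-way prediction.
logits = np.array([2.0, 0.5, -1.0, 0.3, 0.7])
groups = [np.array([0, 1, 2]), np.array([3, 4])]
print(wordtree_softmax(logits, groups))
```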

Figure 5: Prediction on ImageNet vs WordTree. Most ImageNet models use one large softmax to predict a probability distribution over all classes. Using WordTree, we instead perform multiple softmax operations over groups of co-hyponyms.

Using the same training parameters as before, our hierarchical Darknet-19 achieves 71.9% top-1 accuracy and 90.4% top-5 accuracy. Despite adding 369 additional concepts and having the network predict a tree structure, accuracy only drops marginally. Performing classification in this manner also has benefits: performance degrades gracefully on new or unknown object categories. For example, if the network sees a picture of a dog but is unsure what breed it is, it will still predict "dog" with high confidence while spreading lower confidences among the hyponyms.

This formulation also works for detection. Now, instead of assuming every image contains an object, we use YOLOv2's objectness predictor to give us the value of Pr(physical object). The detector predicts a bounding box together with the tree of probabilities. We traverse the tree downward, taking the highest-confidence branch at every split, until we reach some threshold and predict that object class.

Dataset combination with WordTree. We can use WordTree to combine multiple datasets in a sensible fashion: we simply map the categories in each dataset to synsets in the tree. Figure 6 shows an example of using WordTree to combine the labels from ImageNet and COCO. WordNet is extremely diverse, so this technique can be used with most datasets.

Figure 6: Combining datasets using the WordTree hierarchy. Using the WordNet concept graph we build a hierarchical tree of visual concepts. Datasets are then merged by mapping each dataset's classes to synsets in the tree. This is a simplified view of WordTree for illustration purposes.

Joint classification and detection. Now that we can combine datasets using WordTree, we can train our joint model on classification and detection. We want to train an extremely large scale detector, so we create our combined dataset using the COCO detection dataset and the top 9000 classes from the full ImageNet release. We also need to evaluate our method, so we add in any classes from the ImageNet detection challenge that were not already included. The corresponding WordTree for this dataset has 9418 classes. ImageNet is a much larger dataset, so we balance the combined dataset by oversampling COCO such that ImageNet is only larger by a factor of 4:1. Using this dataset we train YOLO9000. We use the base YOLOv2 architecture but with only 3 priors instead of 5 to limit the output size. When the network sees a detection image we backpropagate loss as normal. For classification loss, we only backpropagate loss at or above the corresponding level of the label. For example, if the label is "dog" we do not assign any error to predictions further down the tree, such as "German Shepherd" versus "Golden Retriever", because we do not have that information.

When the network sees a classification image we only backpropagate classification loss. To do this we simply find the bounding box that predicts the highest probability for that class and compute the loss on just its predicted tree. We also assume that the predicted box overlaps what would be the ground truth label by at least .3 IOU, and we backpropagate objectness loss based on this assumption. Using this joint training, YOLO9000 learns to find objects in images using the detection data in COCO and learns to classify a wide variety of these objects using data from ImageNet.
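
A simplified, illustrative view of how the loss terms are routed for the two kinds of training images; real code would operate on actual predictions, and only the routing rule from the text is shown here:

```python
def loss_terms(sample_type):
    """Which loss components are backpropagated for each kind of image,
    following the routing rule described above (a simplified view)."""
    if sample_type == "detection":        # e.g. a COCO image with boxes
        return {"coordinates": True, "objectness": True, "classification": True}
    if sample_type == "classification":   # e.g. an ImageNet image, label only
        # Only the box predicting the labelled class best contributes, and
        # only its WordTree classification loss and objectness loss (under
        # the >= .3 IOU assumption) are used; coordinates are not updated.
        return {"coordinates": False, "objectness": True, "classification": True}
    raise ValueError(sample_type)

print(loss_terms("detection"))
print(loss_terms("classification"))
```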

We evaluate YOLO9000 on the ImageNet detection task. The detection task for ImageNet shares only 44 object categories with COCO, which means that YOLO9000 has only seen classification data, not detection data, for the majority of the test categories. YOLO9000 gets 19.7 mAP overall, with 16.0 mAP on the disjoint 156 object classes for which it has never seen any labelled detection data. This mAP is higher than results achieved by DPM, even though YOLO9000 is trained on different datasets with only partial supervision [4]. It is also simultaneously detecting 9000 other object categories, all in real-time. When we analyze YOLO9000's performance on ImageNet, we see that it learns new species of animals well but struggles with categories like clothing and equipment.

New species of animals are easier to learn because the objectness predictions generalize well from the animals in COCO. Conversely, COCO does not have bounding box labels for any type of clothing, only for people, so YOLO9000 struggles to model categories like "sunglasses" or "swimming trunks".

5 Conclusion

We introduce YOLOv2 and YOLO9000, two real-time detection systems. YOLOv2 is state-of-the-art and faster than other detection systems across a variety of detection datasets, and it can be run at different image sizes to provide a smooth tradeoff between speed and accuracy. YOLO9000 is a real-time framework for detecting more than 9000 object categories by jointly optimizing detection and classification. We use WordTree to combine data from various sources and our joint optimization technique to train simultaneously on ImageNet and COCO. YOLO9000 is a strong step towards closing the dataset size gap between detection and classification. Many of our techniques generalize beyond object detection: the WordTree representation of ImageNet offers a richer, more detailed output space for image classification, and dataset combination using hierarchical classification would be useful in the classification and segmentation domains. Training techniques like multi-scale training could also benefit a variety of visual tasks. For future work, we hope to use similar techniques for weakly supervised image segmentation, and we plan to improve our detection results with more powerful matching strategies for assigning weak labels to classification data during training. Computer vision is blessed with an enormous amount of labelled data; we will continue looking for ways to bring different sources and structures of data together to make stronger models of the visual world.
