
Paper Translation: You Only Look Once: Unified, Real-Time Object Detection


Abstract

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

Our unified architecture is extremely fast. The base YOLO model processes images in real time at 45 frames per second. A smaller version of the network, Fast YOLO, processes images at 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

1. Introduction

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general-purpose, responsive robotic systems.

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on those proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

Figure 1: The YOLO Detection System. Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model's confidence.

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.

First, YOLO is extremely fast. Since we frame detection as a regression problem we don't need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU, and a fast version runs at more than 150 fps. This means we can process streaming video in real time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real time on a webcam, see our project webpage: http://pjreddie.com/yolo/.

Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time, so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can't see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable, it is less likely to break down when applied to new domains or unexpected inputs.

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images, it struggles to precisely localize some objects, especially small ones. We examine these trade-offs further in our experiments. All of our training and testing code is open source, and a variety of pretrained models are available to download.

2. Unified Detection

We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in it. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.

Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
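
As a concrete illustration, here is a minimal sketch (our own, not from the paper) of that cell-assignment rule — the cell containing the object's center is the one responsible for it:

```python
def responsible_cell(x_center, y_center, img_w, img_h, S=7):
    """Return the (row, col) of the grid cell whose region contains
    the object's center, i.e. the cell responsible for detecting it."""
    col = int(x_center / img_w * S)   # horizontal cell index
    row = int(y_center / img_h * S)   # vertical cell index
    # clamp in case the center lies exactly on the right/bottom edge
    return min(row, S - 1), min(col, S - 1)

# An object centered at (300, 150) in a 448x448 image falls in cell (2, 4).
print(responsible_cell(300, 150, 448, 448))
```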

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the predicted box is. Formally we define confidence as

$$\Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}}$$

If no object exists in a cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally, the confidence prediction represents the IOU between the predicted box and any ground truth box.
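
This parametrization can be made concrete with a short sketch; the helper below is hypothetical, but it follows the stated conventions (cell-relative center, image-relative width and height):

```python
import numpy as np

def encode_box(x_c, y_c, w, h, img_w, img_h, S=7):
    """Encode an absolute-pixel box (center x/y, width, height) into
    YOLO targets: (x, y) as offsets within the responsible cell in [0, 1],
    (w, h) as fractions of the whole image."""
    col = min(int(x_c / img_w * S), S - 1)
    row = min(int(y_c / img_h * S), S - 1)
    x = x_c / img_w * S - col          # offset from the cell's left edge
    y = y_c / img_h * S - row          # offset from the cell's top edge
    return row, col, np.array([x, y, w / img_w, h / img_h])
```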

Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.

At test time we multiply the conditional class probabilities and the individual box confidence predictions,

$$\Pr(\text{Class}_i \mid \text{Object}) \times \Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \times \text{IOU}^{\text{truth}}_{\text{pred}}$$

which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
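
In code, this test-time multiplication is a single broadcasted product; the array names and shapes below (for S = 7, B = 2, C = 20, introduced below) are our own framing:

```python
import numpy as np

class_probs = np.random.rand(7, 7, 20)   # Pr(Class_i | Object) per cell
box_conf    = np.random.rand(7, 7, 2)    # Pr(Object) * IOU per box

# Broadcast: one score per (cell, box, class) -> shape (7, 7, 2, 20).
class_scores = box_conf[..., :, None] * class_probs[..., None, :]
```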

Figure 2: The Model. Our system models detection as a regression problem. It divides the image into an S × S grid and for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S × S × (B ∗ 5 + C) tensor.

For evaluating YOLO on PASCAL VOC, we use S = 7 and B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor.
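
The tensor depth follows directly from the grid arithmetic; a one-line check:

```python
S, B, C = 7, 2, 20
assert B * 5 + C == 30 and S * S * (B * 5 + C) == 1470  # 7 x 7 x 30 predictions
```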

2.1 Network Design

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.

Figure 3: The Architecture. Our detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the feature space from the preceding layers. We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224 × 224 input image) and then double the resolution for detection.

Our network architecture is inspired by the GoogLeNet model for image classification [33]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al. [22]. The full network is shown in Figure 3.
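
The "1 × 1 reduction followed by 3 × 3 convolution" pattern can be sketched as below; this is an illustrative PyTorch block with example channel counts, not the paper's exact layer-by-layer configuration:

```python
import torch.nn as nn

def reduce_then_conv(in_ch, mid_ch, out_ch):
    """One repeated unit: a 1x1 layer shrinks the channel count,
    then a 3x3 layer re-expands it."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1),             # 1x1 reduction
        nn.LeakyReLU(0.1),                                   # leaky activation (Section 2.2)
        nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1), # 3x3 convolution
        nn.LeakyReLU(0.1),
    )

# e.g. a middle block alternating 512 -> 256 -> 512 channels (illustrative).
block = reduce_then_conv(512, 256, 512)
```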

We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.

The final output of our network is the 7 × 7 × 30 tensor of predictions.
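
Decoding that tensor back into image-space boxes just inverts the parametrization described above. The slicing order below is our assumption (the paper fixes the sizes, not the memory layout):

```python
import numpy as np

S, B, C = 7, 2, 20
pred = np.random.rand(S, S, B * 5 + C)   # stand-in for one network output

# Assume B boxes of (x, y, w, h, conf) followed by the C class probabilities.
boxes = pred[..., :B * 5].reshape(S, S, B, 5)
rows, cols = np.meshgrid(np.arange(S), np.arange(S), indexing="ij")

# Undo the parametrization: (x, y) are cell offsets, (w, h) are fractions
# of the image. Results are in [0, 1] image coordinates.
x = (cols[..., None] + boxes[..., 0]) / S
y = (rows[..., None] + boxes[..., 1]) / S
w, h = boxes[..., 2], boxes[..., 3]
```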

2.2 Training

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [29]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe's Model Zoo [24].

We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [28]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information, so we increase the input resolution of the network from 224 × 224 to 448 × 448.

Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.

We use a linear activation function for the final layer, and all other layers use the following leaky rectified linear activation:

$$\phi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases}$$

We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize; however, it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error, which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the confidence scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.

To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don't contain objects. We use two parameters, $\lambda_{\text{coord}}$ and $\lambda_{\text{noobj}}$, to accomplish this. We set $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$.

Sum-squared error also weights errors in large boxes and small boxes equally. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this, we predict the square root of the bounding box width and height instead of the width and height directly.
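
A quick numeric check of the effect (our own illustration, not from the paper): under the square root, the same 10-pixel width error costs far less for a wide box than for a narrow one:

```python
import math

def sq_err_sqrt(w_true, w_pred):
    """Squared error between square roots of widths."""
    return (math.sqrt(w_true) - math.sqrt(w_pred)) ** 2

# A 10-pixel error on a 300-px-wide box vs. a 20-px-wide box:
print(sq_err_sqrt(300, 310))  # ~0.082 -- barely penalized
print(sq_err_sqrt(20, 30))    # ~1.010 -- penalized much more
```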

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be "responsible" for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.

During training we optimize the following multi-part loss function:

$$
\begin{aligned}
&\lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij}\left[(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2\right] \\
+\ &\lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij}\left[\left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2\right] \\
+\ &\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij}\left(C_i - \hat{C}_i\right)^2 \\
+\ &\lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{\text{noobj}}_{ij}\left(C_i - \hat{C}_i\right)^2 \\
+\ &\sum_{i=0}^{S^2} \mathbb{1}^{\text{obj}}_{i} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
$$

where $\mathbb{1}^{\text{obj}}_{i}$ denotes whether an object appears in cell $i$ and $\mathbb{1}^{\text{obj}}_{ij}$ denotes that the $j$th bounding box predictor in cell $i$ is responsible for that prediction.
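
A compact numpy sketch of the coordinate and confidence terms follows (the class-probability term and the construction of the responsibility mask are omitted for brevity; the shapes are our own framing, not the paper's):

```python
import numpy as np

def yolo_box_loss(pred, target, obj_ij, l_coord=5.0, l_noobj=0.5):
    """pred and target are (S*S, B, 5) arrays of x, y, w, h, conf;
    obj_ij is the (S*S, B) responsibility mask from the equation above."""
    noobj_ij = 1.0 - obj_ij
    xy   = ((pred[..., :2] - target[..., :2]) ** 2).sum(axis=-1)
    wh   = ((np.sqrt(pred[..., 2:4]) - np.sqrt(target[..., 2:4])) ** 2).sum(axis=-1)
    conf = (pred[..., 4] - target[..., 4]) ** 2
    return (l_coord * (obj_ij * (xy + wh)).sum()
            + (obj_ij * conf).sum()
            + l_noobj * (noobj_ij * conf).sum())
```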

Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is responsible for the ground truth box (i.e., has the highest IOU of any predictor in that grid cell).

We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. When testing on 2012 we also include the VOC 2007 test data for training. Throughout training we use a batch size of 64, a momentum of 0.9, and a decay of 0.0005.

Our learning rate schedule is as follows: for the first epochs we slowly raise the learning rate from $10^{-3}$ to $10^{-2}$. If we start at a high learning rate our model often diverges due to unstable gradients. We continue training with $10^{-2}$ for 75 epochs, then $10^{-3}$ for 30 epochs, and finally $10^{-4}$ for 30 epochs.
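
As a sketch, the schedule is easy to express as a function of the epoch; the paper does not state the warm-up length, so the `warmup` parameter below is an assumption:

```python
def yolo_lr(epoch, warmup=5):
    """Piecewise learning-rate schedule from Section 2.2."""
    if epoch < warmup:                       # slowly raise 1e-3 -> 1e-2
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup
    if epoch < warmup + 75:                  # 75 epochs at 1e-2
        return 1e-2
    if epoch < warmup + 105:                 # then 30 epochs at 1e-3
        return 1e-3
    return 1e-4                              # final 30 epochs at 1e-4
```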

To avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate = 0.5 after the first connected layer prevents co-adaptation between layers [18]. For data augmentation we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.
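
A sketch of drawing those augmentation parameters; the sampling ranges (e.g. [1/1.5, 1.5] for exposure and saturation) are our reading of "up to a factor of 1.5", and applying them to an actual image is left out:

```python
import random

def jitter_factors(max_shift=0.2, max_hsv=1.5):
    """Draw random augmentation parameters for one training image."""
    scale = 1.0 + random.uniform(-max_shift, max_shift)      # random scaling
    tx = random.uniform(-max_shift, max_shift)               # translation, fraction of width
    ty = random.uniform(-max_shift, max_shift)               # translation, fraction of height
    exposure   = random.uniform(1.0 / max_hsv, max_hsv)      # V channel multiplier
    saturation = random.uniform(1.0 / max_hsv, max_hsv)      # S channel multiplier
    return scale, tx, ty, exposure, saturation
```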

2.3 Inference

Just like in training, predicting detections for a test image only requires one network evaluation. On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.

The grid design enforces spatial diversity in the bounding box predictions. Often it is clear which grid cell an object falls into, and the network only predicts one box for each object. However, some large objects or objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.
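
For reference, a minimal greedy non-maximal suppression looks like the sketch below; the 0.5 overlap threshold is a common default, not a value from the paper:

```python
import numpy as np

def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring box, drop heavily overlapping ones, repeat."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < thresh]
    return keep
```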

2.4 Limitations of YOLO

YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.

Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.

Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes as in large bounding boxes. A small error in a large box is generally benign, but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations.

3. Comparison to Other Detection Systems

Object detection is a core problem in computer vision. Detection pipelines generally start by extracting a set of robust features from input images (Haar [25], SIFT [23], HOG [4], convolutional features [6]). Then, classifiers or localizers are used to identify objects in the feature space. These classifiers or localizers are run either in sliding window fashion over the whole image or on some subset of regions in the image. We compare the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.

Deformable parts models. Deformable parts models (DPM) use a sliding window approach to object detection [10]. DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high scoring regions, and so on. Our system replaces all of these disparate parts with a single convolutional neural network. The network performs feature extraction, bounding box prediction, non-maximal suppression, and contextual reasoning all concurrently. Instead of static features, the network trains the features in-line and optimizes them for the detection task. Our unified architecture leads to a faster, more accurate model than DPM.

R-CNN. R-CNN and its variants use region proposals instead of sliding windows to find objects in images. Selective Search [34] generates potential bounding boxes, a convolutional network extracts features, an SVM scores the boxes, a linear model adjusts the bounding boxes, and non-max suppression eliminates duplicate detections. Each stage of this complex pipeline must be precisely tuned independently, and the resulting system is very slow, taking more than 40 seconds per image at test time [14].

YOLO shares some similarities with R-CNN. Each grid cell proposes potential bounding boxes and scores those boxes using convolutional features. However, our system puts spatial constraints on the grid cell proposals, which helps mitigate multiple detections of the same object. Our system also proposes far fewer bounding boxes, only 98 per image compared to about 2000 from Selective Search. Finally, our system combines these individual components into a single, jointly optimized model.

Other fast detectors. Fast and Faster R-CNN focus on speeding up the R-CNN framework by sharing computation and using neural networks to propose regions instead of Selective Search [14] [27]. While they offer speed and accuracy improvements over R-CNN, both still fall short of real-time performance.

Many research efforts focus on speeding up the DPM pipeline [30] [37] [5]. They speed up HOG computation, use cascades, and push computation to GPUs. However, only the 30Hz DPM [30] actually runs in real time.

Instead of trying to optimize individual components of a large detection pipeline, YOLO throws out the pipeline entirely and is fast by design.

Detectors for single classes like faces or people can be highly optimized since they have to deal with much less variation [36]. YOLO is a general-purpose detector that learns to detect a variety of objects simultaneously.

Deep MultiBox. Unlike R-CNN, Szegedy et al. train a convolutional neural network to predict regions of interest [8] instead of using Selective Search. MultiBox can also perform single object detection by replacing the confidence prediction with a single class prediction. However, MultiBox cannot perform general object detection and is still just a piece in a larger detection pipeline, requiring further image patch classification. Both YOLO and MultiBox use a convolutional network to predict bounding boxes in an image, but YOLO is a complete detection system.

OverFeat. Sermanet et al. train a convolutional neural network to perform localization and adapt that localizer to perform detection [31]. OverFeat efficiently performs sliding window detection, but it is still a disjoint system. OverFeat optimizes for localization, not detection performance. Like DPM, the localizer only sees local information when making a prediction. OverFeat cannot reason about global context and thus requires significant post-processing to produce coherent detections.

MultiGrasp. Our work is similar in design to work on grasp detection by Redmon et al. [26]. Our grid approach to bounding box prediction is based on the MultiGrasp system for regression to grasps. However, grasp detection is a much simpler task than object detection. MultiGrasp only needs to predict a single graspable region for an image containing one object. It doesn't have to estimate the size, location, or boundaries of the object or predict its class, only find a region suitable for grasping. YOLO predicts both bounding boxes and class probabilities for multiple objects of multiple classes in an image.

4. Experiments

First we compare YOLO with other real-time detection systems on PASCAL VOC 2007. To understand the differences between YOLO and R-CNN variants, we explore the errors on VOC 2007 made by YOLO and Fast R-CNN, one of the highest performing versions of R-CNN [14]. Based on the different error profiles, we show that YOLO can be used to rescore Fast R-CNN detections and reduce the errors from background false positives, giving a significant performance boost. We also present VOC 2012 results and compare mAP to current state-of-the-art methods. Finally, we show that YOLO generalizes to new domains better than other detectors on two artwork datasets.

4.1 Comparison to Other Real-Time Systems

Many research efforts in object detection focus on making standard detection pipelines fast. However, only Sadeghi et al. actually produce a detection system that runs in real time (30 frames per second or better) [30]. We compare YOLO to their GPU implementation of DPM, which runs either at 30Hz or 100Hz. While the other efforts don't reach the real-time milestone, we also compare their relative mAP and speed to examine the accuracy-performance trade-offs available in object detection systems.

Fast YOLO is the fastest object detection method on PASCAL; as far as we know, it is the fastest extant object detector. With 52.7% mAP, it is more than twice as accurate as prior work on real-time detection. YOLO pushes mAP to 63.4% while still maintaining real-time performance.

We also train YOLO using VGG-16. This model is more accurate but also significantly slower than YOLO. It is useful for comparison to other detection systems that rely on VGG-16, but since it is slower than real time, the rest of the paper focuses on our faster models.

Fastest DPM effectively speeds up DPM without sacrificing much mAP, but it still misses real-time performance by a factor of 2 [37]. It is also limited by DPM's relatively low accuracy on detection compared to neural network approaches.

R-CNN minus R replaces Selective Search with static bounding box proposals [20]. While it is much faster than R-CNN, it still falls short of real time, and its accuracy takes a significant hit from not having good proposals.

Fast R-CNN speeds up the classification stage of R-CNN, but it still relies on Selective Search, which can take around 2 seconds per image to generate bounding box proposals. Thus it has high mAP but, at 0.5 fps, it is still far from real time.

The recent Faster R-CNN replaces Selective Search with a neural network to propose bounding boxes, similar to Szegedy et al. [8]. In our tests, their most accurate model achieves 7 fps while a smaller, less accurate one runs at 18 fps. The VGG-16 version of Faster R-CNN is 10 mAP higher but is also 6 times slower than YOLO. The Zeiler-Fergus version is only 2.5 times slower than YOLO but is also less accurate.

4.2 VOC 2007 Error Analysis

To further examine the differences between YOLO and state-of-the-art detectors, we look at a detailed breakdown of results on VOC 2007. We compare YOLO to Fast R-CNN, since Fast R-CNN is one of the highest performing detectors on PASCAL and its detections are publicly available.

We use the methodology and tools of Hoiem et al. [19]. For each category at test time we look at the top N predictions for that category. Each prediction is either correct or classified based on the type of error:

•Correct: correct class and IOU > .5

•Localization: correct class, .1 < IOU < .5

•Similar: class is similar, IOU > .1

•Other: class is wrong, IOU > .1

•Background: IOU < .1 for any object

YOLO struggles to localize objects correctly. Localization errors account for more of YOLO's errors than all other sources combined. Fast R-CNN makes far fewer localization errors but far more background errors: 13.6% of its top detections are false positives that don't contain any objects. Fast R-CNN is almost 3 times more likely to predict background detections than YOLO.

4.3 Combining Fast R-CNN and YOLO

YOLO makes far fewer background mistakes than Fast R-CNN. By using YOLO to eliminate background detections from Fast R-CNN we get a significant boost in performance. For every bounding box that R-CNN predicts, we check to see if YOLO predicts a similar box. If it does, we give that prediction a boost based on the probability predicted by YOLO and the overlap between the two boxes.
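
A hedged sketch of this rescoring, reusing the `iou` helper from the NMS sketch above: the paper says the boost depends on YOLO's probability and the overlap but does not give the exact formula, so the multiplicative form and the `min_iou` cutoff below are our guesses:

```python
def rescore(rcnn_boxes, rcnn_scores, yolo_boxes, yolo_probs, min_iou=0.3):
    """Boost Fast R-CNN detections that YOLO agrees with."""
    out = []
    for box, score in zip(rcnn_boxes, rcnn_scores):
        overlaps = [(iou(box, yb), p) for yb, p in zip(yolo_boxes, yolo_probs)]
        best_iou, best_p = max(overlaps, default=(0.0, 0.0))
        if best_iou > min_iou:                  # YOLO predicts a similar box
            score = score + best_p * best_iou   # boost by probability and overlap
        out.append(score)
    return out
```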

The best Fast R-CNN model achieves a mAP of 71.8% on the VOC 2007 test set. When combined with YOLO, its mAP increases by 3.2% to 75.0%. We also tried combining the top Fast R-CNN model with several other versions of Fast R-CNN. Those ensembles produced small increases in mAP between 0.3 and 0.6%; see Table 2 for details.

The boost from YOLO is not simply a byproduct of model ensembling, since there is little benefit from combining different versions of Fast R-CNN. Rather, it is precisely because YOLO makes different kinds of mistakes at test time that it is so effective at boosting Fast R-CNN's performance.

Unfortunately, this combination doesn't benefit from the speed of YOLO since we run each model separately and then combine the results. However, since YOLO is so fast it doesn't add any significant computational time compared to Fast R-CNN alone.

4.4 VOC 2012 Results

On the VOC 2012 test set, YOLO scores 57.9% mAP. This is lower than the current state of the art, closer to the original R-CNN using VGG-16; see Table 3. Our system struggles with small objects compared to its closest competitors. On categories like bottle, sheep, and tv/monitor, YOLO scores 8-10% lower than R-CNN or Feature Edit. However, on other categories like cat and train, YOLO achieves higher performance. Our combined Fast R-CNN + YOLO model is one of the highest performing detection methods. Fast R-CNN gets a 2.3% improvement from the combination with YOLO, boosting it 5 spots up on the public leaderboard.

4.5 Generalizability: Person Detection in Artwork

Academic datasets for object detection draw the training and testing data from the same distribution. In real-world applications it is hard to predict all possible use cases, and the test data can diverge from what the system has seen before [3]. We compare YOLO to other detection systems on the Picasso Dataset [12] and the People-Art Dataset [3], two datasets for testing person detection on artwork.

Figure 5 shows comparative performance between YOLO and other detection methods. For reference, we give VOC 2007 detection AP on person, where all models are trained only on VOC 2007 data. On Picasso, models are trained on VOC 2012, while on People-Art they are trained on VOC 2010.

R-CNN has high AP on VOC 2007. However, R-CNN drops off considerably when applied to artwork. R-CNN uses Selective Search for bounding box proposals, which is tuned for natural images. The classifier step in R-CNN only sees small regions and needs good proposals.

DPM maintains its AP well when applied to artwork. Prior work theorizes that DPM performs well because it has strong spatial models of the shape and layout of objects. Though DPM doesn't degrade as much as R-CNN, it starts from a lower AP.

YOLO has good performance on VOC 2007 and its AP degrades less than other methods when applied to artwork. Like DPM, YOLO models the size and shape of objects, as well as relationships between objects and where objects commonly appear. Artwork and natural images are very different on a pixel level, but they are similar in terms of the size and shape of objects, so YOLO can still predict good bounding boxes and detections.

5. Real-Time Detection in the Wild

YOLO is a fast, accurate object detector, making it ideal for computer vision applications. We connect YOLO to a webcam and verify that it maintains real-time performance, including the time to fetch images from the camera and display the detections.

The resulting system is interactive and engaging. While YOLO processes images individually, when attached to a webcam it functions like a tracking system, detecting objects as they move around and change in appearance. A demo of the system and the source code can be found on our project website: http://pjreddie.com/yolo/.

6. Conclusion

We introduce YOLO, a unified model for object detection. Our model is simple to construct and can be trained directly on full images. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance, and the entire model is trained jointly. Fast YOLO is the fastest general-purpose object detector in the literature, and YOLO pushes the state of the art in real-time object detection. YOLO also generalizes well to new domains, making it ideal for applications that rely on fast, robust object detection.
