YOLO Series Reading (1): Reading the YOLOv1 Paper, You Only Look Once: Unified, Real-Time Object Detection

Published: 2025/4/5

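0. Abstract
- 0.1 Translation of the original text
  - Paragraph 1 (how this work differs from earlier work)
  - Paragraph 2 (fast; more localization errors, but fewer background false positives)
- 0.2 Summary
1. Introduction
- 1.1 Translation
  - Paragraph 1 (why the research matters: 1. it mirrors human vision; 2. it can help solve many practical problems)
  - Paragraph 2 (drawbacks of splitting detection into proposal and classification steps)
  - Paragraph 3 (YOLO's advantages)
  - Paragraph 4 (YOLO produces results directly)
  - Paragraph 5 (first, YOLO runs in real time and is much stronger than other real-time systems)
  - Paragraph 6 (second, YOLO reasons globally, so it makes fewer background errors)
  - Paragraph 7 (third, it generalizes well)
  - Paragraph 8 (remaining weaknesses: lower overall accuracy, poor localization of small objects)
  - Paragraph 9
- 1.2 Summary
2. Unified Detection
- 2.0 Overview
  - 2.0.1 Translation
    - Paragraph 1 (proposal and classification solved in one pass; global reasoning; end to end)
    - Paragraph 2 (explaining the predictions)
    - Paragraph 3 (how the model scores its boxes)
    - Paragraph 4 (explaining the predictions)
    - Paragraph 5 (the class predictions in detail)
    - Paragraph 6 (why the prediction design is logically sound)
    - Paragraph 7 (output format)
    - Figure content: Fig. 2
  - 2.0.2 Summary
- 2.1. Network Design
  - 2.1.1 Sentence-by-sentence translation
    - Paragraph 1 (the data set and the overall model)
    - Paragraph 2 (the network in more detail)
    - Paragraph 3 (a brief introduction to Fast YOLO)
    - Paragraph 4 (the output)
  - 2.1.2 Summary
- 2.2. Training
  - Sentence-by-sentence translation
    - Paragraph 1 (roughly, how the authors implemented it)
    - Paragraph 2 (roughly, adding layers to the pretrained network and raising the resolution)
    - Paragraph 3 (normalization)
    - Paragraph 4 (torch.nn.LeakyReLU())
    - Paragraph 5 (evaluating coordinate error and classification error together)
    - Paragraph 6 (how the unequal weighting is implemented)
    - Paragraph 7 (balancing errors between large and small boxes)
    - Paragraph 8 (assigning a specific predictor to each object)
    - Paragraph 9 (the loss function)
    - Paragraph 10 (each loss term penalizes one factor in isolation)
    - Paragraph 11 (practical training details)
    - Paragraph 12 (choosing the learning rate)
    - Paragraph 13 (data augmentation)
- 2.3. Inference
  - 2.3.1 Sentence-by-sentence translation
    - Paragraph 1 (testing, like training, needs only one pass)
    - Paragraph 2 (non-maximal suppression)
  - 2.3.2 Summary
- 2.4. Limitations of YOLO
  - 2.4.1 Sentence-by-sentence translation
    - Paragraph 1 (objects close to each other are hard for the model)
    - Paragraph 2 (affected by the properties of the input image; generalization is hard to guarantee)
    - Paragraph 3 (large versus small bounding boxes)
  - 2.4.2 Summary: the limitations of the model
3. Comparison to Other Detection Systems
- 3.1 Sentence-by-sentence translation
  - Paragraph 1 (roughly: earlier systems all work in two stages)
  - Paragraph 2 (DPM)
4. Experiments
- 4.0 Preface
  - 4.0.1 Sentence-by-sentence translation
  - 4.0.2 Summary
- 4.1 Comparison to Other Real-Time Systems (compared with recent work)
  - Paragraph 1 (truly real-time systems are rare; slower systems are compared as well, to judge whether the speed gained is worth the mAP lost)
  - Paragraph 2 (from Fast YOLO to YOLO: both are very fast)
  - Paragraph 3 (YOLO combined with VGG-16 is more accurate but slower than the standard YOLO)
  - Paragraph 4 (the failure of R-CNN minus R)
  - Paragraph 5 (Fast R-CNN speeds up R-CNN but is still far from real time)
  - Paragraph 6 (Faster R-CNN: the VGG-16 version is more accurate than YOLO but much slower; the Zeiler-Fergus version is closer in speed but less accurate)
- 4.2. VOC 2007 Error Analysis
  - Paragraph 1
- 4.6 Summary
5. Apology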

0. Abstract

0.1 Translation of the original text

Paragraph 1 (how this work differs from earlier work)

We present YOLO, a new approach to object detection.
The paper introduces YOLO, a new method for object detection.

Prior work on object detection repurposes classifiers to perform detection.
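Earlier object detection work repurposes classifiers to carry out detection. (In other words, detection takes two steps: find candidate regions, then classify them.)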

Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
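Instead, detection is framed as a regression problem that outputs spatially separated bounding boxes and their class probabilities. (Roughly: the whole image goes in and the detection result is regressed directly. That result is simply S×S×B tuples of (x, y, w, h, c) plus a tensor of class probabilities. S×S×B is only the number of boxes; the values inside each tuple are the actual regression outputs.)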

A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.
One neural network predicts bounding boxes and class probabilities directly from the full image in a single evaluation.

Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.
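Because the whole detection pipeline is one network, it can be optimized end to end directly for detection performance.
(My reading: "pipeline" describes the fixed, multi-stage processing structure of earlier systems.)
(End-to-end roughly means going straight from an input image to its predictions.)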

Paragraph 2 (fast; more localization errors, but fewer background false positives)

Our unified architecture is extremely fast.
The unified architecture is extremely fast.

Our base YOLO model processes images in real-time at 45 frames per second.
The base YOLO model runs in real time at 45 frames per second.

A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
The smaller Fast YOLO network reaches 155 frames per second while still achieving twice the mAP of other real-time detectors.

Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background.
Compared with state-of-the-art detectors, YOLO makes more localization errors but is less likely to report false positives on background.

Finally, YOLO learns very general representations of objects.
Finally, YOLO learns very general representations of objects.

It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
When generalizing from natural images to other domains such as artwork, it outperforms other detection methods, including DPM and R-CNN.

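0.2 Summary

1. The network only looks at the image once.
2. Because it only looks once, it is fast.
3. Because it only looks once, its overall error rate is slightly higher (mainly localization errors).
4. Despite the higher overall error rate, it makes fewer mistakes on background.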

1. Introduction

1.1 Translation

Paragraph 1 (why the research matters: 1. it mirrors human vision; 2. it can help solve many practical problems)

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact.
A person glances at an image and immediately knows what objects it contains, where they are, and how they interact. (One look is enough to tell where things are and how they are moving.)

The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought.
Human vision is fast and accurate, which lets us carry out complex tasks such as driving with little conscious thought.

Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.
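Fast, accurate detection algorithms would let computers drive cars without specialized sensors, let assistive devices pass real-time scene information to their users, and unlock the potential of general-purpose, responsive robots.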

Paragraph 2 (drawbacks of splitting detection into proposal and classification steps)

Current detection systems repurpose classifiers to perform detection.
Today's detection systems repurpose classifiers to perform detection.

To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image.
To detect an object, these systems take a classifier for that object and evaluate it at different locations and scales in the test image.

Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].
Systems such as deformable parts models (DPM) use a sliding window, running the classifier at evenly spaced locations over the entire image [10].

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes.
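More recent methods such as R-CNN first use region proposals to generate candidate bounding boxes and then run a classifier on those boxes. (A sliding window scans every location once; here only a selected set of likely regions is scanned.)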

After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13].
After classification, post-processing refines the bounding boxes, removes duplicate detections, and rescores the boxes based on other objects in the scene [13].

These complex pipelines are slow and hard to optimize because each individual component must be trained separately.
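These complex pipelines are slow and hard to optimize because every component has to be trained separately. (This reminds me of PointNet, covered earlier, where label generation has to be optimized in an intermediate step.)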

Paragraph 3 (YOLO's advantages)

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities.
Object detection is recast as a single regression problem, going straight from image pixels to bounding box coordinates and class probabilities.

Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.
With this system you only look once (YOLO) at an image to predict which objects are present and where they are.

Paragraph 4 (YOLO produces results directly)

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes.
YOLO is very simple; see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and the class probabilities for those boxes.

YOLO trains on full images and directly optimizes detection performance.
YOLO trains on full images and optimizes detection performance directly.

This unified model has several benefits over traditional methods of object detection.
Compared with traditional detection methods, this unified model has several advantages.

Paragraph 5 (first, YOLO runs in real time and is much stronger than other real-time systems)

First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline.
First, YOLO is extremely fast. Because detection is treated as a regression problem, no complex pipeline is needed.

We simply run our neural network on a new image at test time to predict detections.
At test time the network is simply run on a new image to predict detections.

Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps.
The base network runs at 45 frames per second on a Titan X GPU without batching, and the fast version runs at more than 150 fps.

This means we can process streaming video in real-time with less than 25 milliseconds of latency.
This means streaming video can be processed in real time with less than 25 milliseconds of latency.
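The link between 45 frames per second and the 25 millisecond figure is simple arithmetic. The sketch below assumes that latency is dominated by the time to process one frame without batching; the helper name `frame_time_ms` is made up for illustration.

```python
def frame_time_ms(fps: float) -> float:
    """Time budget for one frame, in milliseconds, at the given frame rate."""
    return 1000.0 / fps


base_yolo = frame_time_ms(45)    # base network, 45 fps
fast_yolo = frame_time_ms(150)   # fast version, lower bound of 150 fps
print(f"base: {base_yolo:.1f} ms per frame, fast: {fast_yolo:.1f} ms per frame")
```

At 45 fps each frame takes roughly 22 ms, which is consistent with the "less than 25 milliseconds" claim.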

Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems.
In addition, YOLO achieves more than twice the mean average precision of other real-time systems.

For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.
For a demo of the system running in real time on a webcam, see the project page: http://pjreddie.com/yolo/

Paragraph 6 (second, YOLO reasons globally, so it makes fewer background errors)

Second, YOLO reasons globally about the image when making predictions.
Second, YOLO reasons about the image globally when making predictions.

Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance.
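Unlike sliding-window and region-proposal techniques, YOLO sees the entire image during training and testing, so it implicitly encodes contextual information about classes as well as their appearance.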

Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can’t see the larger context.
Fast R-CNN, a top detection method [14], mistakes background patches for objects because it cannot see the larger context.

YOLO makes less than half the number of background errors compared to Fast R-CNN.
YOLO makes fewer than half as many background errors as Fast R-CNN.

Paragraph 7 (third, it generalizes well)

Third, YOLO learns generalizable representations of objects.
Third, YOLO learns generalizable representations of objects.

When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin.
Trained on natural images and tested on artwork, YOLO outperforms top detection methods such as DPM and R-CNN by a wide margin.

Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.
Because YOLO generalizes well, it is less likely to break down on new domains or unexpected inputs.

Paragraph 8 (remaining weaknesses: lower overall accuracy, poor localization of small objects)

YOLO still lags behind state-of-the-art detection systems in accuracy.
YOLO still trails state-of-the-art detection systems in accuracy.

While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones.
It identifies objects quickly but struggles to localize some of them precisely, especially small ones.

We examine these tradeoffs further in our experiments.
These tradeoffs are examined further in the experiments.

Paragraph 9

All of our training and testing code is open source. A variety of pretrained models are also available to download.
All of the training and testing code is open source, and a variety of pretrained models can be downloaded.

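Figure 1: The YOLO detection system. Processing images with YOLO is simple and straightforward. The system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model's confidence.
(Using a confidence score here is a nice choice: the quantity does not have to be interpreted as a probability, so the paper simply calls it confidence.)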

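1.2 Summary

1. Real-time performance is valuable: processing images faster matters in practice.
2. YOLO achieves real-time performance.
3. Because YOLO takes global context into account, it handles background well.
4. It generalizes well.
5. It performs poorly on small objects (the authors partly mitigate this through the way box sizes are scaled in the loss).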

2. Unified Detection

2.0 Overview

2.0.1 Translation

Paragraph 1 (proposal and classification solved in one pass; global reasoning; end to end)

We unify the separate components of object detection into a single neural network.
The separate components of object detection are unified into a single neural network. (The former proposal and classification steps are merged into one.)

Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image.
The network uses features from the entire image to predict each bounding box, and it predicts all bounding boxes across all classes simultaneously. The network therefore reasons globally about the full image and every object in it.

The YOLO design enables end-to-end training and realtime speeds while maintaining high average precision.
The YOLO design allows end-to-end training and real-time speed while keeping average precision high.

Paragraph 2 (explaining the predictions)

Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
The input image is divided into an S × S grid, and the cell that contains an object's center is responsible for detecting that object. (This is the core idea of YOLO: the image becomes a set of grid cells, and each cell is responsible for B bounding boxes. Those B boxes share a single set of class probabilities, however, so all B boxes of a cell must belong to the same class. This is why YOLO performs poorly on nearby objects of different classes.)
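The responsibility rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `responsible_cell` and the pixel-coordinate convention (origin at the top-left corner) are assumptions.

```python
def responsible_cell(cx: float, cy: float, img_w: float, img_h: float, S: int = 7):
    """Return (row, col) of the grid cell that contains the object center (cx, cy).

    cx and cy are pixel coordinates; the image is split into S x S equal cells.
    """
    col = min(int(cx / img_w * S), S - 1)  # clamp so cx == img_w stays in the last column
    row = min(int(cy / img_h * S), S - 1)  # clamp so cy == img_h stays in the last row
    return row, col


# An object centered in the middle of a 448 x 448 image falls into the middle cell.
print(responsible_cell(224, 224, 448, 448))
```

Only this one cell is trained to predict the object, no matter how many cells the object overlaps.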

Paragraph 3 (how the model scores its boxes)

Each grid cell predicts B bounding boxes and confidence scores for those boxes.
Each grid cell predicts B bounding boxes and a confidence score for each box. (Why does a cell predict these boxes? Because the center of the object they describe falls inside that cell.)

These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts.
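These confidence scores reflect both how confident the model is that the box contains an object and how accurate it believes the predicted box to be. (The score covers two things: whether something is there, and whether the box is right. As the following text shows, the two are combined by multiplying them.)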

Formally we define confidence as
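Formally, confidence is defined as Pr(Object) × IOU^truth_pred. (The definition is a product of two terms.)

Pr(Object): 1 if an object is present in the cell, 0 otherwise.
IOU: the familiar intersection over union between the predicted box and the ground truth.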

If no object exists in that cell, the confidence scores should be zero.
If no object exists in the cell, the confidence score should be zero. (Simply set Pr(Object) to 0.)

Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.
Otherwise the confidence score should equal the intersection over union (IOU) between the predicted box and the ground truth.
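As a rough sketch of this definition, the confidence target can be computed from an IOU helper. The corner format (x_min, y_min, x_max, y_max) and the function names are assumptions made for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence_target(has_object: bool, pred_box, truth_box) -> float:
    """Pr(Object) * IOU: zero for an empty cell, otherwise the IOU with the ground truth."""
    return iou(pred_box, truth_box) if has_object else 0.0


print(confidence_target(True, (0, 0, 2, 2), (1, 1, 3, 3)))   # overlap of 1 over a union of 7
print(confidence_target(False, (0, 0, 2, 2), (1, 1, 3, 3)))  # no object, so 0.0
```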

Paragraph 4 (explaining the predictions)

Each bounding box consists of 5 predictions: x, y, w, h, and confidence.
Each bounding box consists of 5 predictions: x, y, w, h, and confidence.

The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell.
The (x, y) coordinates give the center of the bounding box relative to the bounds of its grid cell.

The width and height are predicted relative to the whole image.
The width and height are predicted relative to the whole image.

Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.
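The confidence prediction represents the IOU between the predicted box and any ground-truth box. (When an object really is present, the confidence is simply the IOU.)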
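The parametrization described in this paragraph can be illustrated with an encode/decode pair. This is a sketch under stated assumptions: boxes are given as pixel-space (center x, center y, width, height), and the names `encode_box` and `decode_box` are hypothetical.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Convert a pixel-space box into (row, col, x, y, w, h) in YOLO-style units.

    x and y are offsets inside the responsible cell (0 to 1);
    w and h are fractions of the whole image (0 to 1).
    """
    gx = cx / img_w * S                 # center in units of grid cells
    gy = cy / img_h * S
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)
    return row, col, gx - col, gy - row, w / img_w, h / img_h


def decode_box(row, col, x, y, w, h, img_w, img_h, S=7):
    """Invert encode_box and return (cx, cy, w, h) in pixels."""
    cx = (col + x) / S * img_w
    cy = (row + y) / S * img_h
    return cx, cy, w * img_w, h * img_h


encoded = encode_box(100, 200, 50, 80, 448, 448)
print(encoded)
print(decode_box(*encoded, 448, 448))
```

Decoding the encoded values recovers the original pixel box, which shows that the two coordinate systems carry the same information.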

Paragraph 5 (the class predictions in detail)

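Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object).
Each cell additionally outputs C class probabilities, conditioned on the cell containing an object.
(That is, the probability of each class given that an object is already known to be present.)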

These probabilities are conditioned on the grid cell containing an object.
These probabilities are conditional on the grid cell containing an object.

We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.
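Only one set of class probabilities is predicted per grid cell, regardless of the number of boxes B.
(B, as explained above, is the number of candidate bounding boxes per grid cell. All B boxes must therefore share one class, which is why YOLOv1 is weak at detecting nearby objects of different classes.)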

Paragraph 6 (why the prediction design is logically sound)

At test time we multiply the conditional class probabilities and the individual box confidence predictions,
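At test time the conditional class probabilities are multiplied by the individual box confidence predictions: Pr(Class_i | Object) × Pr(Object) × IOU^truth_pred = Pr(Class_i) × IOU^truth_pred. (The product directly yields the probability of the class; the authors presumably also want to show that the design is consistent with probability theory.)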

which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
This gives a class-specific confidence score for each box. The scores encode both the probability that the class appears in the box and how well the predicted box fits the object.
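A small sketch of this test-time multiplication for one grid cell follows. The function name `class_specific_scores` and the toy numbers are made up for illustration.

```python
def class_specific_scores(class_probs, box_confidences):
    """Multiply the cell's conditional class probabilities by each box confidence.

    class_probs:      C values of Pr(Class_i | Object), shared by the whole cell
    box_confidences:  B values of Pr(Object) * IOU, one per predicted box
    Returns a B x C nested list of class-specific confidence scores.
    """
    return [[p * conf for p in class_probs] for conf in box_confidences]


# One cell with C = 2 classes and B = 2 boxes.
scores = class_specific_scores([0.5, 0.5], [0.8, 0.1])
print(scores)
```

Because the class probabilities are shared, the two boxes of a cell always rank the classes in the same order; only the overall scale differs between boxes.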

Paragraph 7 (output format)

For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20.
Our final prediction is a 7 × 7 × 30 tensor.
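The 7 × 7 × 30 figure follows from S × S × (B × 5 + C): each cell holds B boxes with 5 values each plus C class probabilities. A quick check, with hypothetical helper names:

```python
def output_shape(S: int, B: int, C: int):
    """Shape of the prediction tensor: S x S cells, each with B*5 box values plus C class values."""
    return (S, S, B * 5 + C)


def boxes_per_image(S: int, B: int) -> int:
    """Total number of bounding boxes predicted for one image."""
    return S * S * B


print(output_shape(7, 2, 20))   # PASCAL VOC settings from the paper
print(boxes_per_image(7, 2))
```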

Figure content: Fig. 2

2.0.2 Summary

1. In short, each grid cell is responsible for part of the detections, and the scoring rule for each output is made explicit. What is unusual is that a classification task would normally end in a softmax that yields one probability per class, whereas here a detection confidence is produced directly to judge whether the box is drawn correctly.
2. Everything therefore revolves around the grid cell. Each cell ends up scoring the confidence of B boxes, but it generates only one set of class information (C values for C classes), so in effect only one of the B boxes can be useful.

2.1. Network Design

2.1.1 Sentence-by-sentence translation

Paragraph 1 (the data set and the overall model)

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset[9].
The model is implemented as a convolutional neural network and evaluated on the PASCAL VOC detection data set [9].

The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.
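The initial convolutional layers extract features from the image, and the fully connected layers predict the output probabilities and coordinates. (Before FCN appeared, fully connected layers were the usual choice for this final mapping. They learn well but also overfit easily.)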

Paragraph 2 (the network in more detail)

Our network architecture is inspired by the GoogLeNet model for image classification [34].
The network architecture is inspired by the GoogLeNet model for image classification [34].

Our network has 24 convolutional layers followed by 2 fully connected layers.
The network has 24 convolutional layers followed by 2 fully connected layers.

Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al [22]. The full network is shown in Figure 3.
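Instead of the inception modules used by GoogLeNet, the network simply uses 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al. [22]. The full network is shown in Figure 3.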

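(I find 1×1 convolutions to be quite effective at feature extraction. Here they are used to reduce and then expand the number of channels. If that is still unclear, the code shows that 1×1, 3×3, and 1×1 convolutions are applied in sequence to extract information.)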
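One way to see why a 1×1 reduction layer helps is to count weights. The sketch below compares a direct 3×3 convolution with a 1×1 reduction followed by a 3×3 convolution. The channel counts (512 in, 256 after reduction, 512 out) are chosen for illustration, and biases are ignored.

```python
def conv_weights(c_in: int, c_out: int, kernel: int) -> int:
    """Number of weights in a conv layer with square kernels (bias terms ignored)."""
    return c_in * c_out * kernel * kernel


# Direct 3x3 convolution from 512 to 512 channels.
direct = conv_weights(512, 512, 3)

# 1x1 reduction to 256 channels, then a 3x3 convolution back up to 512 channels.
reduced = conv_weights(512, 256, 1) + conv_weights(256, 512, 3)

print(direct, reduced, round(reduced / direct, 3))
```

With these numbers the reduced pair needs a little over half the weights of the direct layer, while still producing a 512-channel output.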

Paragraph 3 (a brief introduction to Fast YOLO)

We also train a fast version of YOLO designed to push the boundaries of fast object detection.
A fast version of YOLO is also trained, designed to push the boundaries of fast object detection.
("Push the boundaries" here roughly means advancing the state of the field.)

Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers.
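Fast YOLO uses a network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers.
(Filters are the individual convolution kernels within a convolutional layer; see the post "The relationship between the number of kernels and the numbers of input and output channels".)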

Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.
Apart from network size, all training and testing parameters are identical for YOLO and Fast YOLO.

Paragraph 4 (the output)

The final output of our network is the 7 × 7 × 30 tensor of predictions.
The final output of the network is a 7 × 7 × 30 tensor of predictions.

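2.1.2 Summary

This part describes the network's input and explains the structure of its output.
Roughly:

1. The network uses a GoogLeNet-like structure (a product of its time) to extract a 7×7 feature map in which each pixel corresponds to one grid cell. (Note that each pixel of the feature map mainly carries information from its own grid cell.)
2. The channels of every pixel are permuted together and passed through fully connected layers, which mix in global information. (Only at this point does global information become available.)
3. The result is permuted back for further information extraction.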

2.2. Training

Sentence-by-sentence translation

Paragraph 1 (roughly, how the authors implemented it)

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30].
The convolutional layers are pretrained on the ImageNet 1000-class competition data set [30].
(Pretraining the lower layers helps the overall performance of the network.)

For pretraining we use the first 20 convolutional layers from Figure 3 followed by a average-pooling layer and a fully connected layer.
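For pretraining, the first 20 convolutional layers from Figure 3 are used, followed by an average-pooling layer and a fully connected layer. (The shallow layers are hard to optimize fully during the later detection training.)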

We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe’s Model Zoo [24].
The network is trained for about a week and reaches a single-crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe's Model Zoo [24]. (In short, the pretrained network is already quite accurate.)

We use the Darknet framework for all training and inference [26].
All training and inference use the Darknet framework [26]. (A framework the author developed earlier.)

Paragraph 2 (roughly, adding layers to the pretrained network and raising the resolution)

We then convert the model to perform detection.
The model is then converted to perform detection.

Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [29].
Ren et al. show that adding both convolutional and connected layers to a pretrained network can improve performance [29].

Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights.
Following their example, four convolutional layers and two fully connected layers with randomly initialized weights are added.
(The random weights are in contrast to the pretrained weights of the earlier layers.)

Detection often requires fine grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.
Detection often needs fine-grained visual information, so the input resolution is increased from 224 × 224 to 448 × 448.

Paragraph 3 (normalization)

Our final layer predicts both class probabilities and bounding box coordinates.
The final layer predicts both class probabilities and bounding box coordinates.

We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1.
Box width and height are normalized by the image width and height so that they fall between 0 and 1.

We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.
The box x and y coordinates are parametrized as offsets from a particular grid cell location, so they are also bounded between 0 and 1.

Paragraph 4 (torch.nn.LeakyReLU())

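We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation: φ(x) = x if x > 0, and 0.1x otherwise.
The final layer uses a linear activation; every other layer uses the leaky rectified linear activation given above.
(In PyTorch this corresponds to torch.nn.LeakyReLU(negative_slope=0.1); note that the default negative_slope is 0.01.)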
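For reference, the leaky rectified linear activation used in the paper passes positive inputs through unchanged and scales negative inputs by 0.1. A dependency-free sketch (the function name is chosen for this example):

```python
def leaky_relu(x: float, negative_slope: float = 0.1) -> float:
    """Leaky ReLU: identity for positive inputs, a small linear slope for the rest."""
    return x if x > 0 else negative_slope * x


print([leaky_relu(v) for v in (-2.0, 0.0, 3.0)])
```

Unlike a plain ReLU, negative inputs keep a small non-zero gradient, so units are less likely to become permanently inactive.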

Paragraph 5 (evaluating coordinate error and classification error together)

We optimize for sum-squared error in the output of our model.
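The model is optimized for sum-squared error in its output.
(Sum-squared error combines the localization error, i.e. the coordinate error of the boxes, with the classification error. Because coordinates and classes come out of a single training pass, both have to be evaluated together.)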

We use sum squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision.
Sum-squared error is used because it is easy to optimize, but it does not align perfectly with the goal of maximizing average precision.

It weights localization error equally with classification error which may not be ideal.
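Localization error is weighted equally with classification error, which may not be ideal.
(There are many classification terms, so lumping everything together is not appropriate.)

Also, in every image many grid cells do not contain any object.
In addition, many grid cells in every image contain no object at all.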

This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects.
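This pushes the confidence scores of those cells toward zero, often overpowering the gradient from the cells that do contain objects.
(Cells without objects are so numerous that, if all cells were weighted equally, the empty cells would be optimized well while the cells that actually predict objects would be under-optimized.)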

This can lead to model instability, causing training to diverge early on.
This can make the model unstable and cause training to diverge early on.

Paragraph 6 (how the unequal weighting is implemented)

To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects.
To remedy this, the loss from bounding box coordinate predictions is increased and the loss from confidence predictions for boxes that contain no object is decreased.

We use two parameters, λcoord and λnoobj to accomplish this. We set λcoord = 5 and λnoobj = .5.
Two parameters, λcoord and λnoobj, accomplish this, with λcoord = 5 and λnoobj = 0.5.

Paragraph 7 (balancing errors between large and small boxes)

Sum-squared error also equally weights errors in large boxes and small boxes.
Sum-squared error also weights errors in large boxes and small boxes equally.

Our error metric should reflect that small deviations in large boxes matter less than in small boxes.
The error metric should reflect that a small deviation matters less in a large box than in a small box.

To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.
To partially address this, the square roots of the box width and height are predicted instead of the width and height themselves.

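(The key question here: the width and height have already been normalized, so in theory they all lie between 0 and 1. Why, then, does the paper still distinguish large boxes from small ones? Normalization only divides by the image size, so a small object still has a small normalized width and a large object a large one; the difference in scale remains.
Beyond that, a few points need to be understood:

1. What exactly is being normalized?
Normalization is used twice in the project:
First, when preparing the input data set, the width and height of each ground-truth box are scaled to the range 0 to 1 before training.
Second, the network output is also in the range 0 to 1 and is denormalized afterwards to obtain the actual box.
2. What values are used when computing the loss?
The denormalized values. As the author of the code describes it, computing the loss directly on the normalized floating-point values is problematic, so they are converted back to the original scale first.
3. Then why convert back and forth at all?
The forward conversion is called norm because it plays almost the same role as normalization: regressing values that differ widely in magnitude rarely gives good results, whereas values between 0 and 1 are easier to predict.
)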
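The effect of the square-root trick can be checked numerically. The sketch below applies the same absolute deviation (0.05 in normalized width) to a large box and a small box and compares the plain squared error with the squared error on square roots; the widths 0.80 and 0.10 are arbitrary illustrative values.

```python
import math


def squared_error(pred: float, truth: float) -> float:
    """Plain squared error on the raw value."""
    return (pred - truth) ** 2


def sqrt_squared_error(pred: float, truth: float) -> float:
    """Squared error on square roots, as used for box width and height."""
    return (math.sqrt(pred) - math.sqrt(truth)) ** 2


deviation = 0.05
for name, truth in (("large box", 0.80), ("small box", 0.10)):
    pred = truth + deviation
    print(name, squared_error(pred, truth), sqrt_squared_error(pred, truth))
```

The plain squared error is identical for both boxes, whereas the square-root version penalizes the small box several times more heavily, which is the intended (partial) correction.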

Paragraph 8 (assigning a specific predictor to each object)

YOLO predicts multiple bounding boxes per grid cell.
YOLO predicts multiple bounding boxes per grid cell.

At training time we only want one bounding box predictor to be responsible for each object.
During training, only one bounding box predictor should be responsible for each object.

We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth.
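One predictor is designated "responsible" for an object according to which prediction currently has the highest IOU with the ground truth.
(The idea is sound, but it raises a question: how can the IOU be known before anything has been predicted, and how are the training targets fed in? The assignment is made on the fly: each training step runs a forward pass, compares the cell's B predicted boxes with the ground truth, and gives responsibility to the one with the highest current IOU.)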

This leads to specialization between the bounding box predictors.
This leads to specialization among the bounding box predictors.

Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.
Each predictor gets better at certain sizes, aspect ratios, or classes of object, which improves overall recall (the ability to find the targets).

For recall, see: "What are Precision and Recall?"
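The responsibility rule in this paragraph amounts to an argmax over the IOUs of a cell's B predicted boxes. A minimal sketch, assuming boxes in (x_min, y_min, x_max, y_max) form and using hypothetical function names:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def responsible_predictor(pred_boxes, truth_box) -> int:
    """Index of the predicted box with the highest current IOU against the ground truth."""
    ious = [iou(p, truth_box) for p in pred_boxes]
    return max(range(len(ious)), key=ious.__getitem__)


# Two predictions from one cell; the second matches the ground truth better.
preds = [(0, 0, 1, 1), (0, 0, 2, 2)]
print(responsible_predictor(preds, (0, 0, 2, 2)))
```

Only the selected predictor receives the coordinate loss for that object in the current training step; the choice can change as the predictions improve.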

Paragraph 9 (the loss function)

During training we optimize the following, multi-part loss function:
During training, the following multi-part loss function is optimized:

(Discussed in more detail later.)
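The formula itself did not survive in this copy. As given in the YOLOv1 paper, the loss has five sum-squared-error terms; the version below is written out from the paper, with the symbol \mathcal{L} added here only as a label. The indicator 1_ij^obj is 1 when predictor j of cell i is responsible for an object, 1_ij^noobj is its complement, and 1_i^obj is 1 when an object appears in cell i.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
         + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
    \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The first two terms are the coordinate loss (weighted by λcoord = 5), the third and fourth are the confidence loss for boxes with and without objects (the latter weighted by λnoobj = 0.5), and the last is the classification loss.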

Paragraph 10 (each loss term penalizes one factor in isolation)

Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier).
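Note that the loss function penalizes classification error only if an object is present in that grid cell (hence the conditional class probability discussed earlier).
(To spell this out: the classification term only covers errors in the conditional class probabilities. Whether an object is present at all is not penalized through the conditional probabilities. The five terms of the loss are kept cleanly separate.)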

It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).
It likewise penalizes bounding box coordinate error only if that predictor is "responsible" for the ground-truth box (i.e. has the highest IOU of any predictor in that grid cell).

(So although these quantities are related, each is handled by its own loss term, without entangling the others.)

Paragraph 11 (practical training details)

We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012.
The network is trained for about 135 epochs on the training and validation sets of PASCAL VOC 2007 and 2012.

When testing on 2012 we also include the VOC 2007 test data for training.
When testing on VOC 2012, the VOC 2007 test data is also included in training.

Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.
Throughout training, the batch size is 64, the momentum 0.9, and the decay 0.0005. (Worth remembering.)

Paragraph 12 (choosing the learning rate)

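The learning rate starts relatively small, is then raised, and is finally lowered again in stages. Starting with a high learning rate would expose training to the unstable gradients of the first iterations.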
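The schedule can be sketched as a function of the epoch. The stage values follow the paper (raise from 10^-3 to 10^-2 over the first epochs, then 10^-2 for 75 epochs, 10^-3 for 30 epochs, and 10^-4 for 30 epochs). The paper does not say how many warm-up epochs are used or how the rate is raised, so `warmup_epochs=5` and the linear ramp are assumptions made only for this example.

```python
def learning_rate(epoch: int, warmup_epochs: int = 5) -> float:
    """Piecewise learning-rate schedule modeled on the description in the paper.

    warmup_epochs and the linear ramp are assumptions; the paper only says
    the rate is raised slowly during the first epochs.
    """
    if epoch < warmup_epochs:
        # Linear ramp from 1e-3 toward 1e-2 (assumed shape).
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < warmup_epochs + 75:
        return 1e-2
    if epoch < warmup_epochs + 75 + 30:
        return 1e-3
    return 1e-4


for e in (0, 3, 10, 80, 110):
    print(e, learning_rate(e))
```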

Paragraph 13 (data augmentation)

To avoid overfitting we use dropout and extensive data augmentation.
To avoid overfitting, dropout and extensive data augmentation are used.

A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers .
A dropout layer with rate 0.5 after the first connected layer prevents co-adaptation between layers.

For data augmentation we introduce random scaling and translations of up to 20% of the original image size.
For data augmentation, random scaling and translations of up to 20% of the original image size are introduced.
("Translations" here means spatial shifts.)
This also relates to an important idea in image recognition: translation invariance.

We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.
The exposure and saturation of the image are also randomly adjusted by up to a factor of 1.5 in the HSV color space.

2.3. Inference

2.3.1 Sentence-by-sentence translation

Paragraph 1 (testing, like training, needs only one pass)

Just like in training, predicting detections for a test image only requires one network evaluation.
As in training, predicting detections for a test image requires only one network evaluation.

On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.
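On PASCAL VOC the network predicts 98 bounding boxes per image (7 × 7 cells × 2 boxes) together with class probabilities for each box. Unlike classifier-based methods, YOLO is extremely fast at test time because it needs only a single network evaluation.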

(Compare this with the train and eval modes of a torch model.)

Paragraph 2 (non-maximal suppression)

The grid design enforces spatial diversity in the bounding box predictions.
The grid design enforces spatial diversity in the bounding box predictions. (Whatever the image looks like, it only needs to be divided into grid cells.)

Often it is clear which grid cell an object falls in to and the network only predicts one box for each object.
It is usually clear which grid cell an object falls into, and the network predicts only one box per object.

However, some large objects or objects near the border of multiple cells can be well localized by multiple cells.
However, some large objects, or objects near the border of several cells, can be localized well by more than one cell.

Non-maximal suppression can be used to fix these multiple detections.
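Non-maximal suppression can be used to fix these multiple detections.
(Non-maximal suppression keeps the highest-scoring detection and suppresses the rest. For a large object, several grid cells may each believe they can predict its position, so the object ends up with several boxes. One has to be chosen, and the one with the highest confidence is kept.)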

While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.
Although non-maximal suppression is not as critical to performance as it is for R-CNN or DPM, it adds 2-3% in mAP (mean average precision).
See also: "What is mAP?"
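A common greedy form of non-maximal suppression is sketched below. It is a generic illustration, not the authors' implementation: the 0.5 IOU threshold, the (score, box) input format, and the function names are assumptions, and in practice the procedure is usually run separately for each class.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (score, box) pairs for a single class.

    Repeatedly keeps the highest-scoring detection and drops every remaining
    detection whose IOU with it exceeds iou_threshold.
    """
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best[1], d[1]) <= iou_threshold]
    return kept


dets = [
    (0.9, (0, 0, 10, 10)),    # best box for the first object
    (0.8, (1, 1, 11, 11)),    # near-duplicate of the first box
    (0.7, (20, 20, 30, 30)),  # a different object
]
print(non_max_suppression(dets))
```

In this toy input the second box overlaps the first heavily and is suppressed, while the third box belongs to a different object and is kept.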

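2.3.2 Summary

Covered here:
1. Testing is as simple as training.
2. Non-maximal suppression: an object may be predicted by several boxes, so every box other than the one with the highest confidence is suppressed.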

2.4. Limitations of YOLO

2.4.1 Sentence-by-sentence translation

Paragraph 1 (objects close to each other are hard for the model)

YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class.
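YOLO imposes strong spatial constraints on bounding box predictions, because each grid cell predicts only two boxes and can have only one class.
(In the code each cell has B = 2 sets of (x, y, w, h, c) and could in theory predict two boxes for two different objects, but there is only one set of 20 class probabilities, so both objects would have to belong to the same class.
In practice, then, each cell predicts only one object.)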

This spatial constraint limits the number of nearby objects that our model can predict.
This spatial constraint limits how many nearby objects the model can predict.
(Objects that are close together may be missed.)

Our model struggles with small objects that appear in groups, such as flocks of birds.
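The model has difficulty with small objects that appear in groups, such as flocks of birds.
(Note that "struggles with" means "has difficulty with", not "manages with effort". Even though a flock consists of a single class, each cell predicts only B = 2 boxes, so when many small birds fall into the same cell most of them cannot be detected. Nearby objects of different classes are a further problem on top of this.)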

Paragraph 2 (affected by the properties of the input image; generalization is hard to guarantee)

Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations.
Because the model learns to predict bounding boxes from data, it struggles to generalize to objects with new or unusual aspect ratios or configurations.

Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.
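The model also uses relatively coarse features to predict bounding boxes, because the architecture has several downsampling layers after the input image.
(This is presumably about the resolution lost in downsampling. It should be unrelated to DeepLabv2, which appeared in 2017, whereas YOLO appeared in 2016.)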

Paragraph 3 (large versus small bounding boxes)

Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes.
Finally, although training uses a loss function that approximates detection performance, that loss treats errors in small bounding boxes the same as errors in large ones.

A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU.
A small error in a large box is generally benign, but a small error in a small box has a much greater effect on IOU.

Our main source of error is incorrect localizations.
The main source of error is incorrect localization.
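The claim about small errors in small versus large boxes can be checked with a little geometry. For a square box of side s shifted horizontally by d (with d no larger than s), the overlap is (s - d)·s and the union is (s + d)·s, so the IOU is (s - d) / (s + d). The sketch below uses that closed form; the function name and pixel values are illustrative.

```python
def shifted_iou(side: float, shift: float) -> float:
    """IOU between a square box and the same box shifted horizontally by `shift`.

    Assumes 0 <= shift <= side, so the two boxes still overlap.
    """
    return (side - shift) / (side + shift)


# The same 2-pixel localization error applied to a small and a large box.
print("10 px box: ", round(shifted_iou(10, 2), 3))
print("100 px box:", round(shifted_iou(100, 2), 3))
```

The same 2-pixel shift costs the small box about a third of its IOU but barely affects the large box.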

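2.4.2 Summary: the limitations of the model

The limitations are as follows:

1. Objects close to each other are predicted poorly. This follows from the output structure: each grid cell can have B bounding boxes, but they must all share one class, which is a strong restriction and limits their practical usefulness.
2. Because box sizes are learned from data, the model is sensitive to objects with new or unusual aspect ratios (for example, a stretched image).
3. Large and small bounding boxes differ in how sensitive they are to a deviation of the same size. The authors already compensate for this in the loss function; the limitation stated in the paper is presumably that this compensation is not fully effective, so incorrect box localization remains the model's main source of error.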

3. Comparison to Other Detection Systems

3.1 Sentence-by-sentence translation

Paragraph 1 (roughly: earlier systems all work in two stages)

Object detection is a core problem in computer vision.（這里其實(shí)就是我們常說的CV）
目標(biāo)檢測(cè)是計(jì)算機(jī)視覺中的一個(gè)核心問題。

Detection pipelines generally start by extracting a set of robust features from input images (Haar [25], SIFT [23], HOG [4], convolutional features [6]).
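Paraphrase: detection pipelines usually begin by extracting a set of robust features from the input image. (The works cited in the parentheses all do this.)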

Then, classifiers [36, 21, 13, 10] or localizers [1, 32] are used to identify objects in the feature space.
Paraphrase: classifiers [36, 21, 13, 10] or localizers [1, 32] then identify objects in that feature space.

These classifiers or localizers are run either in sliding window fashion over the whole image or on some subset of regions in the image [35, 15, 39].
Paraphrase: these classifiers or localizers are applied either as a sliding window over the whole image or on a subset of image regions [35, 15, 39].

We compare the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.
Paraphrase: the authors compare YOLO with several top detection frameworks and point out the key similarities and differences. (That is, YOLO is set against each of them to see what is shared and what differs.)

Paragraph 2 (comparison with DPM)

Deformable parts models (DPM) use a sliding window approach to object detection[10].
Paraphrase: deformable parts models (DPM) detect objects with a sliding-window approach [10].

DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high scoring regions, etc.
Our system replaces all of these disparate parts with a single convolutional neural network.
The network performs feature extraction, bounding box prediction, nonmaximal suppression, and contextual reasoning all concurrently.
Instead of static features, the network trains the features in-line and optimizes them for the detection task.
Our unified architecture leads to a faster, more accurate model than DPM.
(These are all earlier lines of work; I will come back to them later and look at YOLO's own contributions first.)

4. Experiments

4.0 Preliminaries

4.0.1 Sentence-by-sentence reading

First we compare YOLO with other real-time detection systems on PASCAL VOC 2007.
Paraphrase: first, YOLO is compared with other real-time detection systems on PASCAL VOC 2007.

To understand the differences between YOLO and R-CNN variants we explore the errors on VOC 2007 made by YOLO and Fast R-CNN, one of the highest performing versions of R-CNN [14].
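Paraphrase: to understand how YOLO differs from the R-CNN variants, the authors examine the errors that YOLO and Fast R-CNN (one of the best-performing versions of R-CNN) make on VOC 2007.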

Based on the different error profiles we show that YOLO can be used to rescore Fast R-CNN detections and reduce the errors from background false positives, giving a significant performance boost.
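Paraphrase: because the two systems have different error profiles, YOLO can be used to rescore Fast R-CNN detections and cut the errors caused by background false positives, which gives a significant performance boost.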

We also present VOC 2012 results and compare mAP to current state-of-the-art methods.
Paraphrase: the authors also report results on VOC 2012 and compare the mAP with current state-of-the-art methods.

Finally, we show that YOLO generalizes to new domains better than other detectors on two artwork datasets.
Paraphrase: finally, on two artwork datasets, YOLO is shown to generalize to new domains better than other detectors.

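4.0.2 Summary

1. The experiments are evaluated on the VOC 2007 and VOC 2012 datasets.
2. Because YOLO does not match the state of the art in accuracy, the authors choose a rather shrewd framing: YOLO makes fewer errors on background, so it is more useful in practice.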

4.1 Comparison to Other Real-Time Systems (compared with recent work)

Paragraph 1 (few earlier systems are truly real-time; the authors also compare against non-real-time systems to judge whether the speed gained is worth the mAP lost)

Many research efforts in object detection focus on making standard detection pipelines fast. [5] [38] [31] [14] [17] [28]
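Paraphrase: much object detection research concentrates on making standard detection pipelines fast.
(That is, earlier work sped up existing models without breaking away from their architecture.)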

However, only Sadeghi et al. actually produce a detection system that runs in real-time (30 frames per second or better) [31].
Paraphrase: only Sadeghi et al. actually built a detection system that runs in real time, meaning 30 frames per second or better.

We compare YOLO to their GPU implementation of DPM which runs either at 30Hz or 100Hz.
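Paraphrase: YOLO is compared with their GPU implementation of DPM, which runs at either 30 Hz or 100 Hz.
(That is, YOLO is measured against the DPM variants that reach 30 and 100 frames per second.)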

While the other efforts don’t reach the real-time milestone we also compare their relative mAP and speed to examine the accuracy-performance tradeoffs available in object detection systems.
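Paraphrase: the other systems do not reach the real-time threshold, but their relative mAP and speed are still compared to examine the trade-offs between accuracy and speed available in object detection.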

Paragraph 2 (Fast YOLO shows how fast YOLO really is)

Fast YOLO is the fastest object detection method on PASCAL; as far as we know, it is the fastest extant object detector.
Paraphrase: Fast YOLO is the fastest detection method on PASCAL and, to the authors' knowledge, the fastest object detector in existence.

With 52.7% mAP, it is more than twice as accurate as prior work on real-time detection.
Paraphrase: at 52.7% mAP, it is more than twice as accurate as earlier real-time detectors.

YOLO pushes mAP to 63.4% while still maintaining real-time performance.
Paraphrase: YOLO raises mAP to 63.4% and still runs in real time.

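Paragraph 3 (the VGG-16 variant is more accurate but slower)

We also train YOLO using VGG-16. This model is more accurate but also significantly slower than YOLO.
Paraphrase: the authors also train YOLO with VGG-16 as the backbone. That model is more accurate than YOLO but noticeably slower.

It is useful for comparison to other detection systems that rely on VGG-16 but since it is slower than real-time the rest of the paper (singular "paper": it refers to this paper specifically) focuses on our faster models.
Paraphrase: the VGG-16 variant is useful for comparing against other detection systems built on VGG-16, but because it is slower than real time, the rest of the paper concentrates on the faster models.

(A note: the YOLOv1 code follows the same pattern of convolution, normalization, activation, and pooling as VGG-16. It is not shallower, though: the paper describes 24 convolutional layers followed by 2 fully connected layers, against 16 weight layers in VGG-16, with many of YOLO's layers being 1×1 reduction layers.)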

Paragraph 4 (R-CNN minus R falls short)

R-CNN minus R replaces Selective Search with static bounding box proposals [20].
Paraphrase: R-CNN minus R swaps Selective Search for static bounding box proposals [20].

While it is much faster than R-CNN, it still falls short of real-time and takes a significant accuracy hit from not having good proposals.
Paraphrase: it is much faster than R-CNN, but it is still not real-time, and without good proposals its accuracy drops considerably.

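Paragraph 5 (Fast R-CNN is faster but still far from real-time)

Fast R-CNN speeds up the classification stage of R-CNN but it still relies on selective search which can take around 2 seconds per image to generate bounding box proposals.
Paraphrase: Fast R-CNN accelerates the classification stage of R-CNN, but it still depends on selective search, which can take about 2 seconds per image to generate bounding box proposals.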

Thus it has high mAP but at 0.5 fps it is still far from realtime.
Paraphrase: so its mAP is high, but at 0.5 fps it is nowhere near real-time.
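
The 0.5 fps figure follows directly from the 2 seconds per image spent on proposals. A minimal sketch of the arithmetic, using the 30 fps real-time threshold quoted earlier in this section; the function names are introduced here for illustration.

```python
REAL_TIME_FPS = 30.0  # real-time threshold quoted in the paper


def fps_from_latency(seconds_per_image):
    """Convert per-image latency into frames per second."""
    return 1.0 / seconds_per_image


def is_real_time(fps, threshold=REAL_TIME_FPS):
    """True when the frame rate meets the real-time threshold."""
    return fps >= threshold


proposal_seconds = 2.0  # approximate selective search time per image
fps_bound = fps_from_latency(proposal_seconds)

print(fps_bound)                # 0.5
print(is_real_time(fps_bound))  # False
```

Since the proposal stage alone already takes about 2 seconds, 0.5 fps is an upper bound on the frame rate of the whole pipeline.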

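Paragraph 6 (Faster R-CNN: the VGG-16 version is more accurate than YOLO but much slower; the Zeiler-Fergus version is faster than the VGG-16 one but less accurate than YOLO)

The recent Faster R-CNN replaces selective search with a neural network to propose bounding boxes, similar to Szegedy et al. [8]
Paraphrase: the recent Faster R-CNN proposes bounding boxes with a neural network instead of selective search, similar to Szegedy et al. [8].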

In our tests, their most accurate model achieves 7 fps while a smaller, less accurate one runs at 18 fps.
Paraphrase: in the authors' tests, the most accurate Faster R-CNN model reaches 7 fps, while a smaller, less accurate one runs at 18 fps.

The VGG-16 version of Faster R-CNN is 10 mAP higher but is also 6 times slower than YOLO.
Paraphrase: the VGG-16 version of Faster R-CNN scores 10 mAP higher than YOLO but is also 6 times slower.

The Zeiler-Fergus Faster R-CNN is only 2.5 times slower than YOLO but is also less accurate.
Paraphrase: the Zeiler-Fergus Faster R-CNN is only 2.5 times slower than YOLO, but it is also less accurate.
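
The "6 times" and "2.5 times" figures can be reproduced from the frame rates. The sketch below takes the 7 fps and 18 fps values from this paragraph and assumes 45 fps for the base YOLO model, a number that comes from the paper's abstract rather than from this section; `slowdown` is a helper name introduced here.

```python
YOLO_FPS = 45.0  # assumption: base YOLO speed, taken from the paper's abstract

# Frame rates reported in this paragraph for the two Faster R-CNN variants.
faster_rcnn_fps = {
    "Faster R-CNN VGG-16": 7.0,
    "Faster R-CNN ZF": 18.0,
}


def slowdown(reference_fps, other_fps):
    """How many times slower `other` is than the reference detector."""
    return reference_fps / other_fps


for name, fps in faster_rcnn_fps.items():
    print(f"{name}: {slowdown(YOLO_FPS, fps):.1f}x slower than YOLO")
# Faster R-CNN VGG-16: 6.4x slower than YOLO
# Faster R-CNN ZF: 2.5x slower than YOLO
```

Under that assumption the ratios come out at roughly 6.4 and exactly 2.5, in line with the rounded figures in the text.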

4.2. VOC 2007 Error Analysis

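Paragraph 1

4.6 Summary

The experiments fall into several parts:
1. Compared with recent work, YOLO may give up some accuracy, but it is fast.

5. Apology

Because of other urgent projects, this study is paused for now, so the reading of YOLOv1 is unfinished.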
