Object Detection with Deep Learning: 12 Papers You Should Read to Understand Object Detection in the Deep Learning Era
Foreword
As the second article in the “Papers You Should Read” series, we are going to walk through both the history and some recent developments in a more difficult area of computer vision research: object detection. Before the deep learning era, hand-crafted features like HOG and feature pyramids were used pervasively to capture localization signals in an image. However, those methods usually didn't extend well to generic object detection, so most applications were limited to face or pedestrian detection. With the power of deep learning, we can train a network to learn which features to capture, as well as what coordinates to predict for an object. This eventually led to a boom of applications based on visual perception, such as commercial face recognition systems and autonomous vehicles. In this article, I picked 12 must-read papers for newcomers who want to study object detection. Although the most challenging part of building an object detection system hides in the implementation details, reading these papers can still give you a good high-level understanding of where the ideas come from and how object detection may evolve in the future.
As a prerequisite for reading this article, you need to know the basic idea of the convolutional neural network and common optimization methods such as gradient descent with back-propagation. It's also highly recommended to read my previous article “10 Papers You Should Read to Understand Image Classification in the Deep Learning Era” first, because many cool ideas of object detection originate from more fundamental image classification research.
2013: OverFeat
OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks
Inspired by the early success of AlexNet in the 2012 ImageNet competition, where CNN-based feature extraction defeated all hand-crafted feature extractors, OverFeat quickly introduced CNN back into the object detection area as well. The idea is very straightforward: if we can classify one image using a CNN, what about greedily scrolling through the whole image with windows of different sizes, and trying to regress and classify them one by one with a CNN? This leverages the power of CNNs for feature extraction and classification, and also bypasses the hard region proposal problem via pre-defined sliding windows. Also, since nearby convolution kernels can share part of the computation result, it is not necessary to recompute convolutions for overlapping areas, which reduces cost a lot. OverFeat is a pioneer of the one-stage object detector: it tried to combine feature extraction, location regression, and region classification in the same CNN. Unfortunately, such a one-stage approach also suffers from relatively poorer accuracy due to less prior knowledge used. Thus, OverFeat failed to spark a wave of one-stage detector research, until a much more elegant solution came out two years later.
2013: R-CNN
Region-based Convolutional Networks for Accurate Object Detection and Segmentation
Also proposed in 2013, R-CNN was a bit late compared with OverFeat. However, this region-based approach eventually led to a big wave of object detection research with its two-stage framework, i.e., a region proposal stage followed by a region classification and refinement stage.
In the above diagram, R-CNN first extracts potential regions of interest from an input image by using a technique called selective search. Selective search doesn't really try to understand the foreground object; instead, it groups similar pixels by relying on a heuristic: similar pixels usually belong to the same object. Therefore, the results of selective search have a very high probability of containing something meaningful. Next, R-CNN warps these region proposals into fixed-size images with some padding, and feeds them into the second stage of the network for more fine-grained recognition. Unlike older methods using selective search, R-CNN replaced HOG with a CNN to extract features from all region proposals in its second stage. One caveat of this approach is that many region proposals are not really a full object, so R-CNN needs to learn not only to classify the right classes, but also to reject the negative ones. To solve this problem, R-CNN treated all region proposals with an IoU overlap ≥ 0.5 with a ground-truth box as positives, and the rest as negatives.
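The IoU-based labeling rule above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the `[x1, y1, x2, y2]` box format and the helper names are assumptions made here.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes in [x1, y1, x2, y2] format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_proposal(proposal, gt_boxes, pos_thresh=0.5):
    """R-CNN-style labeling: positive if IoU >= 0.5 with any ground-truth box."""
    best = max(iou(proposal, gt) for gt in gt_boxes)
    return 1 if best >= pos_thresh else 0
```

In practice R-CNN tuned the threshold per training stage, but 0.5 is the split described above.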
Region proposal from selective search depends heavily on the similarity assumption, so it can only provide a rough estimate of location. To further improve localization accuracy, R-CNN borrowed an idea from “Deep Neural Networks for Object Detection” (aka DetectorNet) and introduced an additional bounding box regression to predict the center coordinates, width, and height of a box. This regressor became widely used in later object detectors.
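The regressor doesn't predict absolute coordinates. In the standard R-CNN parameterization, it predicts scale-invariant offsets of the ground truth relative to the proposal. A minimal sketch, assuming a `(cx, cy, w, h)` center-size box format (the function names are made up for illustration):

```python
import math

def regression_targets(proposal, gt):
    """R-CNN box regression targets; boxes are (cx, cy, w, h)."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    tx = (gx - px) / pw          # center shift, normalized by proposal size
    ty = (gy - py) / ph
    tw = math.log(gw / pw)       # log-space scale difference
    th = math.log(gh / ph)
    return tx, ty, tw, th

def apply_deltas(proposal, deltas):
    """Inverse transform: decode predicted deltas back into a box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = deltas
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))
```

The two functions are exact inverses, which is what lets the network train on the targets and decode its predictions back into image coordinates.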
However, a two-stage detector like R-CNN suffers from two big issues: 1) it's not fully convolutional, because selective search is not E2E trainable; 2) the region proposal stage is usually very slow compared with one-stage detectors like OverFeat, and running on each region proposal separately makes it even slower. Later, we will see how R-CNN evolved over time to address these two issues.
2015: Fast R-CNN
Fast R-CNN
A quick follow-up to R-CNN is to reduce the duplicate convolution over multiple region proposals. Since these region proposals all come from one image, it's natural to improve R-CNN by running the CNN over the entire image once and sharing the computation among the many region proposals. However, different region proposals have different sizes, which also results in different output feature map sizes if we use the same CNN feature extractor. These feature maps of various sizes prevent us from using fully connected layers for further classification and regression, because an FC layer only works with a fixed-size input.
Fortunately, a paper called “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition” had already solved the dynamic-scale issue for FC layers. In SPPNet, spatial pyramid pooling is introduced between the convolution layers and the FC layers to create a bag-of-words-style feature vector. This vector has a fixed size and encodes features from different scales, so the convolution layers can now take any size of image as input without worrying about the incompatibility of the FC layer. Inspired by this, Fast R-CNN proposed a similar layer called the ROI Pooling layer. This pooling layer downsamples feature maps of different sizes into a fixed-size vector. By doing so, we can use the same FC layers for classification and box regression, no matter how large or small the ROI is.
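A simplified NumPy sketch of ROI pooling follows. Real implementations also handle the spatial scale between image and feature-map coordinates and fractional bin boundaries; here the ROI is assumed to already be in feature-map cells and at least `output_size` cells per side.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=2):
    """Max-pool an ROI of any size into a fixed output_size x output_size grid.

    feature_map: (H, W) array; roi: (x1, y1, x2, y2) in feature-map coordinates.
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape
    # Split the region into roughly equal bins and take the max of each bin.
    ys = np.linspace(0, h, output_size + 1).astype(int)
    xs = np.linspace(0, w, output_size + 1).astype(int)
    out = np.zeros((output_size, output_size), dtype=feature_map.dtype)
    for i in range(output_size):
        for j in range(output_size):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out
```

Whatever the ROI size, the output is always `output_size x output_size`, which is exactly what the downstream FC layers need.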
With a shared feature extractor and the scale-invariant ROI pooling layer, Fast R-CNN can reach a similar localization accuracy while training 10~20x faster and running inference 100~200x faster. The near real-time inference and the easier E2E training protocol for the detection part make Fast R-CNN a popular choice in the industry as well.
This dense prediction over the entire image can cause trouble in computation cost, so YOLO took the bottleneck structure from GoogLeNet to avoid this issue. Another problem of YOLO is that two objects might fall into the same coarse grid cell, so it doesn't work well with small objects such as a flock of birds. Despite lower accuracy, YOLO's straightforward design and real-time inference ability made one-stage object detection popular again in research, and also a go-to solution for the industry.
2015: Faster R-CNN
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
As we introduced above, in early 2015 Ross Girshick proposed an improved version of R-CNN called Fast R-CNN by using a shared feature extractor for proposed regions. Just a few months later, Ross and his team came back with yet another improvement. This new network, Faster R-CNN, is not only faster than previous versions but also marks a milestone for object detection with deep learning.
With Fast R-CNN, the only non-convolutional piece of the network is the selective search region proposal. As of 2015, researchers started to realize that the deep neural network is so magical that it can learn anything given enough data. So, is it possible to also train a neural network to propose regions, instead of relying on a heuristic, hand-crafted approach like selective search? Faster R-CNN followed this direction and successfully created the Region Proposal Network (RPN). Simply put, the RPN is a CNN that takes an image as input and outputs a set of rectangular object proposals, each with an objectness score. The paper originally used VGG, but other backbone networks such as ResNet became more widespread later. To generate region proposals, a 3x3 sliding window is applied over the CNN feature map output to generate 2 scores (foreground and background) and 4 coordinates for each location. In practice, this sliding window is implemented with a 3x3 convolution followed by sibling 1x1 convolutions.
Although the sliding window has a fixed size, our objects may appear at different scales. Therefore, Faster R-CNN introduced a technique called the anchor box. Anchor boxes are pre-defined prior boxes with different aspect ratios and sizes that share the same central location. In Faster R-CNN there are k = 9 anchors for each sliding window location, covering 3 aspect ratios at each of 3 scales. These repeated anchor boxes over different scales bring nice translation-invariance and scale-invariance features to the network while sharing the outputs of the same feature map. Note that the bounding box regression is computed relative to these anchor boxes instead of the whole image.
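Generating the k = 9 anchors at one location can be sketched as follows. The 128/256/512 scales and 1:2, 1:1, 2:1 ratios are the paper's defaults; the center-size box format is a choice made here for clarity.

```python
import math

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchor boxes (cx, cy, w, h)
    sharing the same center, as in Faster R-CNN."""
    boxes = []
    for s in scales:
        for r in ratios:
            # Keep the anchor area s*s constant while varying aspect ratio w/h = r.
            w = s * math.sqrt(r)
            h = s / math.sqrt(r)
            boxes.append((cx, cy, w, h))
    return boxes
```

Running this at every sliding-window position of the feature map yields the dense anchor grid that the 2k scores and 4k regression outputs refer to.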
So far, we discussed the new Region Proposal Network that replaces the old selective search region proposal. To make the final detection, Faster R-CNN uses the same detection head from Fast R-CNN to do classification and fine-grained localization. Do you remember that Fast R-CNN also uses a shared CNN feature extractor? Now that the RPN itself is also a feature-extraction CNN, we can just share it with the detection head like in the diagram above. This sharing design doesn't come without trouble, though. If we train the RPN and the Fast R-CNN detector together, we will treat RPN proposals as a constant input to ROI pooling, and inevitably ignore the gradients of RPN's bounding box proposals. One workaround is called alternating training, where you train the RPN and Fast R-CNN in turns. And later, in the paper “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, we can see that the ROI pooling layer can also be made differentiable w.r.t. the proposed box coordinates.
2015: YOLO v1
You Only Look Once: Unified, Real-Time Object Detection
While the R-CNN series sparked big hype over two-stage object detection in the research community, its complicated implementation brought many headaches for the engineers who maintained it. Does object detection need to be so cumbersome? If we are willing to sacrifice a bit of accuracy, can we trade for much faster speed? With these questions, Joseph Redmon submitted a network called YOLO to arxiv.org only four days after Faster R-CNN's submission, and finally brought popularity back to one-stage object detection two years after OverFeat's debut.
Unlike R-CNN, YOLO decided to tackle region proposal and region classification together in the same CNN. In other words, it treats object detection as a regression problem, instead of a classification problem relying on region proposals. The general idea is to split the input into an SxS grid and have each cell directly regress the bounding box location and a confidence score if the object center falls into that cell. Because objects may have different sizes, there is more than one bounding box regressor per cell. During training, the regressor with the highest IOU is assigned to compare with the ground-truth label, so regressors at the same location learn to handle different scales over time. In the meantime, each cell also predicts C class probabilities, conditioned on the grid cell containing an object (a high confidence score). This approach was later described as dense prediction, because YOLO tries to predict classes and bounding boxes for all possible locations in an image. In contrast, R-CNN relies on region proposals to filter out background regions, so its final predictions are much more sparse.
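The "responsible cell" rule can be sketched like this. S = 7 and normalized [0, 1] center coordinates follow the paper's setup; the helper itself is illustrative.

```python
def responsible_cell(box_center, S=7):
    """Return the (row, col) of the grid cell responsible for an object
    whose center (x, y) is in normalized [0, 1] image coordinates."""
    x, y = box_center
    col = min(int(x * S), S - 1)   # clamp so x == 1.0 stays inside the grid
    row = min(int(y * S), S - 1)
    return row, col
```

Only this one cell's regressors and class probabilities are penalized for the object during training, which is why two objects sharing a cell cause trouble.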
2015: SSD
SSD: Single Shot MultiBox Detector
YOLO v1 demonstrated the potential of one-stage detection, but the performance gap compared to two-stage detection was still noticeable. In YOLO v1, multiple objects could be assigned to the same grid cell. This was a big challenge when detecting small objects, and it became a critical problem to solve in order to bring a one-stage detector's performance on par with two-stage detectors. SSD is such a challenger, and it attacks this problem from three angles.
First, the anchor box technique from Faster R-CNN can alleviate this problem. Objects in the same area usually have different aspect ratios. Introducing anchor boxes not only increased the number of objects that can be detected per cell, but also helped the network better differentiate overlapping small objects with this aspect-ratio assumption.
SSD went further down this road by aggregating multi-scale features before detection. This is a very common approach to pick up fine-grained local features while preserving coarse global features in a CNN. For example, FCN, the pioneer of CNN semantic segmentation, also merged features from multiple levels to refine the segmentation boundary. Besides, multi-scale feature aggregation can be easily performed on all common classification networks, so it's very convenient to swap out the backbone for another network.
Finally, SSD leveraged a large amount of data augmentation, especially targeted at small objects. For example, images are randomly expanded to a much larger size before random cropping, which brings a zoom-out effect to the training data to simulate small objects. Also, large bounding boxes are usually easy to learn. To avoid these easy examples dominating the loss, SSD adopted a hard negative mining technique to pick the anchor boxes with the highest loss among the negatives.
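A sketch of SSD-style hard negative mining, which keeps all positives plus only the highest-loss negatives up to a 3:1 negative:positive ratio. The list-based interface here is a simplification of the per-anchor loss tensors used in practice.

```python
def hard_negative_mining(losses, labels, neg_pos_ratio=3):
    """Select indices to keep: all positives, plus the highest-loss
    negatives up to neg_pos_ratio * num_positives (SSD-style)."""
    pos = [i for i, l in enumerate(labels) if l == 1]
    neg = [i for i, l in enumerate(labels) if l == 0]
    neg.sort(key=lambda i: losses[i], reverse=True)   # hardest negatives first
    return pos + neg[:neg_pos_ratio * len(pos)]
```

Easy background anchors, which vastly outnumber positives, are simply dropped from the loss instead of drowning it out.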
2016: FPN
Feature Pyramid Networks for Object Detection
With the launch of Faster R-CNN, YOLO, and SSD in 2015, it seemed like the general structure of an object detector had been determined. Researchers started to look at improving each individual part of these networks. Feature Pyramid Networks is an attempt to improve the detection head by using features from different layers to form a feature pyramid. This feature pyramid idea isn't very novel in computer vision research. Back when features were still manually designed, feature pyramids were already a very effective way to recognize patterns at different scales. Using a feature pyramid in deep learning is also not a new idea: SPPNet, FCN, and SSD all demonstrated the benefit of aggregating multi-layer features before classification. However, how to share the feature pyramid between the RPN and the region-based detector was still yet to be determined.
First, to rebuild the RPN with an FPN structure like the diagram above, we need to have the region proposal running on multiple different scales of feature output. Also, we now only need 3 anchors with different aspect ratios per location, because objects of different sizes will be handled by different levels of the feature pyramid. Next, to use an FPN structure in the Fast R-CNN detector, we also need to adapt it to detect on multiple scales of feature maps. Since region proposals might have different scales too, we should use them at the corresponding level of the FPN as well. In short, if Faster R-CNN is a pair of an RPN and a region-based detector running on one scale, FPN converts it into multiple parallel branches running on different scales and collects the final results from all branches in the end.
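The FPN paper assigns an ROI of width w and height h to pyramid level k = floor(k0 + log2(sqrt(w·h)/224)), with k0 = 4 and 224 being the canonical ImageNet input size. A sketch; the clamping bounds are an assumption matching the common P2-P5 setup:

```python
import math

def fpn_level(box_w, box_h, k0=4, k_min=2, k_max=5):
    """Map an ROI to pyramid level P_{k_min}..P_{k_max} by its scale,
    following the FPN paper's assignment rule."""
    k = k0 + math.log2(math.sqrt(box_w * box_h) / 224)
    return int(max(k_min, min(k_max, math.floor(k))))
```

A 224x224 ROI lands on P4; halving its side drops it one level down, so smaller objects are pooled from finer feature maps.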
2016: YOLO v2
YOLO9000: Better, Faster, Stronger
While Kaiming He, Ross Girshick, and their team kept improving their two-stage R-CNN detectors, Joseph Redmon, on the other hand, was busy improving his one-stage YOLO detector. The initial version of YOLO suffered from many shortcomings: predictions based on a coarse grid brought lower localization accuracy, and the two scale-agnostic regressors per grid cell also made it difficult to recognize small, packed objects. Fortunately, we saw many great innovations in 2015 across many computer vision areas. YOLO v2 just needed to find a way to integrate them all to become better, faster, and stronger. Here are some highlights of the modifications:
在何凱明,羅斯·吉爾希克(Ross Girshick)及其團(tuán)隊(duì)不斷改進(jìn)其兩階段R-CNN探測(cè)器的同時(shí),約瑟夫·雷德蒙(Joseph Redmon)也忙于改進(jìn)其一階段YOLO探測(cè)器。 YOLO的初始版本存在許多缺點(diǎn):基于粗網(wǎng)格的預(yù)測(cè)帶來(lái)較低的定位精度,每個(gè)網(wǎng)格單元有兩個(gè)與規(guī)模無(wú)關(guān)的回歸變量,這也使得難以識(shí)別小的包裝物體。 幸運(yùn)的是,2015年我們?cè)谠S多計(jì)算機(jī)視覺(jué)領(lǐng)域看到了太多偉大的創(chuàng)新。 YOLO v2只需要找到一種方法來(lái)整合它們,使其變得更好,更快,更強(qiáng)大。 以下是修改的一些重點(diǎn):
- YOLO v2 added Batch Normalization layers from the paper “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”.
- Just like SSD, YOLO v2 also introduced Faster R-CNN's idea of anchor boxes for bounding box regression, with some customization. Instead of predicting free offsets to anchor boxes, YOLO v2 constrains the object center regression tx and ty within the responsible grid cell to stabilize early training. Also, anchor sizes are determined by a K-means clustering of the target dataset to better align with object shapes.
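YOLO v2's anchor clustering uses d(box, centroid) = 1 − IoU(box, centroid) as the distance, comparing only widths and heights as if all boxes shared a corner. A sketch; the naive initialization and plain-Python loop are simplifications of a real implementation:

```python
def iou_wh(a, b):
    """IoU of two boxes given only (w, h), assuming a shared top-left corner,
    which is how YOLO v2 clusters box shapes."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, k, iters=100):
    """Cluster (w, h) pairs with distance = 1 - IoU, as in YOLO v2."""
    centroids = boxes[:k]  # naive init; the paper doesn't prescribe one
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # Assign each box to the centroid it overlaps most (smallest 1 - IoU).
            best = max(range(k), key=lambda i: iou_wh(b, centroids[i]))
            clusters[best].append(b)
        centroids = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids
```

Using IoU instead of Euclidean distance keeps large boxes from dominating the clustering, so the resulting anchors match the dataset's typical shapes at every scale.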
- A new backbone network called Darknet is used for feature extraction. This is inspired by “Network in Network” and GoogLeNet's bottleneck structure.
- To improve the detection of small objects, YOLO v2 added a passthrough layer to merge features from an early layer. This part can be seen as a simplified version of SSD.
- Last but not least, Joseph realized that input resolution is a silver bullet for small object detection. It not only doubled the input for the backbone to 448x448 from 224x224 but also invented a multi-scale training schema, which involves different input resolutions at different periods of training.
Note that YOLO v2 also experimented with a version trained on a 9000-class hierarchical dataset, which also represents an early trial of multi-label classification in an object detector.
2017: RetinaNet
Focal Loss for Dense Object Detection
To understand why one-stage detectors are usually not as good as two-stage detectors, RetinaNet investigated the foreground-background class imbalance issue in a one-stage detector's dense predictions. Take YOLO as an example: it tries to predict classes and bounding boxes for all possible locations at the same time, so most of the outputs are matched to the negative class during training. SSD addressed this issue by online hard example mining. YOLO used an objectness score to implicitly train a foreground classifier in the early stage of training. RetinaNet thinks neither of them got to the key of the problem, so it invented a new loss function called Focal Loss to help the network learn what's important.
Focal Loss adds a power γ (they call it the focusing parameter) to the Cross-Entropy loss. Naturally, as the confidence score becomes higher, the loss value becomes much lower than with a normal Cross-Entropy. The α parameter is used to balance such a focusing effect.
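In its binary form, FL(p_t) = −α_t (1 − p_t)^γ log(p_t). A sketch with the paper's default γ = 2 and α = 0.25:

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class; y: 1 or 0.
    gamma is the focusing parameter; alpha balances classes.
    With gamma = 0 and alpha = 1 this reduces to cross-entropy."""
    p_t = p if y == 1 else 1 - p          # probability of the true class
    a_t = alpha if y == 1 else 1 - alpha
    return -a_t * (1 - p_t) ** gamma * math.log(p_t)
```

With p_t = 0.9, the (1 − p_t)^γ factor down-weights this easy example by a factor of 100 relative to plain cross-entropy, so the flood of easy background anchors no longer dominates the gradient.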
This idea is so simple that even a primary school student can understand it. So to further justify their work, they adapted the FPN model they previously proposed and created a new one-stage detector called RetinaNet. It is composed of a ResNet backbone, an FPN detection neck to channel features at different scales, and two subnets for classification and box regression as the detection head. Similar to SSD and YOLO v3, RetinaNet uses anchor boxes to cover targets of various scales and aspect ratios.
A bit of a digression: RetinaNet contrasted the COCO accuracy of a ResNeXt-101 backbone with 800 input resolution against YOLO v2, which only has a light-weight Darknet-19 backbone and 448 input resolution. This insincerity shows the team's emphasis on getting better benchmark results rather than solving a practical issue like the speed-accuracy trade-off, and it might be part of the reason RetinaNet didn't take off after its release.
2018: YOLO v3
YOLOv3: An Incremental Improvement
YOLO v3 is the last version of the official YOLO series. Following YOLO v2's tradition, YOLO v3 borrowed more ideas from previous research and ended up with an incredibly powerful one-stage detector, a monster of a model. YOLO v3 balanced the speed, accuracy, and implementation complexity pretty well. And it got really popular in the industry because of its fast speed and simple components. If you are interested, I wrote a very detailed explanation of how YOLO v3 works in my previous article "Dive Really Deep into YOLO v3: A Beginner's Guide".
Simply put, YOLO v3's success comes from its more powerful backbone feature extractor and a RetinaNet-like detection head with an FPN neck. The new backbone network Darknet-53 leveraged ResNet's skip connections to achieve an accuracy that's on par with ResNet-50 but much faster. Also, YOLO v3 ditched v2's passthrough layers and fully embraced FPN's multi-scale predictions design. Since then, YOLO v3 has finally reversed people's impression of its poor performance when dealing with small objects.
Besides, there are a few fun facts about YOLO v3. It dissed the COCO mAP 0.5:0.95 metric, and also demonstrated the uselessness of Focal Loss when using a conditioned dense prediction. The author Joseph Redmon even decided to quit computer vision research altogether a year later, out of concern over its military usage.
2019: Objects As Points
Although the image classification area has become less active recently, object detection research is still far from mature. In 2018, a paper called "CornerNet: Detecting Objects as Paired Keypoints" provided a new perspective for detector training. Since preparing anchor box targets is quite a cumbersome job, is it really necessary to use them as a prior? This new trend of ditching anchor boxes is called "anchor-free" object detection.
Inspired by the use of heat-maps in the Hourglass network for human pose estimation, CornerNet uses a heat-map generated by box corners to supervise the bounding box regression. To learn more about how heat-maps are used in the Hourglass network, you can read my previous article "Human Pose Estimation with Stacked Hourglass Network and TensorFlow".
Objects As Points, aka CenterNet, took a step further. It uses heat-map peaks to represent object centers, and the network regresses the box width and height directly from these box centers. Essentially, CenterNet uses every pixel as a grid cell. With a Gaussian-distributed heat-map, the training is also easier to converge compared with previous attempts that tried to regress bounding box size directly.
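A CenterNet-style heat-map training target can be sketched in a few lines of NumPy. Note this is a simplification: the function name is illustrative, and the paper derives the Gaussian radius from the object size rather than using a fixed sigma:

```python
import numpy as np

def center_heatmap(height, width, centers, sigma=2.0):
    """Render a Gaussian peak at each (cx, cy) object center; peaks are
    combined with an element-wise max so overlapping objects keep their
    own maxima."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width))
    for cx, cy in centers:
        peak = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, peak)
    return heatmap

hm = center_heatmap(16, 16, [(5, 7), (12, 3)])
```

The network is then trained so its output matches this map, and at inference time local maxima of the predicted map become object centers.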
The elimination of anchor boxes also has another useful side effect. Previously, we relied on the IOU (such as > 0.7) between the anchor box and the ground truth box to assign training targets. By doing so, a few neighboring anchors may all get assigned a positive target for the same object. And the network will learn to predict multiple positive boxes for the same object too. The common way to fix this issue is to use a technique called Non-maximum Suppression (NMS). It's a greedy algorithm to filter out boxes that are too close together. Now that anchors are gone and we only have one peak per object in the heat-map, there's no need to use NMS any more. Since NMS is sometimes hard to implement and slow to run, getting rid of NMS is a big benefit for applications that run in various environments with limited resources.
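For reference, here is what the greedy NMS algorithm mentioned above looks like as a minimal NumPy sketch (boxes are `[x1, y1, x2, y2]`; the helper names are illustrative):

```python
import numpy as np

def iou(box, boxes):
    """IOU of one box against an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop every
    remaining box that overlaps it beyond the IOU threshold."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_thresh]
    return keep
```

The data-dependent loop is exactly why NMS is awkward on accelerators, and why an anchor-free, one-peak-per-object design can skip it entirely.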
2019: EfficientDet
EfficientDet: Scalable and Efficient Object Detection
At the recent CVPR'20, EfficientDet showed us some more exciting developments in the object detection area. The FPN structure has been proven to be a powerful technique to improve the detection network's performance for objects at different scales. Famous detection networks such as RetinaNet and YOLO v3 all adopted an FPN neck before box regression and classification. Later, NAS-FPN and PANet (please refer to the Read More section) both demonstrated that a plain multi-layer FPN structure may benefit from more design optimization. EfficientDet continued exploring in this direction and eventually created a new neck called BiFPN. Basically, BiFPN features additional cross-layer connections to encourage feature aggregation back and forth. To justify the efficiency part of the network, BiFPN also removed some less useful connections from the original PANet design. Another innovative improvement over the FPN structure is the weighted feature fusion: BiFPN added learnable weights to feature aggregation so that the network can learn the importance of different branches.
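The paper's "fast normalized fusion" is simple enough to sketch directly. Per the paper, each branch weight is passed through a ReLU and normalized by the sum of all weights (plus a small epsilon) instead of a softmax; the function name here is illustrative:

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """BiFPN-style weighted fusion: O = sum_i (w_i / (eps + sum_j w_j)) * I_i,
    with w_i = ReLU(learnable weight). Cheaper than softmax, similar effect."""
    w = np.maximum(np.asarray(weights, dtype=np.float64), 0.0)  # ReLU
    w = w / (w.sum() + eps)                                     # normalize
    return sum(wi * f for wi, f in zip(w, features))
```

A branch whose learned weight drifts negative is clipped to zero, so the network can effectively prune an unhelpful input on its own.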
Moreover, just like what we saw in the image classification network EfficientNet, EfficientDet also introduced a principled way to scale an object detection network. The φ parameter in the scaling formula controls both the width (channels) and depth (layers) of the BiFPN neck and the detection head.
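The compound scaling rules can be written out as a small sketch. These are the published formulas; note the largest variants (D6/D7) deviate from them slightly, and the paper rounds channel widths to hardware-friendly values, which this sketch does not reproduce:

```python
def efficientdet_config(phi):
    """Compound scaling rules reported in the EfficientDet paper (sketch)."""
    return {
        "input_resolution": 512 + phi * 128,   # holds for the smaller variants
        "bifpn_width": 64 * (1.35 ** phi),     # channels, before rounding
        "bifpn_depth": 3 + phi,                # number of BiFPN layers
        "head_depth": 3 + phi // 3,            # conv layers in class/box heads
    }
```

A single integer φ thus grows resolution, neck, and head together, instead of tuning each dimension by hand.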
This new parameter results in 8 different variants of EfficientDet, from D0 to D7. A lightweight D0 variant can achieve similar accuracy to YOLO v3 while having much fewer FLOPs. A heavy-loaded D7 variant with a monstrous 1536x1536 input can even reach 53.7 AP on COCO, dwarfing all other contenders.
Read More
From R-CNN, YOLO to recent CenterNet and EfficientDet, we have witnessed most major innovations in the object detection research in the deep learning era. Aside from the above papers, I’ve also provided a list of additional papers for you to keep reading to get a deeper understanding. They either provided a different perspective for object detection or extended this area with more powerful features.
2009: DPM
Object Detection with Discriminatively Trained Part Based Models
By matching many HOG features for each deformable part, DPM was one of the most efficient object detection models before the deep learning era. Take pedestrian detection as an example: it uses a star structure to recognize the general person pattern first, then recognizes parts with different sub-filters and calculates an overall score. Even today, the idea of recognizing objects by their deformable parts is still popular after we switched from HOG features to CNN features.
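The star-structure scoring can be illustrated with a toy 1-D sketch: each part contributes its best filter response minus a deformation penalty for drifting from its anchor position, added to the root filter's score. The function names and the quadratic penalty weight are illustrative simplifications of the real 2-D model:

```python
import numpy as np

def part_score(responses, anchor, deform_weight=0.1):
    """Best placement of one deformable part: the part filter's response
    minus a quadratic penalty for drifting away from its anchor (1-D)."""
    positions = np.arange(len(responses))
    return float(np.max(responses - deform_weight * (positions - anchor) ** 2))

def dpm_score(root_score, part_responses, anchors):
    """Star-model score: root filter plus all best-placed parts."""
    return root_score + sum(part_score(r, a)
                            for r, a in zip(part_responses, anchors))
```

A strong response far from the anchor can still win, but only if it beats the deformation cost, which is what makes the parts "deformable" rather than rigidly placed.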
2012: Selective Search
Selective Search for Object Recognition
Like DPM, Selective Search is also not a product of the deep learning era. However, this method combined many classical computer vision approaches together and was also used in the early R-CNN detector. The core idea of Selective Search is inspired by semantic segmentation, where pixels are grouped by similarity. Selective Search uses different criteria of similarity, such as color space and SIFT-based texture, to iteratively merge similar areas together. These merged areas served as foreground predictions, followed by an SVM classifier for object recognition.
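One round of the hierarchical grouping can be sketched with just the color-similarity criterion (the paper combines several criteria and size-weights the merged histograms; this sketch uses unnormalized count histograms, which simply add when regions merge):

```python
import numpy as np

def color_similarity(h1, h2):
    """Selective Search's color similarity: sum of bin-wise histogram minima."""
    return float(np.minimum(h1, h2).sum())

def merge_most_similar(histograms):
    """One grouping step: find the most similar pair of regions and merge
    them into a single region, whose histogram is the sum of its parts."""
    pairs = [(i, j) for i in range(len(histograms))
             for j in range(i + 1, len(histograms))]
    i, j = max(pairs, key=lambda p: color_similarity(histograms[p[0]],
                                                     histograms[p[1]]))
    merged = histograms[i] + histograms[j]
    return [h for k, h in enumerate(histograms) if k not in (i, j)] + [merged]
```

Repeating this step until one region remains yields a hierarchy, and every intermediate region along the way becomes a candidate object proposal.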
2016: R-FCN
R-FCN: Object Detection via Region-based Fully Convolutional Networks
Faster R-CNN finally combined RPN and ROI feature extraction and improved the speed a lot. However, for each region proposal, we still need fully connected layers to compute the class and bounding box separately. If we have 300 ROIs, we need to repeat this 300 times, and this is also the origin of the major speed difference between one-stage and two-stage detectors. R-FCN borrowed the idea from FCN for semantic segmentation, but instead of computing a class mask, R-FCN computes position-sensitive score maps. These maps predict the probability of the object's appearance at each location, and all locations vote (average) to decide the final class and bounding box. Besides, R-FCN also used atrous convolution in its ResNet backbone, which was originally proposed in the DeepLab semantic segmentation network. To understand what atrous convolution is, please see my previous article "Witnessing the Progression in Semantic Segmentation: DeepLab Series from V1 to V3+".
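The voting step can be sketched to show why it is so cheap per ROI: there are no per-ROI fully connected layers, only pooling and averaging over precomputed maps. The function name and the simple bin rounding are illustrative:

```python
import numpy as np

def psroi_vote(score_maps, roi, k=3):
    """Position-sensitive ROI voting (sketch): split the ROI into a k x k
    grid, pool each grid cell from its own dedicated score map, then
    average the k*k votes into one class score."""
    assert score_maps.shape[0] == k * k
    x1, y1, x2, y2 = roi
    xs = np.linspace(x1, x2, k + 1).round().astype(int)
    ys = np.linspace(y1, y2, k + 1).round().astype(int)
    votes = []
    for i in range(k):          # grid rows
        for j in range(k):      # grid cols
            cell = score_maps[i * k + j,
                              ys[i]:max(ys[i + 1], ys[i] + 1),
                              xs[j]:max(xs[j + 1], xs[j] + 1)]
            votes.append(cell.mean())
    return float(np.mean(votes))
```

Each of the k*k maps specializes in one spatial part of the object (e.g. "top-left of a person"), which is how the fully convolutional network keeps localization information without per-ROI heads.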
2017: Soft-NMS
Improving Object Detection With One Line of Code
Non-maximum suppression (NMS) is widely used in anchor-based object detection networks to reduce duplicate positive proposals that are close by. More specifically, NMS iteratively eliminates candidate boxes if they have a high IOU with a more confident candidate box. This could lead to some unexpected behavior when two objects of the same class are indeed very close to each other. Soft-NMS made a small change: instead of eliminating the overlapped candidate boxes, it only scales down their confidence scores with a parameter. This scaling parameter gives us more control when tuning the localization performance, and also leads to a better precision when a high recall is also needed.
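The contrast between the two update rules is easy to see side by side. Below is the Gaussian decay variant from the Soft-NMS paper next to classic hard suppression; the function names and example numbers are illustrative:

```python
import numpy as np

def hard_nms_rescore(score, overlap, iou_thresh=0.5):
    """Classic NMS: zero out any candidate overlapping the winner too much."""
    return 0.0 if overlap > iou_thresh else score

def soft_nms_rescore(score, overlap, sigma=0.5):
    """Soft-NMS (Gaussian variant): decay the candidate's confidence
    continuously with its overlap instead of discarding it outright."""
    return score * np.exp(-(overlap ** 2) / sigma)
```

So a genuine second object that happens to overlap the winner keeps a reduced but nonzero score, instead of disappearing from the detections entirely.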
2017: Cascade R-CNN
Cascade R-CNN: Delving into High Quality Object Detection
While FPN explored how to design a better R-CNN neck to use backbone features, Cascade R-CNN investigated a redesign of the R-CNN classification and regression head. The underlying assumption is simple yet insightful: the higher the IOU criteria we use when preparing positive targets, the fewer false positive predictions the network will learn to make. However, we can't simply raise such an IOU threshold from the commonly used 0.5 to a more aggressive 0.7, because it could also lead to overwhelmingly more negative examples during training. Cascade R-CNN's solution is to chain multiple detection heads together, each relying on the bounding box proposals from the previous detection head. Only the first detection head uses the original RPN proposals. This effectively simulates an increasing IOU threshold for the latter heads.
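The trade-off being solved can be shown with toy numbers. Raising the IOU threshold on unrefined proposals starves training of positives; Cascade R-CNN instead refines boxes between heads so the later, stricter heads see better proposals:

```python
def positives_under(ious, iou_thresh):
    """Indices of proposals counted as positives at a given IOU threshold."""
    return [i for i, v in enumerate(ious) if v >= iou_thresh]

# The same (hypothetical) proposal IOUs against ground truth:
stage1 = positives_under([0.55, 0.62, 0.48, 0.71], 0.5)  # first head, IOU >= 0.5
strict = positives_under([0.55, 0.62, 0.48, 0.71], 0.7)  # naive jump to 0.7
```

Jumping straight to 0.7 leaves a single positive out of four, which is exactly the imbalance the cascaded refinement avoids.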
2017: Mask R-CNN
Mask R-CNN
Mask R-CNN is not a typical object detection network. It was designed to solve a challenging instance segmentation task, i.e., creating a mask for each object in the scene. However, Mask R-CNN showed a great extension to the Faster R-CNN framework, and in turn inspired object detection research. The main idea is to add a binary mask prediction branch after ROI pooling, alongside the existing bounding box and classification branches. Besides, to address the quantization error from the original ROI Pooling layer, Mask R-CNN also proposed a new ROI Align layer that uses bilinear image resampling under the hood. Unsurprisingly, both multi-task training (segmentation + detection) and the new ROI Align layer contribute some improvement over the bounding box benchmark.
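The key primitive behind ROI Align is ordinary bilinear interpolation at fractional coordinates, in place of ROI Pooling's coordinate rounding. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def bilinear_sample(feature_map, x, y):
    """Bilinear interpolation at fractional (x, y): blend the four nearest
    feature values weighted by proximity, instead of snapping (x, y) to the
    nearest integer cell as ROI Pooling's quantization does."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, feature_map.shape[1] - 1)
    y1 = min(y0 + 1, feature_map.shape[0] - 1)
    dx, dy = x - x0, y - y0
    return (feature_map[y0, x0] * (1 - dx) * (1 - dy)
            + feature_map[y0, x1] * dx * (1 - dy)
            + feature_map[y1, x0] * (1 - dx) * dy
            + feature_map[y1, x1] * dx * dy)
```

ROI Align samples each output bin at a few such fractional points and averages them, so small sub-pixel shifts of the ROI produce smooth changes in pooled features, which matters for pixel-accurate masks.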
2018: PANet
Path Aggregation Network for Instance Segmentation
Instance segmentation has a close relationship with object detection, so often a new instance segmentation network can also benefit object detection research indirectly. PANet aims at boosting information flow in the FPN neck of Mask R-CNN by adding an additional bottom-up path after the original top-down path. To visualize this change, we have a ↑↓ structure in the original FPN neck, and PANet makes it more like a ↑↓↑ structure before pooling features from multiple layers. Also, instead of having separate pooling for each feature layer, PANet added an "adaptive feature pooling" layer after Mask R-CNN's ROIAlign to merge (element-wise max or sum) multi-scale features.
2019: NAS-FPN
NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection
PANet's success in adapting the FPN structure drew attention from a group of NAS researchers. They used a reinforcement learning method similar to the image classification network NASNet and focused on searching for the best combination of merging cells. Here, a merging cell is the basic building block of an FPN that merges any two input feature layers into one output feature layer. The final results proved the idea that FPN could use further optimization, but the complex computer-searched structure made it too difficult for humans to understand.
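A merging cell is conceptually simple even if the searched wiring is not: bring two feature layers to a common resolution, then combine them with a binary op. The sketch below is a simplification; the search chooses which two layers feed each cell and which op combines them, and the "global pooling" attention form here is an illustrative stand-in for the paper's version:

```python
import numpy as np

def upsample2x(feat):
    # nearest-neighbor upsample so two pyramid levels share a resolution
    return np.repeat(np.repeat(feat, 2, axis=0), 2, axis=1)

def merging_cell(hi_res, lo_res, op="sum"):
    """Merge a fine feature map with a 2x-coarser one into one output layer."""
    lo_up = upsample2x(lo_res)
    if op == "sum":
        return hi_res + lo_up
    # simplified attention op: gate the fine map by the coarse map's context
    gate = 1.0 / (1.0 + np.exp(-lo_up.mean()))  # sigmoid of global-pooled feature
    return hi_res * gate + lo_up
```

The search space is just repeated applications of such cells over the pyramid levels, which is what makes it tractable for reinforcement-learning-based architecture search.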
Conclusion
Object detection is still an active research area. Although the general landscape of this field is well shaped by two-stage detectors like R-CNN and one-stage detectors such as YOLO, our best detectors are still far from saturating the benchmark metrics and still miss many targets in complicated backgrounds. At the same time, anchor-free detectors like CenterNet showed us a promising future where object detection networks can become as simple as image classification networks. Other directions of object detection, such as few-shot recognition and NAS, are still at an early stage, and we will see how they evolve in the next few years. Nevertheless, as object detection technology becomes more mature, we need to be very cautious about its adoption by the military and police. A dystopia where Terminators hunt and shoot humans with a YOLO detector is the last thing we want to see in our life.
Originally published at http://yanjia.li on Aug 9, 2020
Translated from: https://towardsdatascience.com/12-papers-you-should-read-to-understand-object-detection-in-the-deep-learning-era-3390d4a28891