目标检测经典论文——Fast R-CNN论文翻译(中英文对照版):Fast R-CNN(Ross Girshick, Microsoft Research(微软研究院))
Fast R-CNN
Ross Girshick
Microsoft Research(微软研究院)
rbg@microsoft.com
Abstract
This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3× faster, tests 10× faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
摘要
本文提出了一种快速的基于区域的卷积网络方法(Fast R-CNN)用于目标检测。Fast R-CNN建立在先前工作的基础上,使用深度卷积网络高效地对目标候选框进行分类。与之前的工作相比,Fast R-CNN采用了多项创新,在提高训练和测试速度的同时也提高了检测准确度。Fast R-CNN训练非常深的VGG16网络比R-CNN快9倍,测试时快213倍,并在PASCAL VOC 2012上得到了更高的mAP。与SPPnet相比,Fast R-CNN训练VGG16网络比其快3倍,测试速度快10倍,并且更准确。Fast R-CNN以Python和C++(使用Caffe)实现,并以MIT开源许可证发布在:https://github.com/rbgirshick/fast-rcnn。
1. Introduction
Recently, deep ConvNets [14, 16] have significantly improved image classification [14] and object detection [9, 19] accuracy. Compared to image classification, object detection is a more challenging task that requires more complex methods to solve. Due to this complexity, current approaches (e.g., [9, 11, 19, 25]) train models in multi-stage pipelines that are slow and inelegant.
1. 引言
最近,深度卷积网络[14, 16]已经显著提高了图像分类[14]和目标检测[9, 19]的准确性。与图像分类相比,目标检测是一个更具挑战性的任务,需要更复杂的方法来解决。由于这种复杂性,当前的方法(例如,[9, 11, 19, 25])采用多级pipeline的方式训练模型,既慢又不够简洁。
Complexity arises because detection requires the accurate localization of objects, creating two primary challenges. First, numerous candidate object locations (often called “proposals”) must be processed. Second, these candidates provide only rough localization that must be refined to achieve precise localization. Solutions to these problems often compromise speed, accuracy, or simplicity.
复杂性的产生是因为检测需要目标的精确定位,这就带来两个主要的难点。首先,必须处理大量候选目标位置(通常称为“proposals”)。第二,这些候选框仅提供粗略定位,必须对其进行精修才能实现精确定位。这些问题的解决方案经常会在速度、准确性或简洁性上有所妥协。
In this paper, we streamline the training process for state-of-the-art ConvNet-based object detectors [9, 11]. We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.
在本文中,我们简化了最先进的基于卷积网络的目标检测器[9, 11]的训练过程。我们提出一个单阶段训练算法,联合学习对候选框分类和修正其空间位置。
The resulting method can train a very deep detection network (VGG16 [20]) 9× faster than R-CNN [9] and 3× faster than SPPnet [11]. At runtime, the detection network processes images in 0.3s (excluding object proposal time) while achieving top accuracy on PASCAL VOC 2012 [7] with a mAP of 66% (vs. 62% for R-CNN).
最终方法能够训练非常深的检测网络(VGG16 [20]),比R-CNN [9]快9倍,比SPPnet [11]快3倍。在运行时,检测网络处理每张图像需要0.3秒(不包括候选框的生成时间),同时在PASCAL VOC 2012 [7]上以66%的mAP(R-CNN为62%)实现最高准确度。(注:所有时间均使用一个超频至875MHz的Nvidia K40 GPU测得。)
1.1. R-CNN and SPPnet
The Region-based Convolutional Network method (R-CNN) [9] achieves excellent object detection accuracy by using a deep ConvNet to classify object proposals. R-CNN, however, has notable drawbacks:
1. Training is a multi-stage pipeline. R-CNN first fine-tunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned.
2. Training is expensive in space and time. For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the VOC07 trainval set. These features require hundreds of gigabytes of storage.
3. Object detection is slow. At test-time, features are extracted from each object proposal in each test image. Detection with VGG16 takes 47s / image (on a GPU).
1.1. R-CNN与SPPnet
基于区域的卷积网络方法(R-CNN)[9]通过使用深度卷积网络来分类目标候选框,获得了很高的目标检测精度。然而,R-CNN具有明显的缺点:
1. 训练过程是多级pipeline。R-CNN首先在目标候选框上使用log损失对卷积神经网络进行fine-tune。然后,它将卷积神经网络提取的特征送入SVM。这些SVM作为目标检测器,替代通过fine-tune学习的softmax分类器。在第三个训练阶段,学习bounding-box回归器。
2. 训练在时间和空间上的开销很大。对于SVM和bounding-box回归训练,要从每个图像中的每个目标候选框提取特征,并写入磁盘。对于VOC07 trainval上的5k个图像,使用如VGG16这样非常深的网络时,这个过程需要2.5个GPU·天。这些特征需要数百GB的存储空间。
3. 目标检测速度很慢。在测试时,从每个测试图像中的每个目标候选框提取特征。用VGG16网络检测目标时,每张图像需要47秒(在GPU上)。
R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation. Spatial pyramid pooling networks (SPPnets) [11] were proposed to speed up R-CNN by sharing computation. The SPPnet method computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector extracted from the shared feature map. Features are extracted for a proposal by max-pooling the portion of the feature map inside the proposal into a fixed-size output (e.g., 6×6). Multiple output sizes are pooled and then concatenated as in spatial pyramid pooling [15]. SPPnet accelerates R-CNN by 10 to 100× at test time. Training time is also reduced by 3× due to faster proposal feature extraction.
R-CNN很慢是因为它为每个目标候选框进行一次卷积神经网络前向传播,而没有共享计算。SPPnet [11]提出通过共享计算来加速R-CNN。SPPnet计算整个输入图像的卷积特征图,然后使用从共享特征图中提取的特征向量对每个候选框进行分类。针对某个候选框的特征,是通过最大池化将该候选框内的那部分特征图转化为固定大小的输出(例如6×6)来提取的。如空间金字塔池化[15]中那样,对多种输出尺寸分别池化,然后将结果连接起来。SPPnet在测试时将R-CNN加速10到100倍。由于候选框特征提取更快,训练时间也减少了3倍。
SPPnet also has notable drawbacks. Like R-CNN, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors. Features are also written to disk. But unlike R-CNN, the fine-tuning algorithm proposed in [11] cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurprisingly, this limitation (fixed convolutional layers) limits the accuracy of very deep networks.
SPPnet也有明显的缺点。像R-CNN一样,其训练过程是一个多级pipeline,涉及提取特征、使用log损失对网络进行fine-tune、训练SVM分类器以及最后拟合bounding-box回归器。特征同样要写入磁盘。但与R-CNN不同,[11]中提出的fine-tune算法不能更新空间金字塔池化之前的卷积层。不出所料,这种局限性(固定的卷积层)限制了非常深的网络的精度。
1.2. Contributions
We propose a new training algorithm that fixes the disadvantages of R-CNN and SPPnet, while improving on their speed and accuracy. We call this method Fast R-CNN because it’s comparatively fast to train and test. The Fast R-CNN method has several advantages:
1. Higher detection quality (mAP) than R-CNN, SPPnet
2. Training is single-stage, using a multi-task loss
3. Training can update all network layers
4. No disk storage is required for feature caching
1.2. 贡献
我们提出一种新的训练算法,修正了R-CNN和SPPnet的缺点,同时提高了它们的速度和准确性。因为它能比较快地进行训练和测试,我们称之为Fast R-CNN。Fast R-CNN方法有以下几个优点:
1. 比R-CNN和SPPnet具有更高的目标检测精度(mAP)。
2. 训练是使用多任务损失的单阶段训练。
3. 训练可以更新所有网络层参数。
4. 不需要磁盘空间缓存特征。
Fast R-CNN is written in Python and C++ (Caffe [13]) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
Fast R-CNN使用Python和C++(Caffe [13])编写,以MIT开源许可证发布在:https://github.com/rbgirshick/fast-rcnn。
2. Fast R-CNN architecture and training
Fig. 1 illustrates the Fast R-CNN architecture. A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.
Figure 1. Fast R-CNN architecture. An input image and multiple regions of interest (RoIs) are input into a fully convolutional network. Each RoI is pooled into a fixed-size feature map and then mapped to a feature vector by fully connected layers (FCs). The network has two output vectors per RoI: softmax probabilities and per-class bounding-box regression offsets. The architecture is trained end-to-end with a multi-task loss.
2. Fast R-CNN架构与训练
Fast R-CNN的架构如图1所示。Fast R-CNN网络将整个图像和一组候选框作为输入。网络首先使用几个卷积层(conv)和最大池化层处理整个图像,以产生卷积特征图。然后,对于每个候选框,RoI池化层从特征图中提取固定长度的特征向量。每个特征向量被送入一系列全连接(fc)层,其最终分支成两个同级输出层:一个层输出K个类别加上1个“背景”类别的Softmax概率估计,另一个层为K个类别中的每一个类别输出四个实数值。每组4个值编码K个类别中某一个类别的修正后检测框位置。
图1. Fast R-CNN架构。输入图像和多个感兴趣区域(RoI)被输入到全卷积网络中。每个RoI被池化成固定大小的特征图,然后通过全连接层(FC)映射为特征向量。网络对每个RoI有两个输出向量:Softmax概率和每类bounding-box回归偏移量。该架构使用多任务损失进行端到端训练。
2.1. The RoI pooling layer
The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H×W (e.g., 7×7), where H and W are layer hyper-parameters that are independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).
2.1. RoI池化层
RoI池化层使用最大池化将任何有效RoI内的特征转换成具有固定空间范围H×W(例如7×7)的小特征图,其中H和W是层的超参数,独立于任何特定的RoI。在本文中,RoI是卷积特征图上的一个矩形窗口。每个RoI由四元组(r, c, h, w)定义,其指定RoI的左上角(r, c)及其高度和宽度(h, w)。
RoI max pooling works by dividing the h×w RoI window into an H ×W grid of sub-windows of approximate size h/H×w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling. The RoI layer is simply the special-case of the spatial pyramid pooling layer used in SPPnets [11] in which there is only one pyramid level. We use the pooling sub-window calculation given in [11].
RoI最大池化的工作方式是:将h×w的RoI窗口分割成H×W个子窗口网格,每个子窗口的大小约为h/H×w/W,然后对每个子窗口内的值执行最大池化,结果放入对应的输出网格单元中。与标准的最大池化一样,池化操作独立应用于每个特征图通道。RoI层只是SPPnet [11]中使用的空间金字塔池化层的特例,即只有一个金字塔层级。我们使用[11]中给出的池化子窗口计算方法。
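上述RoI最大池化可以用几行NumPy代码勾勒出来。下面只是一个示意性的草图,并非官方Caffe层的实现;子窗口边界这里采用按比例取整(floor)这一种可能的划分方式,并假设h ≥ H且w ≥ W,使每个子窗口非空。

```python
import numpy as np

def roi_max_pool(feature_map, roi, H=7, W=7):
    """Max-pool one RoI window of a conv feature map into a fixed H x W grid.

    feature_map: (C, h_img, w_img) array; roi: (r, c, h, w) giving the
    top-left corner and the height/width, in feature-map coordinates.
    Assumes h >= H and w >= W so every sub-window is non-empty.
    """
    r, c, h, w = roi
    window = feature_map[:, r:r + h, c:c + w]
    C = window.shape[0]
    out = np.full((C, H, W), -np.inf)
    # Sub-window boundaries of approximate size h/H x w/W.
    rows = np.floor(np.arange(H + 1) * h / H).astype(int)
    cols = np.floor(np.arange(W + 1) * w / W).astype(int)
    for i in range(H):
        for j in range(W):
            sub = window[:, rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
            # Pooling is applied independently to each channel.
            out[:, i, j] = sub.max(axis=(1, 2))
    return out
```

注意每个通道独立池化,且子窗口划分不重叠,这与正文描述一致;SPPnet原文中边界的floor/ceil取法略有不同。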
2.2. Initializing from pre-trained networks
We experiment with three pre-trained ImageNet [4] networks, each with five max pooling layers and between five and thirteen conv layers (see Section 4.1 for network details). When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations.
2.2. 从预训练网络初始化
我们实验了三个预训练的ImageNet [4]网络,每个网络有五个最大池化层和5至13个卷积层(网络详细信息见4.1节)。当用预训练网络初始化Fast R-CNN网络时,其经历三个变换。
First, the last max pooling layer is replaced by a RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16).
首先,最后的最大池化层被RoI池化层代替,其通过将H和W设置为与网络的第一个全连接层兼容来配置(例如,对于VGG16,H=W=7)。
Second, the network’s last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K+1 categories and category-specific bounding-box regressors).
其次,网络的最后一个全连接层和Softmax(其被训练用于1000类的ImageNet分类)被替换为前面描述的两个同级层(K+1个类别上的全连接层和Softmax,以及特定类别的bounding-box回归器)。
Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.
第三,网络被修改为接受两个数据输入:图像的列表和这些图像中的RoI的列表。
2.3. Fine-tuning for detection
Training all network weights with back-propagation is an important capability of Fast R-CNN. First, let’s elucidate why SPPnet is unable to update weights below the spatial pyramid pooling layer.
2.3. 检测任务fine-tune
用反向传播训练所有网络权重是Fast R-CNN的重要能力。首先,让我们阐明为什么SPPnet无法更新空间金字塔池化层之前的权重。
The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained. The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image. Since the forward pass must process the entire receptive field, the training inputs are large (often the entire image).
根本原因是,当每个训练样本(即RoI)来自不同的图像时,通过SPP层的反向传播是非常低效的,而这正是R-CNN和SPPnet网络的训练方式。低效是因为每个RoI可能具有非常大的感受野,通常跨越整个输入图像。由于前向传播必须处理整个感受野,训练输入会很大(通常是整个图像)。
We propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy).
我们提出了一种更有效的训练方法,利用训练期间的特征共享。在Fast R-CNN训练中,随机梯度下降(SGD)的小批量是分层采样的:首先采样N个图像,然后从每个图像采样R/N个RoI。关键的是,来自同一图像的RoI在前向和后向传播中共享计算和内存。减小N就减少了小批量的计算量。例如,当N=2、R=128时,该训练方案比从128幅不同的图像中各采样一个RoI(即R-CNN和SPPnet的策略)大约快64倍。
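上述分层采样方案(先采N个图像,再从每个图像采R/N个RoI)可以示意如下;其中的辅助函数及其参数均为说明用的假设,并非论文开源代码中的接口。

```python
import numpy as np

def sample_hierarchical_minibatch(num_images, rois_per_image, N=2, R=128, seed=0):
    """Hierarchically sample one SGD mini-batch: first N images, then R/N
    RoIs from each, so RoIs of one image share conv computation and memory.

    rois_per_image[i] is the number of candidate RoIs image i has.
    Returns a list of (image_index, roi_indices) pairs.
    """
    rng = np.random.default_rng(seed)
    image_ids = rng.choice(num_images, size=N, replace=False)
    per_image = R // N  # 64 RoIs per image for N = 2, R = 128
    return [(int(i), rng.choice(rois_per_image[i], size=per_image, replace=False))
            for i in image_ids]
```

这样一个小批量中只有N=2张图像需要做前向/反向卷积,同图像的64个RoI共享这份计算,这正是正文所述约64倍加速的来源。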
One concern over this strategy is that it may cause slow training convergence because RoIs from the same image are correlated. This concern does not appear to be a practical issue and we achieve good results with N = 2 and R = 128 using fewer SGD iterations than R-CNN.
对这个策略的一个担忧是,由于来自同一图像的RoI是相关的,它可能导致训练收敛变慢。这个问题在实践中似乎并不存在:当N=2、R=128时,我们使用比R-CNN更少的SGD迭代就获得了良好的结果。
In addition to hierarchical sampling, Fast R-CNN uses a streamlined training process with one fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors, rather than training a softmax classifier, SVMs, and regressors in three separate stages [9, 11]. The components of this procedure (the loss, mini-batch sampling strategy, back-propagation through RoI pooling layers, and SGD hyper-parameters) are described below.
除了分层采样,Fast R-CNN使用了精简的训练过程:在一个fine-tune阶段中联合优化Softmax分类器和bounding-box回归器,而不是在三个独立的阶段分别训练Softmax分类器、SVM和回归器[9, 11]。下面描述该过程的各个组成部分(损失、小批量采样策略、通过RoI池化层的反向传播以及SGD超参数)。
Multi-task loss. A Fast R-CNN network has two sibling output layers. The first outputs a discrete probability distribution (per RoI), p = (p0, …, pK), over K + 1 categories. As usual, p is computed by a softmax over the K + 1 outputs of a fully connected layer. The second sibling layer outputs bounding-box regression offsets, t^k = (t^k_x, t^k_y, t^k_w, t^k_h), for each of the K object classes, indexed by k. We use the parameterization for t^k given in [9], in which t^k specifies a scale-invariant translation and log-space height/width shift relative to an object proposal.
多任务损失。Fast R-CNN网络具有两个同级输出层。第一个输出在K+1个类别上的离散概率分布(每个RoI),p = (p0, …, pK)。与通常一样,p由全连接层的K+1个输出经Softmax计算得到。第二个同级层为K个目标类别中的每一个类别(以k为索引)输出bounding-box回归偏移t^k = (t^k_x, t^k_y, t^k_w, t^k_h)。我们使用[9]中给出的参数化方法,其中t^k指定相对于候选框的尺度不变平移和对数空间中的高度/宽度偏移。
Each training RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v. We use a multi-task loss L on each labeled RoI to jointly train for classification and bounding-box regression:

L(p, u, t^u, v) = Lcls(p, u) + λ[u ≥ 1]Lloc(t^u, v)    (1)

in which Lcls(p, u) = −log pu is log loss for true class u.
每个训练RoI都标注了类别真值u和bounding-box回归目标真值v。我们对每个标注的RoI使用多任务损失L来联合训练分类和bounding-box回归:

L(p, u, t^u, v) = Lcls(p, u) + λ[u ≥ 1]Lloc(t^u, v)    (1)

其中Lcls(p, u) = −log pu是类别真值u的log损失。
The second task loss, Lloc, is defined over a tuple of true bounding-box regression targets for class u, v = (vx, vy, vw, vh), and a predicted tuple t^u = (t^u_x, t^u_y, t^u_w, t^u_h), again for class u. The Iverson bracket indicator function [u ≥ 1] evaluates to 1 when u ≥ 1 and 0 otherwise. By convention the catch-all background class is labeled u = 0. For background RoIs there is no notion of a ground-truth bounding box and hence Lloc is ignored. For bounding-box regression, we use the loss

Lloc(t^u, v) = Σ_{i∈{x,y,w,h}} smoothL1(t^u_i − vi)    (2)

in which

smoothL1(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise,    (3)

is a robust L1 loss that is less sensitive to outliers than the L2 loss used in R-CNN and SPPnet. When the regression targets are unbounded, training with L2 loss can require careful tuning of learning rates in order to prevent exploding gradients. Eq. 3 eliminates this sensitivity.
第二个任务损失Lloc定义在类别u的bounding-box回归目标真值元组v = (vx, vy, vw, vh)和同样针对类别u的预测元组t^u = (t^u_x, t^u_y, t^u_w, t^u_h)之上。Iverson括号指示函数[u ≥ 1]在u ≥ 1时取值为1,否则为0。按照惯例,“背景”类标记为u=0。对于背景RoI,不存在检测框真值的概念,因此Lloc被忽略。对于检测框回归,我们使用损失:

Lloc(t^u, v) = Σ_{i∈{x,y,w,h}} smoothL1(t^u_i − vi)    (2)

其中:

smoothL1(x) = 0.5x²(若|x| < 1),否则为|x| − 0.5    (3)

是鲁棒的L1损失,对异常值的敏感度低于R-CNN和SPPnet中使用的L2损失。当回归目标无界时,使用L2损失的训练可能需要仔细调整学习率以防止梯度爆炸。公式(3)消除了这种敏感性。
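上面的smooth L1损失(公式(3))及其在四个框坐标上的求和可以直接写成代码;下面的草图假设回归目标已按[9]的方式参数化:

```python
import numpy as np

def smooth_l1(x):
    """Eq. 3: 0.5 x^2 if |x| < 1, |x| - 0.5 otherwise (robust L1 loss)."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def l_loc(t_u, v):
    """Eq. 2: sum of the smooth L1 loss over the four offsets (x, y, w, h)."""
    return float(smooth_l1(np.asarray(t_u, float) - np.asarray(v, float)).sum())
```

两段分支在|x| = 1处连续衔接(都等于0.5),大偏差时梯度恒为±1,这正是它比L2损失对离群回归目标更稳定的原因。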
The hyper-parameter λ in Eq. 1 controls the balance between the two task losses. We normalize the ground-truth regression targets vi to have zero mean and unit variance. All experiments use λ = 1.
公式(1)中的超参数λ控制两个任务损失之间的平衡。我们将回归目标真值vi归一化为零均值、单位方差。所有实验都使用λ=1。
We note that [6] uses a related loss to train a class-agnostic object proposal network. Different from our approach, [6] advocates for a two-network system that separates localization and classification. OverFeat [19], R-CNN [9], and SPPnet [11] also train classifiers and bounding-box localizers, however these methods use stage-wise training, which we show is suboptimal for Fast R-CNN (Section 5.1).
我们注意到[6]使用相关损失来训练一个类别无关的目标候选网络。与我们的方法不同,[6]倡导一个将定位和分类分离的双网络系统。OverFeat [19]、R-CNN [9]和SPPnet [11]也训练分类器和检测框定位器,但是这些方法使用逐级训练,我们将证明这对Fast R-CNN来说并非最优(见5.1节)。
Mini-batch sampling. During fine-tuning, each SGD mini-batch is constructed from N = 2 images, chosen uniformly at random (as is common practice, we actually iterate over permutations of the dataset). We use mini-batches of size R = 128, sampling 64 RoIs from each image. As in [9], we take 25% of the RoIs from object proposals that have intersection over union (IoU) overlap with a ground-truth bounding box of at least 0.5. These RoIs comprise the examples labeled with a foreground object class, i.e. u ≥ 1. The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5], following [11]. These are the background examples and are labeled with u = 0. The lower threshold of 0.1 appears to act as a heuristic for hard example mining [8]. During training, images are horizontally flipped with probability 0.5. No other data augmentation is used.
小批量采样。在fine-tune期间,每个SGD小批量由N=2个图像构成,均匀随机选择(如通常的做法,我们实际上迭代数据集的排列)。我们使用大小为R=128的小批量,从每个图像采样64个RoI。如[9]中那样,我们从与检测框真值的交并比(IoU)至少为0.5的候选框中获取25%的RoI。这些RoI构成用前景目标类标记的样本,即u ≥ 1。按照[11],剩余的RoI从与检测框真值的最大IoU处于区间[0.1, 0.5]的候选框中采样。这些是背景样本,用u=0标记。0.1的阈值下限似乎起到了困难样本挖掘(hard example mining)[8]的启发式作用。在训练期间,图像以0.5的概率水平翻转。不使用其他数据增强。
Back-propagation through RoI pooling layers. Backpropagation routes derivatives through the RoI pooling layer. For clarity, we assume only one image per mini-batch (N = 1), though the extension to N > 1 is straightforward because the forward pass treats all images independently.
通过RoI池化层的反向传播。反向传播的导数要通过RoI池化层。为了清楚起见,我们假设每个小批量只有一个图像(N=1),但扩展到N>1是很直接的,因为前向传播独立地处理所有图像。
Let xi ∈ R be the i-th activation input into the RoI pooling layer and let yrj be the layer’s j-th output from the r-th RoI. The RoI pooling layer computes yrj = xi*(r,j), in which i*(r, j) = argmax_{i′∈R(r,j)} xi′. R(r, j) is the index set of inputs in the sub-window over which the output unit yrj max pools. A single xi may be assigned to several different outputs yrj.
令xi∈R为RoI池化层的第i个激活输入,令yrj为该层来自第r个RoI的第j个输出。RoI池化层计算yrj = xi*(r,j),其中i*(r, j) = argmax_{i′∈R(r,j)} xi′。R(r, j)是输出单元yrj执行最大池化的子窗口内输入的索引集合。一个xi可以被分配给多个不同的输出yrj。
The RoI pooling layer’s backwards function computes the partial derivative of the loss function with respect to each input variable xi by following the argmax switches:

∂L/∂xi = Σr Σj [i = i*(r, j)] ∂L/∂yrj    (4)
RoI池化层的反向传播函数遵循argmax开关来计算损失函数关于每个输入变量xi的偏导数:

∂L/∂xi = Σr Σj [i = i*(r, j)] ∂L/∂yrj    (4)
In words, for each mini-batch RoI r and for each pooling output unit yrj, the partial derivative ∂L/∂yrj is accumulated if i is the argmax selected for yrj by max pooling. In back-propagation, the partial derivatives ∂L/∂yrj are already computed by the backwards function of the layer on top of the RoI pooling layer.
换句话说,对于每个小批量RoI r和每个池化输出单元yrj,如果i是yrj通过最大池化选择的argmax,则偏导数∂L/∂yrj被累加到∂L/∂xi上。在反向传播中,偏导数∂L/∂yrj已经由RoI池化层上一层的反向传播函数计算出来。
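上述argmax路由可以示意如下,假设前向传播为每个输出单元缓存了它所选中的输入扁平索引i*(r, j):

```python
import numpy as np

def roi_pool_backward(grad_y, argmax, num_inputs):
    """Route dL/dy back through the argmax switches of RoI max pooling.

    grad_y[r, j] is dL/dy_rj from the layer above; argmax[r, j] is the flat
    input index i*(r, j) recorded in the forward pass. A single input may
    have been selected by several outputs, so its gradients are summed."""
    grad_x = np.zeros(num_inputs)
    # Unbuffered accumulation handles repeated indices correctly.
    np.add.at(grad_x, argmax.ravel(), grad_y.ravel())
    return grad_x
```

这里用np.add.at而非普通的花式索引赋值,因为同一个输入索引可能出现多次,必须累加而不是覆盖。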
SGD hyper-parameters. The fully connected layers used for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions with standard deviations 0.01 and 0.001, respectively. Biases are initialized to 0. All layers use a per-layer learning rate of 1 for weights and 2 for biases and a global learning rate of 0.001. When training on VOC07 or VOC12 trainval we run SGD for 30k mini-batch iterations, and then lower the learning rate to 0.0001 and train for another 10k iterations. When we train on larger datasets, we run SGD for more iterations, as described later. A momentum of 0.9 and parameter decay of 0.0005 (on weights and biases) are used.
SGD超参数。用于Softmax分类和检测框回归的全连接层分别使用标准差为0.01和0.001的零均值高斯分布初始化。偏置初始化为0。所有层的权重使用1倍的全局学习率,偏置使用2倍的全局学习率,全局学习率为0.001。当在VOC07或VOC12 trainval上训练时,我们运行SGD进行30k次小批量迭代,然后将学习率降低到0.0001,再训练10k次迭代。当我们在更大的数据集上训练时,运行SGD更多的迭代次数,如后文所述。使用0.9的动量和0.0005的参数衰减(作用于权重和偏置)。
2.4. Scale invariance
We explore two ways of achieving scale invariant object detection: (1) via “brute force” learning and (2) by using image pyramids. These strategies follow the two approaches in [11]. In the brute-force approach, each image is processed at a pre-defined pixel size during both training and testing. The network must directly learn scale-invariant object detection from the training data.
2.4. 尺度不变性
我们探索两种实现尺度不变目标检测的方法:(1)通过“brute force”学习;(2)通过使用图像金字塔。这些策略遵循[11]中的两种方法。在“brute force”方法中,训练和测试期间都以预定义的像素大小处理每个图像。网络必须直接从训练数据中学习尺度不变的目标检测。
The multi-scale approach, in contrast, provides approximate scale-invariance to the network through an image pyramid. At test-time, the image pyramid is used to approximately scale-normalize each object proposal. During multi-scale training, we randomly sample a pyramid scale each time an image is sampled, following [11], as a form of data augmentation. We experiment with multi-scale training for smaller networks only, due to GPU memory limits.
相反,多尺度方法通过图像金字塔为网络提供近似的尺度不变性。在测试时,图像金字塔用于对每个候选框进行近似的尺度归一化。按照[11],作为数据增强的一种形式,在多尺度训练期间,我们每次采样图像时随机采样一个金字塔尺度。由于GPU内存限制,我们只对较小的网络进行多尺度训练。
3. Fast R-CNN detection
Once a Fast R-CNN network is fine-tuned, detection amounts to little more than running a forward pass (assuming object proposals are pre-computed). The network takes as input an image (or an image pyramid, encoded as a list of images) and a list of R object proposals to score. At test-time, R is typically around 2000, although we will consider cases in which it is larger (≈45k). When using an image pyramid, each RoI is assigned to the scale such that the scaled RoI is closest to 224² pixels in area [11].
3. Fast R-CNN检测
一旦Fast R-CNN网络fine-tune完毕,检测就相当于运行一次前向传播(假设候选框是预先计算好的)。网络将图像(或编码为图像列表的图像金字塔)和R个待打分的候选框列表作为输入。在测试时,R通常在2000左右,但我们也会考虑它更大(约45k)的情况。当使用图像金字塔时,每个RoI被指定到一个尺度,使得缩放后的RoI面积最接近224²个像素[11]。
For each test RoI r, the forward pass outputs a class posterior probability distribution p and a set of predicted bounding-box offsets relative to r (each of the K classes gets its own refined bounding-box prediction). We assign a detection confidence to r for each object class k using the estimated probability Pr(class = k | r) ≜ pk. We then perform non-maximum suppression independently for each class using the algorithm and settings from R-CNN [9].
对于每个测试RoI r,前向传播输出类别后验概率分布p和相对于r的预测检测框偏移集合(K个类别中的每个类别都得到自己的修正后检测框预测)。我们使用估计的概率Pr(class = k | r) ≜ pk为r分配每个目标类别k的检测置信度。然后,我们使用R-CNN [9]中的算法和设置,对每个类别独立执行非极大值抑制。
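对打分后的检测框逐类执行的贪心非极大值抑制可以示意如下;0.3的IoU阈值取自R-CNN的默认设置,属于这里的假设,框采用(x1, y1, x2, y2)角点表示:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy NMS: keep the highest-scoring box, suppress boxes whose IoU
    with it exceeds iou_thresh, then repeat on the survivors."""
    order = np.argsort(scores)[::-1]  # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection rectangle between box i and the remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_thresh]
    return keep
```

实际使用时,对K个类别分别以该类的得分和框调用一次即可,与正文“对每个类别独立执行”的描述一致。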
3.1. Truncated SVD for faster detection
For whole-image classification, the time spent computing the fully connected layers is small compared to the conv layers. On the contrary, for detection the number of RoIs to process is large and nearly half of the forward pass time is spent computing the fully connected layers (see Fig. 2). Large fully connected layers are easily accelerated by compressing them with truncated SVD [5, 23].
Figure 2. Timing for VGG16 before and after truncated SVD. Before SVD, fully connected layers fc6 and fc7 take 45% of the time.
3.1. 使用截断SVD实现更快的检测
对整个图像进行分类时,与卷积层相比,计算全连接层所花费的时间较少。相反,在检测时要处理的RoI数量很大,接近一半的前向传播时间用于计算全连接层(见图2)。较大的全连接层可以很容易地通过截断SVD [5, 23]压缩来加速。
图2. 截断SVD之前和之后VGG16的时间分布。在SVD之前,全连接层fc6和fc7占用45%的时间。
In this technique, a layer parameterized by the u × v weight matrix W is approximately factorized as

W ≈ U Σt V^T    (5)

using SVD. In this factorization, U is a u × t matrix comprising the first t left-singular vectors of W, Σt is a t × t diagonal matrix containing the top t singular values of W, and V is a v × t matrix comprising the first t right-singular vectors of W. Truncated SVD reduces the parameter count from uv to t(u + v), which can be significant if t is much smaller than min(u, v). To compress a network, the single fully connected layer corresponding to W is replaced by two fully connected layers, without a non-linearity between them. The first of these layers uses the weight matrix ΣtV^T (and no biases) and the second uses U (with the original biases associated with W). This simple compression method gives good speedups when the number of RoIs is large.
在这种技术中,由u×v权重矩阵W参数化的层通过SVD近似分解为:

W ≈ U Σt V^T    (5)

在这种分解中,U是由W的前t个左奇异向量组成的u×t矩阵,Σt是包含W的前t个奇异值的t×t对角矩阵,V是由W的前t个右奇异向量组成的v×t矩阵。截断SVD将参数量从uv减少到t(u+v),当t远小于min(u, v)时效果显著。为了压缩网络,对应于W的单个全连接层被两个全连接层替代,它们之间没有非线性。第一层使用权重矩阵ΣtV^T(没有偏置),第二层使用U(带有与W关联的原始偏置)。当RoI数量很大时,这种简单的压缩方法能带来很好的加速。
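上述截断SVD分解可以用NumPy的SVD勾勒出来;compress_fc是一个说明用的假设性辅助函数,它把W拆成正文描述的两个较小权重矩阵ΣtV^T和U:

```python
import numpy as np

def compress_fc(W, t):
    """Factor a u x v fc weight matrix as W ~= U_t (Sigma_t V_t^T).

    The first new layer uses Sigma_t V_t^T (t x v, no biases); the second
    uses U_t (u x t, keeping W's original biases). Parameter count drops
    from u*v to t*(u + v)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)  # s sorted descending
    U_t = U[:, :t]                   # first t left-singular vectors
    SVt = s[:t, None] * Vt[:t, :]    # Sigma_t V_t^T, t x v
    return U_t, SVt
```

压缩后一次前向计算变为y ≈ U_t @ (SVt @ x) + b,两次小矩阵乘法代替一次大矩阵乘法,在RoI很多时摊销效果明显。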
4. Main results
Three main results support this paper’s contributions:
1. State-of-the-art mAP on VOC07, 2010, and 2012
2. Fast training and testing compared to R-CNN, SPPnet
3. Fine-tuning conv layers in VGG16 improves mAP
4. 主要结果
三个主要结果支持本文的贡献:
1. 在VOC07、2010和2012上最高的mAP。
2. 与R-CNN、SPPnet相比,训练和测试速度更快。
3. fine-tune VGG16的卷积层可提升mAP。
4.1. Experimental setup
Our experiments use three pre-trained ImageNet models that are available online. The first is the CaffeNet (essentially AlexNet [14]) from R-CNN [9]. We alternatively refer to this CaffeNet as model S, for “small.” The second network is VGG_CNN_M_1024 from [3], which has the same depth as S, but is wider. We call this network model M, for “medium.” The final network is the very deep VGG16 model from [20]. Since this model is the largest, we call it model L. In this section, all experiments use single-scale training and testing (s = 600; see Section 5.2 for details).
4.1. 实验设置
我们的实验使用了三个可在线获得的预训练ImageNet模型(脚注:https://github.com/BVLC/caffe/wiki/Model-Zoo)。第一个是来自R-CNN [9]的CaffeNet(实质上是AlexNet [14])。我们将这个CaffeNet称为模型S,即“小”模型。第二个网络是来自[3]的VGG_CNN_M_1024,其深度与S相同,但更宽。我们称之为模型M,即“中等”模型。最后一个网络是来自[20]的非常深的VGG16模型。由于这个模型最大,我们称之为模型L。在本节中,所有实验都使用单尺度训练和测试(s=600,详见5.2节)。
4.2. VOC 2010 and 2012 results
On these datasets, we compare Fast R-CNN (FRCN, for short) against the top methods on the comp4 (outside data) track from the public leaderboard (Table 2, Table 3). For the NUS_NIN_c2000 and BabyLearning methods, there are no associated publications at this time and we could not find exact information on the ConvNet architectures used; they are variants of the Network-in-Network design [17]. All other methods are initialized from the same pre-trained VGG16 network.
Table 2. VOC 2010 test detection average precision (%). BabyLearning uses a network based on [17]. All other methods use VGG16. Training set key: 12: VOC12 trainval, Prop.: proprietary dataset, 12+seg: 12 with segmentation annotations, 07++12: union of VOC07 trainval, VOC07 test, and VOC12 trainval.
Table 3. VOC 2012 test detection average precision (%). BabyLearning and NUS NIN c2000 use networks based on [17]. All other methods use VGG16. Training set key: see Table 2, Unk.: unknown.
4.2. VOC 2010和2012数据集上的结果
在这些数据集上,我们将Fast R-CNN(简称FRCN)与公共排行榜(脚注:http://host.robots.ox.ac.uk:8080/leaderboard)中comp4(外部数据)赛道上的主流方法进行比较(见表2、表3)。对于NUS_NIN_c2000和BabyLearning方法,目前没有相关的出版物,我们也无法找到所用ConvNet体系结构的确切信息;它们是Network-in-Network [17]设计的变体。所有其他方法都从相同的预训练VGG16网络初始化。
表2. VOC 2010测试集检测平均精度(%)。BabyLearning使用基于[17]的网络。所有其他方法使用VGG16。训练集说明:12代表VOC12 trainval,Prop.代表专有数据集,12+seg代表带分割标注的VOC12,07++12代表VOC07 trainval、VOC07 test和VOC12 trainval的并集。
表3. VOC 2012测试集检测平均精度(%)。BabyLearning和NUS_NIN_c2000使用基于[17]的网络。所有其他方法使用VGG16。训练集说明:见表2,Unk.代表未知。
Fast R-CNN achieves the top result on VOC12 with a mAP of 65.7% (and 68.4% with extra data). It is also two orders of magnitude faster than the other methods, which are all based on the “slow” R-CNN pipeline. On VOC10, SegDeepM [25] achieves a higher mAP than Fast R-CNN (67.2% vs. 66.1%). SegDeepM is trained on VOC12 trainval plus segmentation annotations; it is designed to boost R-CNN accuracy by using a Markov random field to reason over R-CNN detections and segmentations from the O2P [1] semantic-segmentation method. Fast R-CNN can be swapped into SegDeepM in place of R-CNN, which may lead to better results. When using the enlarged 07++12 training set (see Table 2 caption), Fast R-CNN’s mAP increases to 68.8%, surpassing SegDeepM.
Fast R-CNN在VOC12上取得了最高结果,mAP为65.7%(加上额外数据为68.4%)。它也比其他方法快两个数量级,这些方法都基于“慢速”的R-CNN流程。在VOC10上,SegDeepM [25]获得了比Fast R-CNN更高的mAP(67.2%对66.1%)。SegDeepM使用VOC12 trainval训练集及分割标注进行训练,它旨在通过马尔可夫随机场对R-CNN的检测结果和来自O2P [1]语义分割方法的分割结果进行联合推理,从而提高R-CNN的精度。可以将Fast R-CNN替换进SegDeepM以取代其中的R-CNN,这可能会得到更好的结果。当使用扩大的07++12训练集(见表2标题)时,Fast R-CNN的mAP增加到68.8%,超过了SegDeepM。
4.3. VOC 2007 results
On VOC07, we compare Fast R-CNN to R-CNN and SPPnet. All methods start from the same pre-trained VGG16 network and use bounding-box regression. The VGG16 SPPnet results were computed by the authors of [11]. SPPnet uses five scales during both training and testing. The improvement of Fast R-CNN over SPPnet illustrates that even though Fast R-CNN uses single-scale training and testing, fine-tuning the conv layers provides a large improvement in mAP (from 63.1% to 66.9%). R-CNN achieves a mAP of 66.0%. As a minor point, SPPnet was trained without examples marked as “difficult” in PASCAL. Removing these examples improves Fast R-CNN mAP to 68.1%. All other experiments use “difficult” examples.
4.3. VOC 2007数据集上的结果
在VOC07数据集上,我们比较Fast R-CNN与R-CNN和SPPnet。所有方法从相同的预训练VGG16网络开始,并使用bounding-box回归。VGG16 SPPnet的结果由论文[11]的作者提供。SPPnet在训练和测试期间使用五个尺度。Fast R-CNN相对SPPnet的改进说明,即使Fast R-CNN使用单尺度训练和测试,fine-tune卷积层也为mAP带来了很大的提升(从63.1%提升到66.9%)。R-CNN的mAP为66.0%。另外需要指出的是,SPPnet训练时没有使用PASCAL中被标记为“困难”的样本。去除这些样本后,Fast R-CNN的mAP提高到68.1%。所有其他实验都使用了被标记为“困难”的样本。
4.4. Training and testing time
Fast training and testing times are our second main result. Table 4 compares training time (hours), testing rate (seconds per image), and mAP on VOC07 between Fast R-CNN, R-CNN, and SPPnet. For VGG16, Fast R-CNN processes images 146× faster than R-CNN without truncated SVD and 213× faster with it. Training time is reduced by 9×, from 84 hours to 9.5. Compared to SPPnet, Fast R-CNN trains VGG16 2.7× faster (9.5 vs. 25.5 hours) and tests 7× faster without truncated SVD or 10× faster with it. Fast R-CNN also eliminates hundreds of gigabytes of disk storage, because it does not cache features.
Table 4. Runtime comparison between the same models in Fast R-CNN, R-CNN, and SPPnet. Fast R-CNN uses single-scale mode. SPPnet uses the five scales specified in [11]. †Timing provided by the authors of [11]. Times were measured on an Nvidia K40 GPU.
4.4. 训练和测试时间
快速的训练和测试是我们的第二个主要成果。表4比较了Fast R-CNN、R-CNN和SPPnet在VOC07上的训练时间(小时)、测试速率(每张图像的秒数)和mAP。对于VGG16,不使用截断SVD时,Fast R-CNN处理图像比R-CNN快146倍,使用截断SVD时快213倍。训练时间减少了9倍,从84小时减少到9.5小时。与SPPnet相比,Fast R-CNN训练VGG16网络快2.7倍(9.5小时对25.5小时),不使用截断SVD时测试快7倍,使用截断SVD时快10倍。Fast R-CNN还省去了数百GB的磁盘存储,因为它不缓存特征。
表4. Fast R-CNN、R-CNN和SPPnet中相同模型之间的运行时间比较。Fast R-CNN使用单尺度模式。SPPnet使用[11]中指定的五个尺度。†处的时间由[11]的作者提供。时间在Nvidia K40 GPU上测得。
Truncated SVD. Truncated SVD can reduce detection time by more than 30% with only a small (0.3 percentage point) drop in mAP and without needing to perform additional fine-tuning after model compression. Fig. 2 illustrates how using the top 1024 singular values from the 25088×4096 matrix in VGG16’s fc6 layer and the top 256 singular values from the 4096×4096 fc7 layer reduces runtime with little loss in mAP. Further speed-ups are possible with smaller drops in mAP if one fine-tunes again after compression.
截断SVD。截断SVD可以将检测时间减少30%以上,而mAP只有很小的下降(0.3个百分点),并且无需在模型压缩后执行额外的fine-tune。图2显示了如何使用VGG16的fc6层中25088×4096矩阵的前1024个奇异值和fc7层中4096×4096矩阵的前256个奇异值来减少运行时间,而mAP几乎没有损失。如果在压缩之后再次fine-tune,则可以在mAP下降更小的情况下进一步提速。
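译者注:下面用NumPy给出截断SVD压缩全连接层的一个最小示意(矩阵尺寸为演示用的假设值,并非论文中25088×4096的真实尺寸)。把权重W分解为两个小矩阵后,一次全连接运算被替换为两次较小的矩阵乘法,参数量从u×v降为t(u+v)。

```python
import numpy as np

def truncated_svd_compress(W, t):
    """用截断SVD将全连接层权重 W (u×v) 分解为两个小矩阵。
    W ≈ U_t Σ_t V_t^T;压缩后参数量从 u*v 降为 t*(u+v)。"""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = np.diag(S[:t]) @ Vt[:t, :]   # 第一个小层的权重(t×v),无偏置
    W2 = U[:, :t]                     # 第二个小层的权重(u×t),沿用原层偏置
    return W1, W2

# 用小矩阵演示(论文中fc6取前1024个奇异值,fc7取前256个)
W = np.random.randn(512, 256).astype(np.float32)
W1, W2 = truncated_svd_compress(W, t=64)
x = np.random.randn(256).astype(np.float32)
y_approx = W2 @ (W1 @ x)   # 两次小矩阵乘法近似原全连接层 W @ x
```

当t取到矩阵的满秩时分解是精确的;t越小,速度提升越大,近似误差也越大。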
4.5. Which layers to fine-tune?
For the less deep networks considered in the SPPnet paper [11], fine-tuning only the fully connected layers appeared to be sufficient for good accuracy. We hypothesized that this result would not hold for very deep networks. To validate that fine-tuning the conv layers is important for VGG16, we use Fast R-CNN to fine-tune, but freeze the thirteen conv layers so that only the fully connected layers learn. This ablation emulates single-scale SPPnet training and decreases mAP from 66.9% to 61.4% (Table 5). This experiment verifies our hypothesis: training through the RoI pooling layer is important for very deep nets.
Table 5. Effect of restricting which layers are fine-tuned for VGG16. Fine-tuning ≥ fc6 emulates the SPPnet training algorithm [11], but using a single scale. SPPnet L results were obtained using five scales, at a significant (7×) speed cost.
4.5. fine-tune哪些层?
对于SPPnet论文[11]中考虑的不太深的网络,仅fine-tune全连接层似乎就足以获得良好的准确度。我们推测这一结论不适用于非常深的网络。为了验证fine-tune卷积层对于VGG16的重要性,我们使用Fast R-CNN进行fine-tune,但冻结十三个卷积层,使得只有全连接层学习。这种消融实验模拟了单尺度的SPPnet训练,将mAP从66.9%降低到61.4%(如表5所示)。这个实验验证了我们的假设:通过RoI池化层进行训练对非常深的网络是重要的。
表5. 限制fine-tune VGG16哪些层所产生的影响。fine-tune fc6及以上的层模拟了SPPnet的训练算法[11],但只使用单尺度。SPPnet L的结果使用五个尺度获得,付出了显著(7倍)的速度代价。
Does this mean that all conv layers should be fine-tuned? In short, no. In the smaller networks (S and M) we find that conv1 is generic and task independent (a well-known fact [14]). Allowing conv1 to learn, or not, has no meaningful effect on mAP. For VGG16, we found it only necessary to update layers from conv3_1 and up (9 of the 13 conv layers). This observation is pragmatic: (1) updating from conv2_1 slows training by 1.3× (12.5 vs. 9.5 hours) compared to learning from conv3_1; and (2) updating from conv1_1 over-runs GPU memory. The difference in mAP when learning from conv2_1 up was only +0.3 points (Table 5, last column). All Fast R-CNN results in this paper using VGG16 fine-tune layers conv3_1 and up; all experiments with models S and M fine-tune layers conv2 and up.
这是否意味着所有卷积层都应该进行fine-tune?简而言之,不是的。在较小的网络(S和M)中,我们发现conv1(译者注:第一个卷积层)是通用的、不依赖于特定任务的(一个众所周知的事实[14])。允许或不允许conv1学习,对mAP没有实质影响。对于VGG16,我们发现只需要更新conv3_1及以上的层(13个卷积层中的9个)。这个观察结果是出于实用考虑:(1)与从conv3_1开始更新相比,从conv2_1开始更新使训练变慢1.3倍(12.5小时对9.5小时);(2)从conv1_1开始更新会导致GPU内存不足。从conv2_1开始学习时mAP仅增加0.3个点(如表5最后一列所示)。本文中所有使用VGG16的Fast R-CNN结果都fine-tune conv3_1及以上的层;所有使用模型S和M的实验都fine-tune conv2及以上的层。
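译者注:下面用一个纯Python的小例子示意上述分层fine-tune策略的选择逻辑(层名按VGG16的惯用命名写出;在论文使用的Caffe中,对应做法是把被冻结层的lr_mult设为0,此处仅为示意)。

```python
# VGG16的13个卷积层(按惯用命名)
VGG16_CONV_LAYERS = [
    "conv1_1", "conv1_2",
    "conv2_1", "conv2_2",
    "conv3_1", "conv3_2", "conv3_3",
    "conv4_1", "conv4_2", "conv4_3",
    "conv5_1", "conv5_2", "conv5_3",
]

def split_layers(first_tuned="conv3_1"):
    """返回(冻结层, 参与fine-tune的层):first_tuned及其后的层参与训练。"""
    start = VGG16_CONV_LAYERS.index(first_tuned)
    return VGG16_CONV_LAYERS[:start], VGG16_CONV_LAYERS[start:]

frozen, tuned = split_layers("conv3_1")
# 论文的设置:13个卷积层中只更新conv3_1及以上的9层,前4层冻结
```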
5. Design evaluation
We conducted experiments to understand how Fast R-CNN compares to R-CNN and SPPnet, as well as to evaluate design decisions. Following best practices, we performed these experiments on the PASCAL VOC07 dataset.
5. 设计评估
我们通过实验来了解Fast R-CNN与R-CNN和SPPnet相比如何,并评估各项设计决策。按照最佳实践,我们在PASCAL VOC07数据集上进行了这些实验。
5.1. Does multi-task training help?
Multi-task training is convenient because it avoids managing a pipeline of sequentially-trained tasks. But it also has the potential to improve results because the tasks influence each other through a shared representation (the ConvNet) [2]. Does multi-task training improve object detection accuracy in Fast R-CNN?
5.1. 多任务训练有用吗?
多任务训练很方便,因为它避免了管理由顺序训练的任务组成的pipeline。但它也有可能改善结果,因为各任务通过共享的表示(ConvNet)[2]相互影响。多任务训练能提高Fast R-CNN中的目标检测精度吗?
To test this question, we train baseline networks that use only the classification loss, Lcls, in Eq. 1 (i.e., setting λ= 0). These baselines are printed for models S, M, and L in the first column of each group in Table 6. Note that these models do not have bounding-box regressors. Next (second column per group), we take networks that were trained with the multi-task loss (Eq. 1, λ=1), but we disable bounding-box regression at test time. This isolates the networks’ classification accuracy and allows an apples-to-apples comparison with the baseline networks.
Table 6. Multi-task training (fourth column per group) improves mAP over piecewise training (third column per group).
为了检验这个问题,我们训练仅使用公式(1)中分类损失Lcls的基准网络(即设置λ=0)。表6中每组的第一列给出了模型S、M和L的这些baseline。请注意,这些模型没有bounding-box回归器。接下来(每组的第二列),我们采用用多任务损失(公式(1),λ=1)训练的网络,但在测试时禁用bounding-box回归。这样可以隔离出网络的分类准确性,从而与基准网络进行同类比较(译者注:apples-to-apples comparison意为在同等条件下进行比较)。
表6. 多任务训练(每组第四列)比分段训练(每组第三列)改进了mAP。
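译者注:为便于理解λ的作用,下面给出论文式(1)多任务损失的一个NumPy示意实现(smooth L1的定义见论文式(3);这里的输入均为演示用的假设数值)。λ=0即退化为表6每组第一列中只有分类损失的baseline。

```python
import numpy as np

def smooth_l1(x):
    """smooth L1:|x|<1时取0.5x^2,否则取|x|-0.5。"""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def multitask_loss(p, u, t_u, v, lam=1.0):
    """式(1):L = Lcls(p,u) + λ[u≥1]Lloc(t^u, v)。
    p: softmax类别概率;u: 真实类别(0为背景);
    t_u: 类别u的预测框回归量;v: 回归目标。"""
    l_cls = -np.log(p[u])                                # 分类的log损失
    l_loc = float(np.sum(smooth_l1(np.asarray(t_u) - np.asarray(v))))
    indicator = 1.0 if u >= 1 else 0.0                   # 背景RoI不计定位损失
    return l_cls + lam * indicator * l_loc

p = np.array([0.1, 0.7, 0.2])
t_u, v = [0.2, 0.1, 0.0, 0.3], [0.0, 0.0, 0.0, 0.0]
loss_mt  = multitask_loss(p, u=1, t_u=t_u, v=v, lam=1.0)  # 多任务损失
loss_cls = multitask_loss(p, u=1, t_u=t_u, v=v, lam=0.0)  # 仅分类损失的baseline
```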
Across all three networks we observe that multi-task training improves pure classification accuracy relative to training for classification alone. The improvement ranges from +0.8 to +1.1 mAP points, showing a consistent positive effect from multi-task learning.
在所有三个网络中,我们观察到多任务训练相对于单独的分类训练提高了纯分类准确度。提升幅度从+0.8到+1.1个mAP点不等,显示出多任务学习一致的积极效果。
Finally, we take the baseline models (trained with only the classification loss), tack on the bounding-box regression layer, and train them with Lloc while keeping all other network parameters frozen. The third column in each group shows the results of this stage-wise training scheme: mAP improves over column one, but stage-wise training underperforms multi-task training (fourth column per group).
最后,我们采用baseline模型(仅使用分类损失进行训练),加上bounding-box回归层,并在保持所有其他网络参数冻结的情况下用Lloc训练它们。每组中的第三列显示了这种逐级训练方案的结果:mAP相对于第一列有所改进,但逐级训练的表现不如多任务训练(每组第四列)。
5.2. Scale invariance: to brute force or finesse?
We compare two strategies for achieving scale-invariant object detection: brute-force learning (single scale) and image pyramids (multi-scale). In either case, we define the scale s of an image to be the length of its shortest side.
5.2. 尺度不变性:暴力学习还是精细处理?
我们比较实现尺度不变目标检测的两种策略:暴力学习(单尺度)和图像金字塔(多尺度)。在任一情况下,我们都将图像的尺度s定义为其短边的长度。
All single-scale experiments use s = 600 pixels; s may be less than 600 for some images as we cap the longest image side at 1000 pixels and maintain the image’s aspect ratio. These values were selected so that VGG16 fits in GPU memory during fine-tuning. The smaller models are not memory bound and can benefit from larger values of s; however, optimizing s for each model is not our main concern. We note that PASCAL images are 384 × 473 pixels on average and thus the single-scale setting typically upsamples images by a factor of 1.6. The average effective stride at the RoI pooling layer is thus ≈ 10 pixels.
所有单尺度实验都使用s=600像素;对于一些图像,s可能小于600,因为我们在保持图像长宽比的同时将最长边限制为1000像素。选择这些值是为了使VGG16在fine-tune期间不至于GPU内存不足。较小的模型不受显存限制,可以受益于更大的s值;然而,为每个模型优化s不是我们的主要关注点。我们注意到PASCAL图像的平均大小是384×473像素,因此单尺度设置通常将图像上采样约1.6倍。因此,RoI池化层的平均有效步长约为10像素。
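译者注:上述缩放规则可以用一个小函数示意:先尝试把短边缩放到s=600,若这会使最长边超过1000像素,则改为按最长边1000来确定缩放因子(函数与变量名为译者为演示而设)。

```python
def compute_scale(h, w, target=600, max_size=1000):
    """返回保持长宽比的缩放因子:短边尽量缩放到target,
    但最长边不超过max_size(此时实际的s会小于target)。"""
    short, long_side = min(h, w), max(h, w)
    scale = target / short
    if scale * long_side > max_size:
        scale = max_size / long_side
    return scale

s1 = compute_scale(384, 473)   # PASCAL平均尺寸:短边放大到600,约1.6倍
s2 = compute_scale(300, 900)   # 长条形图像:受1000像素上限约束,s小于600
```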
In the multi-scale setting, we use the same five scales specified in [11] (s ∈ {480, 576, 688, 864, 1200}) to facilitate comparison with SPPnet. However, we cap the longest side at 2000 pixels to avoid exceeding GPU memory.
在多尺度设置中,我们使用[11]中指定的相同的五个尺度(s∈{480,576,688,864,1200}),以便于与SPPnet进行比较。但是,我们将最长边限制为2000像素,以避免GPU内存不足。
Table 7 shows models S and M when trained and tested with either one or five scales. Perhaps the most surprising result in [11] was that single-scale detection performs almost as well as multi-scale detection. Our findings confirm their result: deep ConvNets are adept at directly learning scale invariance. The multi-scale approach offers only a small increase in mAP at a large cost in compute time (Table 7). In the case of VGG16 (model L), we are limited to using a single scale by implementation details. Yet it achieves a mAP of 66.9%, which is slightly higher than the 66.0% reported for R-CNN [10], even though R-CNN uses “infinite” scales in the sense that each proposal is warped to a canonical size.
Table 7. Multi-scale vs. single scale. SPPnet ZF (similar to model S) results are from [11]. Larger networks with a single-scale offer the best speed / accuracy tradeoff. (L cannot use multi-scale in our implementation due to GPU memory constraints.)
表7显示了使用一个或五个尺度进行训练和测试时模型S和M的结果。也许[11]中最令人惊讶的结果是单尺度检测的表现几乎与多尺度检测一样好。我们的研究结果证实了他们的结论:深度卷积网络擅长直接学习尺度不变性。多尺度方法以大量计算时间为代价,仅带来了很小的mAP提升(表7)。对于VGG16(模型L),由于实现细节的限制,我们只能使用单尺度。然而,它得到了66.9%的mAP,略高于R-CNN [10]的66.0%,尽管R-CNN使用了某种意义上的“无限”尺度,即每个候选区域都被缩放为规范大小。
表7. 多尺度对比单尺度。SPPnet ZF(类似于模型S)的结果来自[11]。使用单尺度的较大网络提供了最佳的速度/精度平衡。(由于GPU内存限制,L在我们的实现中不能使用多尺度。)
Since single-scale processing offers the best tradeoff between speed and accuracy, especially for very deep models, all experiments outside of this sub-section use single-scale training and testing with s = 600 pixels.
由于单尺度处理提供了速度和精度之间的最佳平衡,特别是对于非常深的模型,本小节以外的所有实验均使用s=600像素的单尺度进行训练和测试。
5.3. Do we need more training data?
A good object detector should improve when supplied with more training data. Zhu et al. [24] found that DPM [8] mAP saturates after only a few hundred to thousand training examples. Here we augment the VOC07 trainval set with the VOC12 trainval set, roughly tripling the number of images to 16.5k, to evaluate Fast R-CNN. Enlarging the training set improves mAP on VOC07 test from 66.9% to 70.0% (Table 1). When training on this dataset we use 60k mini-batch iterations instead of 40k.
Table 1. VOC 2007 test detection average precision (%). All methods use VGG16. Training set key: 07: VOC07 trainval, 07\diff: 07 without “difficult” examples, 07+12: union of 07 and VOC12 trainval. †SPPnet results were prepared by the authors of [11].
5.3. 我们需要更多训练数据吗?
当提供更多的训练数据时,好的目标检测器的性能应该会进一步提升。Zhu等人[24]发现DPM [8]的mAP在只有几百到几千个训练样本时就饱和了。在这里,我们用VOC12 trainval训练集扩充VOC07 trainval训练集,使图像数量大约增加到三倍,达到16.5k,以此评估Fast R-CNN。扩大训练集将VOC07测试集上的mAP从66.9%提高到70.0%(表1)。在这个数据集上训练时,我们使用60k次mini-batch迭代而不是40k次。
表1. VOC 2007测试检测平均精度(%)。所有方法都使用VGG16。训练集关键字:07代表VOC07 trainval,07\diff代表去掉“困难”样本的07,07+12代表07和VOC12 trainval的并集。†SPPnet的结果由[11]的作者提供。
We perform similar experiments for VOC10 and 2012, for which we construct a dataset of 21.5k images from the union of VOC07 trainval, test, and VOC12 trainval. When training on this dataset, we use 100k SGD iterations and lower the learning rate by 0.1× each 40k iterations (instead of each 30k). For VOC10 and 2012, mAP improves from 66.1% to 68.8% and from 65.7% to 68.4%, respectively.
我们对VOC10和2012进行了类似的实验,为此我们用VOC07 trainval、VOC07 test和VOC12 trainval的并集构造了一个21.5k图像的数据集。在这个数据集上训练时,我们使用100k次SGD迭代,并且每40k次迭代(而不是每30k次)将学习率降低10倍。对于VOC10和2012,mAP分别从66.1%提高到68.8%、从65.7%提高到68.4%。
5.4. Do SVMs outperform softmax?
Fast R-CNN uses the softmax classifier learnt during fine-tuning instead of training one-vs-rest linear SVMs post-hoc, as was done in R-CNN and SPPnet. To understand the impact of this choice, we implemented post-hoc SVM training with hard negative mining in Fast R-CNN. We use the same training algorithm and hyper-parameters as in R-CNN.
5.4. SVM是否优于Softmax?
Fast R-CNN使用在fine-tune期间学习的softmax分类器,而不是像R-CNN和SPPnet那样事后再训练一对多(one-vs-rest)的线性SVM。为了理解这种选择的影响,我们在Fast R-CNN中实现了带难负样本挖掘的事后SVM训练。我们使用与R-CNN中相同的训练算法和超参数。
Table 8 shows softmax slightly outperforming SVM for all three networks, by +0.1 to +0.8 mAP points. This effect is small, but it demonstrates that “one-shot” fine-tuning is sufficient compared to previous multi-stage training approaches. We note that softmax, unlike one-vs-rest SVMs, introduces competition between classes when scoring a RoI.
Table 8. Fast R-CNN with softmax vs. SVM (VOC07 mAP).
如表8所示,对于所有三个网络,softmax都略优于SVM,mAP提高了+0.1到+0.8个点。这个提升效果很小,但它表明,与先前的多阶段训练方法相比,“一次性”fine-tune已经足够。我们注意到,与一对多的SVM不同,softmax在对RoI评分时会引入类别之间的竞争。
表8. 使用softmax的Fast R-CNN对比使用SVM的Fast R-CNN(VOC07 mAP)。
5.5. Are more proposals always better?
There are (broadly) two types of object detectors: those that use a sparse set of object proposals (e.g., selective search [21]) and those that use a dense set (e.g., DPM [8]). Classifying sparse proposals is a type of cascade [22] in which the proposal mechanism first rejects a vast number of candidates leaving the classifier with a small set to evaluate. This cascade improves detection accuracy when applied to DPM detections [21]. We find evidence that the proposal-classifier cascade also improves Fast R-CNN accuracy.
5.5. 更多的候选区域总是更好吗?
(广义上)存在两种类型的目标检测器:一类使用候选区域的稀疏集合(例如selective search [21]),另一类使用密集集合(例如DPM [8])。对稀疏候选区域进行分类是一种级联[22]:候选机制首先排除大量候选区域,只留下一个较小的集合交给分类器评估。当应用于DPM检测时,这种级联提高了检测精度[21]。我们发现有证据表明,proposal-classifier级联同样提高了Fast R-CNN的精度。
Using selective search’s quality mode, we sweep from 1k to 10k proposals per image, each time re-training and re-testing model M. If proposals serve a purely computational role, increasing the number of proposals per image should not harm mAP.
使用selective search的质量模式,我们对每张图像从1k到10k个候选框进行扫描,每次都重新训练和重新测试模型M。如果候选框纯粹起计算上的作用,那么增加每张图像的候选框数量应该不会损害mAP。
We find that mAP rises and then falls slightly as the proposal count increases (Fig. 3, solid blue line). This experiment shows that swamping the deep classifier with more proposals does not help, and even slightly hurts, accuracy.
我们发现,随着候选区域数量的增加,mAP先上升然后略微下降(如图3蓝色实线所示)。这个实验表明,用更多候选区域淹没深度分类器对准确性没有帮助,甚至会略有损害。
This result is difficult to predict without actually running the experiment. The state-of-the-art for measuring object proposal quality is Average Recall (AR) [12]. AR correlates well with mAP for several proposal methods using R-CNN, when using a fixed number of proposals per image. Fig. 3 shows that AR (solid red line) does not correlate well with mAP as the number of proposals per image is varied. AR must be used with care; higher AR due to more proposals does not imply that mAP will increase. Fortunately, training and testing with model M takes less than 2.5 hours. Fast R-CNN thus enables efficient, direct evaluation of object proposal mAP, which is preferable to proxy metrics.
Figure 3. VOC07 test mAP and AR for various proposal schemes.
如果不实际进行实验,这个结果很难预测。衡量候选区域质量的最先进指标是平均召回率(Average Recall,AR)[12]。当每张图像使用固定数量的候选区域时,对于若干使用R-CNN的候选区域方法,AR与mAP具有良好的相关性。图3表明,当每张图像的候选区域数量变化时,AR(红色实线)与mAP的相关性并不好。AR必须谨慎使用;更多的候选区域带来更高的AR,并不意味着mAP也会增加。幸运的是,使用模型M的训练和测试只需不到2.5小时。因此,Fast R-CNN能够高效地直接评估候选区域对mAP的影响,这比使用代理指标更可取。
图3. 各种候选区域方案下VOC07测试集的mAP和AR。
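译者注:为说明正文对AR的讨论,下面给出平均召回率的一个简化示意:对每个真实框取其与候选框集合的最大IoU,再在一系列IoU阈值上分别计算召回率并取平均(阈值取法与完整定义见[12],此处仅为演示)。

```python
import numpy as np

def average_recall(best_ious, thresholds=None):
    """best_ious: 每个真实框与候选框集合的最大IoU;
    在各IoU阈值下求召回率,再对阈值取平均。"""
    if thresholds is None:
        thresholds = np.arange(0.5, 1.0, 0.05)
    best_ious = np.asarray(best_ious, dtype=np.float64)
    recalls = [(best_ious >= t).mean() for t in thresholds]
    return float(np.mean(recalls))

# 候选框覆盖得越好(最大IoU越高),AR越高;
# 但如正文所述,靠增加候选框数量推高AR并不保证mAP随之提高
ar = average_recall([0.92, 0.77, 0.51, 0.40])
```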
We also investigate Fast R-CNN when using densely generated boxes (over scale, position, and aspect ratio), at a rate of about 45k boxes / image. This dense set is rich enough that when each selective search box is replaced by its closest (in IoU) dense box, mAP drops only 1 point (to 57.7%, Fig. 3, blue triangle).
我们还研究了使用密集生成框(覆盖不同尺度、位置和长宽比)时的Fast R-CNN,密集框的数量约为每张图像45k个。这个密集集合足够丰富,当每个selective search框被其最近的(按IoU计)密集框替换时,mAP只下降1个点(降到57.7%,如图3蓝色三角形所示)。
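译者注:“在不同尺度、位置和长宽比上密集生成框”可以用下面的小例子示意(网格步长、尺度与长宽比均为演示用的假设参数,并非论文中生成约45k框/图像的实际设置)。

```python
import itertools

def dense_boxes(img_w, img_h, scales=(64, 128, 256),
                ratios=(0.5, 1.0, 2.0), stride=16):
    """在规则网格的每个中心点上,枚举不同尺度和长宽比的框,
    只保留完全落在图像内的框。"""
    boxes = []
    for cx in range(0, img_w, stride):
        for cy in range(0, img_h, stride):
            for s, r in itertools.product(scales, ratios):
                w = s * r ** 0.5          # 面积为s^2、长宽比为r的框
                h = s / r ** 0.5
                x1, y1 = cx - w / 2, cy - h / 2
                x2, y2 = cx + w / 2, cy + h / 2
                if x1 >= 0 and y1 >= 0 and x2 <= img_w and y2 <= img_h:
                    boxes.append((x1, y1, x2, y2))
    return boxes

boxes = dense_boxes(640, 480)   # 网格越密、尺度和长宽比越多,框就越多
```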
The statistics of the dense boxes differ from those of selective search boxes. Starting with 2k selective search boxes, we test mAP when adding a random sample of 1000×{2,4,6,8,10,32,45} dense boxes. For each experiment we re-train and re-test model M. When these dense boxes are added, mAP falls more strongly than when adding more selective search boxes, eventually reaching 53.0%.
密集框的统计特性与selective search框的统计特性不同。从2k个selective search框开始,我们每次添加1000×{2,4,6,8,10,32,45}个随机采样的密集框,并测试mAP。对于每个实验,我们都重新训练和重新测试模型M。添加这些密集框时,mAP比添加更多selective search框时下降得更厉害,最终降到53.0%。
We also train and test Fast R-CNN using only dense boxes (45k / image). This setting yields a mAP of 52.9% (blue diamond). Finally, we check if SVMs with hard negative mining are needed to cope with the dense box distribution. SVMs do even worse: 49.3% (blue circle).
我们还只使用密集框(每张图像45k个)训练和测试了Fast R-CNN。此设置的mAP为52.9%(蓝色菱形)。最后,我们检验是否需要带难负样本挖掘的SVM来应对密集框的分布。SVM的结果更糟:49.3%(蓝色圆圈)。
5.6. Preliminary MS COCO results
We applied Fast R-CNN (with VGG16) to the MS COCO dataset [18] to establish a preliminary baseline. We trained on the 80k image training set for 240k iterations and evaluated on the “test-dev” set using the evaluation server. The PASCAL-style mAP is 35.9%; the new COCO-style AP, which also averages over IoU thresholds, is 19.7%.
5.6. MS COCO上的初步结果
我们将Fast R-CNN(使用VGG16)应用于MS COCO数据集[18],以建立一个初步的baseline。我们在80k图像的训练集上进行了240k次迭代训练,并使用评估服务器在“test-dev”集上进行评估。PASCAL风格的mAP为35.9%;新的COCO风格的AP(在多个IoU阈值上取平均)为19.7%。
6. Conclusion
This paper proposes Fast R-CNN, a clean and fast update to R-CNN and SPPnet. In addition to reporting state-of-the-art detection results, we present detailed experiments that we hope provide new insights. Of particular note, sparse object proposals appear to improve detector quality. This issue was too costly (in time) to probe in the past, but becomes practical with Fast R-CNN. Of course, there may exist yet undiscovered techniques that allow dense boxes to perform as well as sparse proposals. Such methods, if developed, may help further accelerate object detection.
6. 结论
本文提出了Fast R-CNN,它是对R-CNN和SPPnet简洁、快速的更新。除了报告目前最先进的检测结果之外,我们还提供了详细的实验,希望能提供新的见解。特别值得注意的是,稀疏目标候选区域似乎能提高检测器的质量。这个问题过去因代价太大(在时间上)而无法深入探究,但Fast R-CNN使其变得可行。当然,可能还存在尚未发现的技术,能使密集框达到与稀疏候选框同样好的效果。如果这样的方法被开发出来,可能有助于进一步加速目标检测。
Acknowledgements. I thank Kaiming He, Larry Zitnick, and Piotr Dollár for helpful discussions and encouragement.
致谢:感谢Kaiming He、Larry Zitnick和Piotr Dollár的有益讨论和鼓励。
References
參考文獻(xiàn)
[1] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012. 5
[2] R. Caruana. Multitask learning. Machine learning, 28(1), 1997. 6
[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014. 5
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. 2
[5] E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014. 4
[6] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014. 3
[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010. 1
[8] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. TPAMI, 2010. 3, 7, 8
[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 1, 3, 4, 8
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Region-based convolutional networks for accurate object detection and segmentation. TPAMI, 2015. 5, 7, 8
[11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014. 1, 2, 3, 4, 5, 6, 7
[12] J. H. Hosang, R. Benenson, P. Dollár, and B. Schiele. What makes for effective detection proposals? arXiv preprint arXiv:1502.05082, 2015. 8
[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proc. of the ACM International Conf. on Multimedia, 2014. 2
[14] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012. 1, 4, 6
[15] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006. 1
[16] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comp., 1989. 1
[17] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014. 5
[18] T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. arXiv e-prints, arXiv:1405.0312 [cs.CV], 2014. 8
[19] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. In ICLR, 2014. 1, 3
[20] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. 1, 5
[21] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013. 8
[22] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001. 8
[23] J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech, 2013. 4
[24] X. Zhu, C. Vondrick, D. Ramanan, and C. Fowlkes. Do we need more training data or better models for object detection? In BMVC, 2012. 7
[25] Y. Zhu, R. Urtasun, R. Salakhutdinov, and S. Fidler. segDeepM: Exploiting segmentation and context in deep neural networks for object detection. In CVPR, 2015. 1, 5