

Paper: Translation and Interpretation of "Spatial Transformer Networks"


Contents

Translation and Interpretation of "Spatial Transformer Networks"
Abstract
1 Introduction
2 Related Work
3 Spatial Transformers
3.1 Localisation Network
3.2 Parameterised Sampling Grid
3.3 Differentiable Image Sampling
3.4 Spatial Transformer Networks
4 Experiments
4.1 Distorted MNIST
4.2 Street View House Numbers
4.3 Fine-Grained Classification
5 Conclusion



Translation and Interpretation of "Spatial Transformer Networks"

Link: https://arxiv.org/pdf/1506.02025.pdf
Authors: Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu, Google DeepMind, London, UK. Contact: {jaderberg,simonyan,zisserman,korayk}@google.com


Abstract

Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner. In this work we introduce a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network. This differentiable module can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps, conditional on the feature map itself, without any extra training supervision or modification to the optimisation process. We show that the use of spatial transformers results in models which learn invariance to translation, scale, rotation and more generic warping, resulting in state-of-the-art performance on several benchmarks, and for a number of classes of transformations.


1 Introduction

Over recent years, the landscape of computer vision has been drastically altered and pushed forward through the adoption of a fast, scalable, end-to-end learning framework, the Convolutional Neural Network (CNN) [21]. Though not a recent invention, we now see a cornucopia of CNN-based models achieving state-of-the-art results in classification [19, 28, 35], localisation [31, 37], semantic segmentation [24], and action recognition [12, 32] tasks, amongst others.

A desirable property of a system which is able to reason about images is to disentangle object pose and part deformation from texture and shape. The introduction of local max-pooling layers in CNNs has helped to satisfy this property by allowing a network to be somewhat spatially invariant to the position of features. However, due to the typically small spatial support for max-pooling (e.g. 2 × 2 pixels) this spatial invariance is only realised over a deep hierarchy of max-pooling and convolutions, and the intermediate feature maps (convolutional layer activations) in a CNN are not actually invariant to large transformations of the input data [6, 22]. This limitation of CNNs is due to having only a limited, pre-defined pooling mechanism for dealing with variations in the spatial arrangement of data.

In this work we introduce a Spatial Transformer module, that can be included into a standard neural network architecture to provide spatial transformation capabilities. The action of the spatial transformer is conditioned on individual data samples, with the appropriate behaviour learnt during training for the task in question (without extra supervision). Unlike pooling layers, where the receptive fields are fixed and local, the spatial transformer module is a dynamic mechanism that can actively spatially transform an image (or a feature map) by producing an appropriate transformation for each input sample. The transformation is then performed on the entire feature map (non-locally) and can include scaling, cropping, rotations, as well as non-rigid deformations. This allows networks which include spatial transformers to not only select regions of an image that are most relevant (attention), but also to transform those regions to a canonical, expected pose to simplify recognition in the following layers. Notably, spatial transformers can be trained with standard back-propagation, allowing for end-to-end training of the models they are injected in.

Figure 1: The result of using a spatial transformer as the first layer of a fully-connected network trained for distorted MNIST digit classification. (a) The input to the spatial transformer network is an image of an MNIST digit that is distorted with random translation, scale, rotation, and clutter. (b) The localisation network of the spatial transformer predicts a transformation to apply to the input image. (c) The output of the spatial transformer, after applying the transformation. (d) The classification prediction produced by the subsequent fully-connected network on the output of the spatial transformer. The spatial transformer network (a CNN including a spatial transformer module) is trained end-to-end with only class labels – no knowledge of the groundtruth transformations is given to the system.
Spatial transformers can be incorporated into CNNs to benefit multifarious tasks, for example: (i) image classification: suppose a CNN is trained to perform multi-way classification of images according to whether they contain a particular digit – where the position and size of the digit may vary significantly with each sample (and are uncorrelated with the class); a spatial transformer that crops out and scale-normalizes the appropriate region can simplify the subsequent classification task, and lead to superior classification performance, see Fig. 1; (ii) co-localisation: given a set of images containing different instances of the same (but unknown) class, a spatial transformer can be used to localise them in each image; (iii) spatial attention: a spatial transformer can be used for tasks requiring an attention mechanism, such as in [14, 39], but is more flexible and can be trained purely with backpropagation without reinforcement learning. A key benefit of using attention is that transformed (and so attended), lower resolution inputs can be used in favour of higher resolution raw inputs, resulting in increased computational efficiency.

The rest of the paper is organised as follows: Sect. 2 discusses some work related to our own, we introduce the formulation and implementation of the spatial transformer in Sect. 3, and finally give the results of experiments in Sect. 4. Additional experiments and implementation details are given in Appendix A.


2 Related Work

In this section we discuss the prior work related to the paper, covering the central ideas of modelling transformations with neural networks [15, 16, 36], learning and analysing transformation-invariant representations [4, 6, 10, 20, 22, 33], as well as attention and detection mechanisms for feature selection [1, 7, 11, 14, 27, 29].

Early work by Hinton [15] looked at assigning canonical frames of reference to object parts, a theme which recurred in [16] where 2D affine transformations were modeled to create a generative model composed of transformed parts. The targets of the generative training scheme are the transformed input images, with the transformations between input images and targets given as an additional input to the network. The result is a generative model which can learn to generate transformed images of objects by composing parts. The notion of a composition of transformed parts is taken further by Tieleman [36], where learnt parts are explicitly affine-transformed, with the transform predicted by the network. Such generative capsule models are able to learn discriminative features for classification from transformation supervision.

The invariance and equivariance of CNN representations to input image transformations are studied in [22] by estimating the linear relationships between representations of the original and transformed images. Cohen & Welling [6] analyse this behaviour in relation to symmetry groups, which is also exploited in the architecture proposed by Gens & Domingos [10], resulting in feature maps that are more invariant to symmetry groups. Other attempts to design transformation invariant representations are scattering networks [4], and CNNs that construct filter banks of transformed filters [20, 33]. Stollenga et al. [34] use a policy based on a network's activations to gate the responses of the network's filters for a subsequent forward pass of the same image and so can allow attention to specific features. In this work, we aim to achieve invariant representations by manipulating the data rather than the feature extractors, something that was done for clustering in [9].

Figure 2: The architecture of a spatial transformer module. The input feature map U is passed to a localisation network which regresses the transformation parameters θ. The regular spatial grid G over V is transformed to the sampling grid T_θ(G), which is applied to U as described in Sect. 3.3, producing the warped output feature map V. The combination of the localisation network and sampling mechanism defines a spatial transformer.

Neural networks with selective attention manipulate the data by taking crops, and so are able to learn translation invariance. Work such as [1, 29] are trained with reinforcement learning to avoid the need for a differentiable attention mechanism, while [14] use a differentiable attention mechanism by utilising Gaussian kernels in a generative model. The work by Girshick et al. [11] uses a region proposal algorithm as a form of attention, and [7] show that it is possible to regress salient regions with a CNN. The framework we present in this paper can be seen as a generalisation of differentiable attention to any spatial transformation.


3 Spatial Transformers

In this section we describe the formulation of a spatial transformer. This is a differentiable module which applies a spatial transformation to a feature map during a single forward pass, where the transformation is conditioned on the particular input, producing a single output feature map. For multi-channel inputs, the same warping is applied to each channel. For simplicity, in this section we consider single transforms and single outputs per transformer, however we can generalise to multiple transformations, as shown in experiments.

The spatial transformer mechanism is split into three parts, shown in Fig. 2. In order of computation, first a localisation network (Sect. 3.1) takes the input feature map, and through a number of hidden layers outputs the parameters of the spatial transformation that should be applied to the feature map – this gives a transformation conditional on the input. Then, the predicted transformation parameters are used to create a sampling grid, which is a set of points where the input map should be sampled to produce the transformed output. This is done by the grid generator, described in Sect. 3.2. Finally, the feature map and the sampling grid are taken as inputs to the sampler, producing the output map sampled from the input at the grid points (Sect. 3.3).

The combination of these three components forms a spatial transformer and will now be described in more detail in the following sections.


3.1 Localisation Network

The localisation network takes the input feature map U ∈ R^{H×W×C} with width W, height H and C channels and outputs θ, the parameters of the transformation T_θ to be applied to the feature map: θ = f_loc(U). The size of θ can vary depending on the transformation type that is parameterised, e.g. for an affine transformation θ is 6-dimensional as in (10). The localisation network function f_loc() can take any form, such as a fully-connected network or a convolutional network, but should include a final regression layer to produce the transformation parameters θ.
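The paper leaves the form of f_loc open, so as an illustration only, the following PyTorch sketch (layer sizes and names are my assumptions, not the paper's architecture) shows a small fully-connected localisation network whose final regression layer outputs the six affine parameters and is initialised to predict the identity transform, mirroring the initialisation described later for the experiments (Sect. 4.2).

```python
import torch
import torch.nn as nn

class FCLocalisationNet(nn.Module):
    """A fully-connected localisation network f_loc (illustrative sketch only).

    Maps a flattened input feature map U to theta, here the 6 parameters of a
    2D affine transformation A_theta.
    """
    def __init__(self, in_features: int, hidden: int = 32):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Flatten(), nn.Linear(in_features, hidden), nn.ReLU())
        # Final regression layer producing theta, initialised to the identity
        # transform so the spatial transformer starts as a no-op warp.
        self.regress = nn.Linear(hidden, 6)
        nn.init.zeros_(self.regress.weight)
        self.regress.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, U: torch.Tensor) -> torch.Tensor:
        theta = self.regress(self.hidden(U))   # shape (N, 6)
        return theta.view(-1, 2, 3)            # reshaped as the 2x3 matrix A_theta
```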


3.2 Parameterised Sampling Grid

To perform a warping of the input feature map, each output pixel is computed by applying a sampling kernel centered at a particular location in the input feature map (this is described fully in the next section). By pixel we refer to an element of a generic feature map, not necessarily an image. In general, the output pixels are defined to lie on a regular grid G = {G_i} of pixels G_i = (x_i^t, y_i^t), forming an output feature map V ∈ R^{H'×W'×C}, where H' and W' are the height and width of the grid, and C is the number of channels, which is the same in the input and output.

For a 2D affine transformation A_θ, the pointwise transformation is

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \mathcal{T}_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \qquad (10)$$

where (x_i^t, y_i^t) are the target coordinates of the regular grid in the output feature map, (x_i^s, y_i^s) are the source coordinates in the input feature map that define the sample points, and A_θ is the affine transformation matrix. We use height and width normalised coordinates, such that -1 ≤ x_i^t, y_i^t ≤ 1 when within the spatial bounds of the output, and -1 ≤ x_i^s, y_i^s ≤ 1 when within the spatial bounds of the input (and similarly for the y coordinates). The source/target transformation and sampling is equivalent to the standard texture mapping and coordinates used in graphics [8].

The class of transformations T_θ may be more constrained, such as that used for attention

$$A_\theta = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix} \qquad (2)$$

allowing cropping, translation, and isotropic scaling by varying s, t_x, and t_y. The transformation T_θ can also be more general, such as a plane projective transformation with 8 parameters, piecewise affine, or a thin plate spline. Indeed, the transformation can have any parameterised form, provided that it is differentiable with respect to the parameters – this crucially allows gradients to be backpropagated through from the sample points T_θ(G_i) to the localisation network output θ. If the transformation is parameterised in a structured, low-dimensional way, this reduces the complexity of the task assigned to the localisation network. For instance, a generic class of structured and differentiable transformations, which is a superset of attention, affine, projective, and thin plate spline transformations, is T_θ = M_θ B, where B is a target grid representation (e.g. in (10), B is the regular grid G in homogeneous coordinates), and M_θ is a matrix parameterised by θ. In this case it is possible to not only learn how to predict θ for a sample, but also to learn B for the task at hand.


3.3 Differentiable Image Sampling

To perform a spatial transformation of the input feature map, a sampler must take the set of sampling points T_θ(G), along with the input feature map U and produce the sampled output feature map V. Each (x_i^s, y_i^s) coordinate in T_θ(G) defines the spatial location in the input where a sampling kernel is applied to get the value at a particular pixel in the output V. This can be written as

$$V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, k(x_i^s - m; \Phi_x)\, k(y_i^s - n; \Phi_y) \quad \forall i \in [1 \ldots H'W'] \;\; \forall c \in [1 \ldots C] \qquad (3)$$

where Φ_x and Φ_y are the parameters of a generic sampling kernel k() which defines the image interpolation (e.g. bilinear), U^c_{nm} is the value at location (n, m) in channel c of the input, and V^c_i is the output value for pixel i at location (x_i^t, y_i^t) in channel c. Note that the sampling is done identically for each channel of the input, so every channel is transformed in an identical way (this preserves spatial consistency between channels). In theory, any sampling kernel can be used, as long as (sub-)gradients can be defined with respect to x_i^s and y_i^s. For example, using the integer sampling kernel reduces (3) to

$$V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, \delta(\lfloor x_i^s + 0.5 \rfloor - m)\, \delta(\lfloor y_i^s + 0.5 \rfloor - n) \qquad (4)$$

where $\lfloor x + 0.5 \rfloor$ rounds x to the nearest integer and δ() is the Kronecker delta function. This sampling kernel equates to just copying the value at the nearest pixel to (x_i^s, y_i^s) to the output location (x_i^t, y_i^t). Alternatively, a bilinear sampling kernel can be used, giving

$$V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, \max(0, 1 - |x_i^s - m|)\, \max(0, 1 - |y_i^s - n|) \qquad (5)$$

To allow backpropagation of the loss through this sampling mechanism we can define the gradients with respect to U and G. For bilinear sampling (5) the partial derivatives are

$$\frac{\partial V_i^c}{\partial U_{nm}^c} = \sum_n^H \sum_m^W \max(0, 1 - |x_i^s - m|)\, \max(0, 1 - |y_i^s - n|) \qquad (6)$$

$$\frac{\partial V_i^c}{\partial x_i^s} = \sum_n^H \sum_m^W U_{nm}^c \, \max(0, 1 - |y_i^s - n|) \begin{cases} 0 & \text{if } |m - x_i^s| \ge 1 \\ 1 & \text{if } m \ge x_i^s \\ -1 & \text{if } m < x_i^s \end{cases} \qquad (7)$$

and similarly to (7) for $\partial V_i^c / \partial y_i^s$.

This gives us a (sub-)differentiable sampling mechanism, allowing loss gradients to flow back not only to the input feature map (6), but also to the sampling grid coordinates (7), and therefore back to the transformation parameters θ and localisation network since $\partial x_i^s / \partial \theta$ and $\partial y_i^s / \partial \theta$ can be easily derived from (10) for example. Due to discontinuities in the sampling functions, sub-gradients must be used. This sampling mechanism can be implemented very efficiently on GPU, by ignoring the sum over all input locations and instead just looking at the kernel support region for each output pixel.
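As a rough illustration of Eq. (5) in code, the NumPy sketch below (a hypothetical helper, not code from the paper) performs bilinear sampling for a single channel; it uses exactly the trick mentioned above, evaluating only the four pixels in the kernel support of each output location instead of summing over the whole input.

```python
import numpy as np

def bilinear_sample(U: np.ndarray, grid: np.ndarray) -> np.ndarray:
    """Sampler sketch: bilinear sampling of Eq. (5) for one channel.

    U    : input feature map, shape (H, W)
    grid : normalised source coordinates (x_s, y_s), shape (H_out, W_out, 2)
    """
    H, W = U.shape
    # Map normalised coordinates [-1, 1] to pixel coordinates [0, W-1] / [0, H-1].
    x = (grid[..., 0] + 1.0) * (W - 1) / 2.0
    y = (grid[..., 1] + 1.0) * (H - 1) / 2.0
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    # Clamp indices so samples falling outside the input read the border value.
    x0c, x1c = np.clip(x0, 0, W - 1), np.clip(x1, 0, W - 1)
    y0c, y1c = np.clip(y0, 0, H - 1), np.clip(y1, 0, H - 1)
    # Weights max(0, 1 - |x_s - m|) * max(0, 1 - |y_s - n|) for the 4 neighbours.
    wa = (x1 - x) * (y1 - y)
    wb = (x1 - x) * (y - y0)
    wc = (x - x0) * (y1 - y)
    wd = (x - x0) * (y - y0)
    return (wa * U[y0c, x0c] + wb * U[y1c, x0c] +
            wc * U[y0c, x1c] + wd * U[y1c, x1c])
```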


3.4 Spatial Transformer Networks

The combination of the localisation network, grid generator, and sampler form a spatial transformer (Fig. 2). This is a self-contained module which can be dropped into a CNN architecture at any point, and in any number, giving rise to spatial transformer networks. This module is computationally very fast and does not impair the training speed, causing very little time overhead when used naively, and even speedups in attentive models due to subsequent downsampling that can be applied to the output of the transformer.

Placing spatial transformers within a CNN allows the network to learn how to actively transform the feature maps to help minimise the overall cost function of the network during training. The knowledge of how to transform each training sample is compressed and cached in the weights of the localisation network (and also the weights of the layers previous to a spatial transformer) during training. For some tasks, it may also be useful to feed the output of the localisation network, θ, forward to the rest of the network, as it explicitly encodes the transformation, and hence the pose, of a region or object.

Table 1: Left: The percentage errors for different models on different distorted MNIST datasets. The different distorted MNIST datasets we test are TC: translated and cluttered, R: rotated, RTS: rotated, translated, and scaled, P: projective distortion, E: elastic distortion. All the models used for each experiment have the same number of parameters, and same base structure for all experiments. Right: Some example test images where a spatial transformer network correctly classifies the digit but a CNN fails. (a) The inputs to the networks. (b) The transformations predicted by the spatial transformers, visualised by the grid T_θ(G). (c) The outputs of the spatial transformers. E and RTS examples use thin plate spline spatial transformers (ST-CNN TPS), while R examples use affine spatial transformers (ST-CNN Aff) with the angles of the affine transformations given. For videos showing animations of these experiments and more see https://goo.gl/qdEhUu.
It is also possible to use spatial transformers to downsample or oversample a feature map, as one can define the output dimensions H' and W' to be different to the input dimensions H and W. However, with sampling kernels with a fixed, small spatial support (such as the bilinear kernel), downsampling with a spatial transformer can cause aliasing effects.

Finally, it is possible to have multiple spatial transformers in a CNN. Placing multiple spatial transformers at increasing depths of a network allow transformations of increasingly abstract representations, and also gives the localisation networks potentially more informative representations to base the predicted transformation parameters on. One can also use multiple spatial transformers in parallel – this can be useful if there are multiple objects or parts of interest in a feature map that should be focussed on individually. A limitation of this architecture in a purely feed-forward network is that the number of parallel spatial transformers limits the number of objects that the network can model.


4 Experiments

In this section we explore the use of spatial transformer networks on a number of supervised learning tasks. In Sect. 4.1 we begin with experiments on distorted versions of the MNIST handwriting dataset, showing the ability of spatial transformers to improve classification performance through actively transforming the input images. In Sect. 4.2 we test spatial transformer networks on a challenging real-world dataset, Street View House Numbers [25], for number recognition, showing state-of-the-art results using multiple spatial transformers embedded in the convolutional stack of a CNN. Finally, in Sect. 4.3, we investigate the use of multiple parallel spatial transformers for fine-grained classification, showing state-of-the-art performance on the CUB-200-2011 birds dataset [38] by discovering object parts and learning to attend to them. Further experiments of MNIST addition and co-localisation can be found in Appendix A.


4.1 Distorted MNIST

In this section we use the MNIST handwriting dataset as a testbed for exploring the range of transformations to which a network can learn invariance to by using a spatial transformer.

We begin with experiments where we train different neural network models to classify MNIST data that has been distorted in various ways: rotation (R), rotation, scale and translation (RTS), projective transformation (P), and elastic warping (E) – note that elastic warping is destructive and can not be inverted in some cases. The full details of the distortions used to generate this data are given in Appendix A. We train baseline fully-connected (FCN) and convolutional (CNN) neural networks, as well as networks with spatial transformers acting on the input before the classification network (ST-FCN and ST-CNN). The spatial transformer networks all use bilinear sampling, but variants use different transformation functions: an affine transformation (Aff), projective transformation (Proj), and a 16-point thin plate spline transformation (TPS) [2]. The CNN models include two max-pooling layers. All networks have approximately the same number of parameters, are trained with identical optimisation schemes (backpropagation, SGD, scheduled learning rate decrease, with a multinomial cross entropy loss), and all with three weight layers in the classification network.

Table 2: Left: The sequence error for SVHN multi-digit recognition on crops of 64 × 64 pixels (64px), and inflated crops of 128 × 128 (128px) which include more background. *The best reported result from [1] uses model averaging and Monte Carlo averaging, whereas the results from other models are from a single forward pass of a single model. Right: (a) The schematic of the ST-CNN Multi model. The transformations applied by each spatial transformer (ST) is applied to the convolutional feature map produced by the previous layer. (b) The result of multiplying out the affine transformations predicted by the four spatial transformers in ST-CNN Multi, visualised on the input image.
The results of these experiments are shown in Table 1 (left). Looking at any particular type of distortion of the data, it is clear that a spatial transformer enabled network outperforms its counterpart base network. For the case of rotation, translation, and scale distortion (RTS), the ST-CNN achieves 0.5% and 0.6% depending on the class of transform used for T_θ, whereas a CNN, with two max-pooling layers to provide spatial invariance, achieves 0.8% error. This is in fact the same error that the ST-FCN achieves, which is without a single convolution or max-pooling layer in its network, showing that using a spatial transformer is an alternative way to achieve spatial invariance. ST-CNN models consistently perform better than ST-FCN models due to max-pooling layers in ST-CNN providing even more spatial invariance, and convolutional layers better modelling local structure. We also test our models in a noisy environment, on 60 × 60 images with translated MNIST digits and background clutter (see Fig. 1 third row for an example): an FCN gets 13.2% error, a CNN gets 3.5% error, while an ST-FCN gets 2.0% error and an ST-CNN gets 1.7% error.

Looking at the results between different classes of transformation, the thin plate spline transformation (TPS) is the most powerful, being able to reduce error on elastically deformed digits by reshaping the input into a prototype instance of the digit, reducing the complexity of the task for the classification network, and does not over fit on simpler data e.g. R. Interestingly, the transformation of inputs for all ST models leads to a "standard" upright posed digit – this is the mean pose found in the training data. In Table 1 (right), we show the transformations performed for some test cases where a CNN is unable to correctly classify the digit, but a spatial transformer network can. Further test examples are visualised in an animation here https://goo.gl/qdEhUu.


4.2 Street View House Numbers

We now test our spatial transformer networks on a challenging real-world dataset, Street View House Numbers (SVHN) [25]. This dataset contains around 200k real world images of house numbers, with the task to recognise the sequence of numbers in each image. There are between 1 and 5 digits in each image, with a large variability in scale and spatial arrangement.

We follow the experimental setup as in [1, 13], where the data is preprocessed by taking 64 × 64 crops around each digit sequence. We also use an additional more loosely 128 × 128 cropped dataset as in [1]. We train a baseline character sequence CNN model with 11 hidden layers leading to five independent softmax classifiers, each one predicting the digit at a particular position in the sequence. This is the character sequence model used in [19], where each classifier includes a null-character output to model variable length sequences. This model matches the results obtained in [13].

We extend this baseline CNN to include a spatial transformer immediately following the input (ST-CNN Single), where the localisation network is a four-layer CNN. We also define another extension where before each of the first four convolutional layers of the baseline CNN, we insert a spatial transformer (ST-CNN Multi), where the localisation networks are all two layer fully connected networks with 32 units per layer. In the ST-CNN Multi model, the spatial transformer before the first convolutional layer acts on the input image as with the previous experiments, however the subsequent spatial transformers deeper in the network act on the convolutional feature maps, predicting a transformation from them and transforming these feature maps (this is visualised in Table 2 (right) (a)). This allows deeper spatial transformers to predict a transformation based on richer features rather than the raw image. All networks are trained from scratch with SGD and dropout [17], with randomly initialised weights, except for the regression layers of spatial transformers which are initialised to predict the identity transform. Affine transformations and bilinear sampling kernels are used for all spatial transformer networks in these experiments.

Table 3: Left: The accuracy on CUB-200-2011 bird classification dataset. Spatial transformer networks with two spatial transformers (2×ST-CNN) and four spatial transformers (4×ST-CNN) in parallel achieve higher accuracy. 448px resolution images can be used with the ST-CNN without an increase in computational cost due to downsampling to 224px after the transformers. Right: The transformation predicted by the spatial transformers of 2×ST-CNN (top row) and 4×ST-CNN (bottom row) on the input image. Notably for the 2×ST-CNN, one of the transformers (shown in red) learns to detect heads, while the other (shown in green) detects the body, and similarly for the 4×ST-CNN.

The results of this experiment are shown in Table 2 (left) – the spatial transformer models obtain state-of-the-art results, reaching 3.6% error on 64 × 64 images compared to previous state-of-the-art of 3.9% error. Interestingly on 128 × 128 images, while other methods degrade in performance, an ST-CNN achieves 3.9% error while the previous state of the art at 4.5% error is with a recurrent attention model that uses an ensemble of models with Monte Carlo averaging – in contrast the ST-CNN models require only a single forward pass of a single model. This accuracy is achieved due to the fact that the spatial transformers crop and rescale the parts of the feature maps that correspond to the digit, focussing resolution and network capacity only on these areas (see Table 2 (right) (b) for some examples). In terms of computation speed, the ST-CNN Multi model is only 6% slower (forward and backward pass) than the CNN.


4.3 Fine-Grained Classification

In this section, we use a spatial transformer network with multiple transformers in parallel to perform fine-grained bird classification. We evaluate our models on the CUB-200-2011 birds dataset [38], containing 6k training images and 5.8k test images, covering 200 species of birds. The birds appear at a range of scales and orientations, are not tightly cropped, and require detailed texture and shape analysis to distinguish. In our experiments, we only use image class labels for training.

We consider a strong baseline CNN model – an Inception architecture with batch normalisation [18] pre-trained on ImageNet [26] and fine-tuned on CUB – which by itself achieves the state-of-the-art accuracy of 82.3% (previous best result is 81.0% [30]). We then train a spatial transformer network, ST-CNN, which contains 2 or 4 parallel spatial transformers, parameterised for attention and acting on the input image. Discriminative image parts, captured by the transformers, are passed to the part description sub-nets (each of which is also initialised by Inception). The resulting part representations are concatenated and classified with a single softmax layer. The whole architecture is trained on image class labels end-to-end with backpropagation (full details in Appendix A).

The results are shown in Table 3 (left). The ST-CNN achieves an accuracy of 84.1%, outperforming the baseline by 1.8%. It should be noted that there is a small (22/5794) overlap between the ImageNet training set and CUB-200-2011 test set – removing these images from the test set results in 84.0% accuracy with the same ST-CNN. In the visualisations of the transforms predicted by 2×ST-CNN (Table 3 (right)) one can see interesting behaviour has been learnt: one spatial transformer (red) has learnt to become a head detector, while the other (green) fixates on the central part of the body of a bird. The resulting output from the spatial transformers for the classification network is a somewhat pose-normalised representation of a bird. While previous work such as [3] explicitly define parts of the bird, training separate detectors for these parts with supplied keypoint training data, the ST-CNN is able to discover and learn part detectors in a data-driven manner without any additional supervision. In addition, the use of spatial transformers allows us to use 448px resolution input images without any impact in performance, as the output of the transformed 448px images are downsampled to 224px before being processed.


5 Conclusion

In this paper we introduced a new self-contained module for neural networks – the spatial transformer. This module can be dropped into a network and perform explicit spatial transformations of features, opening up new ways for neural networks to model data, and is learnt in an end-to-end fashion, without making any changes to the loss function. While CNNs provide an incredibly strong baseline, we see gains in accuracy using spatial transformers across multiple tasks, resulting in state-of-the-art performance. Furthermore, the regressed transformation parameters from the spatial transformer are available as an output and could be used for subsequent tasks. While we only explore feed-forward networks in this work, early experiments show spatial transformers to be powerful in recurrent models, and useful for tasks requiring the disentangling of object reference frames, as well as easily extendable to 3D transformations (see Appendix A.3).


