CV: Translation and Interpretation of the 2019 Paper "A Survey of the Recent Architectures of Deep Convolutional Neural Networks" (Chapters 1-3)
Overview: A survey of the recent architectures of deep convolutional neural networks, with a historical review of and observations on the latest computer vision literature in the field of artificial intelligence.
Original authors
Asifullah Khan1, 2*, Anabia Sohail1, 2, Umme Zahoora1, and Aqsa Saeed Qureshi1
1 Pattern Recognition Lab, DCIS, PIEAS, Nilore, Islamabad 45650, Pakistan
2 Deep Learning Lab, Center for Mathematical Sciences, PIEAS, Nilore, Islamabad 45650, Pakistan
asif@pieas.edu.pk
Updating...
Related articles
CV: Translation and Interpretation of the 2019 Paper "A Survey of the Recent Architectures of Deep Convolutional Neural Networks" (Chapters 1-3)
CV: Translation and Interpretation of the 2019 Paper "A Survey of the Recent Architectures of Deep Convolutional Neural Networks" (Chapter 4)
CV: Translation and Interpretation of the 2019 Paper "A Survey of the Recent Architectures of Deep Convolutional Neural Networks" (Chapters 5-8)
Table of Contents
Abstract
1、Introduction
2 Basic CNN Components
2.1 Convolutional Layer
2.2 Pooling Layer
2.3 Activation Function
2.4 Batch Normalization
2.5 Dropout
2.6 Fully Connected Layer
3 Architectural Evolution of Deep CNN
3.1 Late 1980s-1999: Origin of CNN
3.2 Early 2000: Stagnation of CNN
3.3 2006-2011: Revival of CNN
3.4 2012-2014: Rise of CNN
3.5 2015-Present: Rapid increase in Architectural Innovations and Applications of CNN
Original paper download: https://download.csdn.net/download/qq_41185868/15548439
Abstract
| ? ? ? ? Deep Convolutional Neural Networks (CNNs) are a special type of Neural Networks, which have shown state-of-the-art performance on various competitive benchmarks. The powerful learning ability of deep CNN is largely due to the use of multiple feature extraction stages (hidden layers) that can automatically learn representations from the data. Availability of a large amount of data and improvements in the hardware processing units have accelerated the research in CNNs, and recently very interesting deep CNN architectures are reported. The recent race in developing deep CNNs shows that the innovative architectural ideas, as well as parameter optimization, can improve CNN performance. In this regard, different ideas in the CNN design have been explored such as the use of different activation and loss functions, parameter optimization, regularization, and restructuring of the processing units. However, the major improvement in representational capacity of the deep CNN is achieved by the restructuring of the processing units. Especially, the idea of using a block as a structural unit instead of a layer is receiving substantial attention. This survey thus focuses on the intrinsic taxonomy present in the recently reported deep CNN architectures and consequently, classifies the recent innovations in CNN architectures into seven different categories. These seven categories are based on spatial exploitation, depth, multi-path, width, feature map exploitation, channel boosting, and attention. Additionally, this survey also covers the elementary understanding of CNN components and sheds light on its current challenges and applications. | ? ? ? ?深度卷積神經(jīng)網(wǎng)絡(luò)(CNNs)是一種特殊類(lèi)型的神經(jīng)網(wǎng)絡(luò),在各種競(jìng)爭(zhēng)性基準(zhǔn)測(cè)試中表現(xiàn)出了最先進(jìn)的性能。深度CNN強(qiáng)大的學(xué)習(xí)能力很大程度上是由于它使用了多個(gè)特征提取階段(隱含層),可以從數(shù)據(jù)中自動(dòng)學(xué)習(xí)表示。大量數(shù)據(jù)的可用性和硬件處理單元的改進(jìn)加速了CNNs的研究,并且,最近報(bào)道了非常有意思的深度CNN架構(gòu)。最近開(kāi)發(fā)深度CNNs的競(jìng)賽表明,創(chuàng)新的架構(gòu)思想和參數(shù)優(yōu)化可以提高CNN的性能。為此,在CNN的設(shè)計(jì)中探索了不同的思路,如使用不同的激活和丟失函數(shù)、參數(shù)優(yōu)化、正則化以及處理單元的重組。然而,深度CNN的代表性能力的主要提高是通過(guò)處理單元的重組實(shí)現(xiàn)的。特別是,使用一個(gè)塊作為一個(gè)結(jié)構(gòu)單元而不是一層的想法正在得到大量的關(guān)注。因此,本次調(diào)查的重點(diǎn)是最近報(bào)道的深度CNN架構(gòu)的內(nèi)在分類(lèi),因此,將CNN架構(gòu)的最新創(chuàng)新分為七個(gè)不同的類(lèi)別。這七個(gè)類(lèi)別分別基于空間開(kāi)發(fā)、深度、多路徑、寬度、特征地圖開(kāi)發(fā)、通道提升和注意力機(jī)制。此外,本調(diào)查還涵蓋了對(duì)CNN組件的基本理解,并闡明了其當(dāng)前的挑戰(zhàn)和應(yīng)用。 |
| Keywords: Deep Learning, Convolutional Neural Networks, Architecture, Representational Capacity, Residual Learning, and Channel Boosted CNN. | 關(guān)鍵詞:深度學(xué)習(xí),卷積神經(jīng)網(wǎng)絡(luò),架構(gòu),表征能力,殘差學(xué)習(xí),通道提升的CNN |
1、Introduction
| ? ? ? ??Machine Learning (ML) algorithms belong to a specialized area in Artificial Intelligence (AI), which endows intelligence to computers by learning the underlying relationships among the data and making decisions without being explicitly programmed. Different ML algorithms have been developed since the late 1990s, for the emulation of human sensory responses such as speech and vision, but they have generally failed to achieve human-level satisfaction [1]–[6]. The challenging nature of Machine Vision (MV) tasks gives rise to a specialized class of Neural Networks (NN), known as Convolutional Neural Network (CNN) [7]. | ? ?機(jī)器學(xué)習(xí)(ML)算法屬于人工智能(AI)的一個(gè)專(zhuān)門(mén)領(lǐng)域,它通過(guò)學(xué)習(xí)數(shù)據(jù)之間的基本關(guān)系并在沒(méi)有顯示編程的情況下做出決策,從而賦予計(jì)算機(jī)智能。自20世紀(jì)90年代末以來(lái),針對(duì)語(yǔ)音、視覺(jué)等人類(lèi)感官反應(yīng)的仿真,人們開(kāi)發(fā)了各種各樣的ML算法,但普遍未能達(dá)到人的滿意程度[1]-[6]。由于機(jī)器視覺(jué)(MV)任務(wù)的挑戰(zhàn)性,產(chǎn)生了一類(lèi)專(zhuān)門(mén)的神經(jīng)網(wǎng)絡(luò)(NN),稱(chēng)為卷積神經(jīng)網(wǎng)絡(luò)(CNN)[7]。 |
| ? ? ?CNNs are considered as one of the best techniques for learning image content and have shown state-of-the-art results on image recognition, segmentation, detection, and retrieval related tasks [8], [9]. The success of CNN has captured attention beyond academia. In industry, companies such as Google, Microsoft, AT&T, NEC, and Facebook have developed active research groups for exploring new architectures of CNN [10]. At present, most of the frontrunners of image processing competitions are employing deep CNN based models. | CNNs被認(rèn)為是學(xué)習(xí)圖像內(nèi)容的最佳技術(shù)之一,在圖像識(shí)別、分割、檢測(cè)和檢索相關(guān)任務(wù)[8]、[9]方面已經(jīng)取得了最新的成果。CNN的成功吸引了學(xué)術(shù)界以外的關(guān)注。在業(yè)界,谷歌、微軟、AT&T、NEC、Facebook等公司都建立了活躍的研究小組,探索CNN[10]的新架構(gòu)。目前,大多數(shù)圖像處理競(jìng)賽的領(lǐng)跑者,都在使用基于深度CNN的模型。 |
| The topology of CNN is divided into multiple learning stages composed of a combination of the convolutional layer, non-linear processing units, and subsampling layers [11]. Each layer performs multiple transformations using a bank of convolutional kernels (filters) [12]. Convolution operation extracts locally correlated features by dividing the image into small slices (similar to the retina of the human eye), making it capable of learning suitable features. Output of the convolutional kernels is assigned to non-linear processing units, which not only helps in learning abstraction but also embeds non-linearity in the feature space. This non-linearity generates different patterns of activations for different responses and thus facilitates in learning of semantic differences in images. Output of the non-linear function is usually followed by subsampling, which helps in summarizing the results and also makes the input invariant to geometrical distortions [12], [13]. | CNN的拓?fù)浣Y(jié)構(gòu)分為多個(gè)學(xué)習(xí)階段,包括卷積層、非線性處理單元和子采樣層的組合[11]。每一層使用一組卷積核(濾波器)執(zhí)行多重變換[12]。卷積操作通過(guò)將圖像分割成小塊(類(lèi)似于人眼視網(wǎng)膜)來(lái)提取局部相關(guān)特征,使其能夠學(xué)習(xí)合適的特征。卷積核的輸出被分配給非線性處理單元,這不僅有助于學(xué)習(xí)抽象,而且在特征空間中嵌入非線性。這種非線性會(huì)為不同的反應(yīng)產(chǎn)生不同的激活模式,從而有助于學(xué)習(xí)圖像中的語(yǔ)義差異。非線性函數(shù)的輸出通常隨后是子采樣,這有助于總結(jié)結(jié)果,并使輸入對(duì)幾何畸變保持不變[12],[13]。 |
| The architectural design of CNN was inspired by Hubel and Wiesel’s work and thus largely follows the basic structure of primate’s visual cortex [14], [15]. CNN first came to limelight through the work of LeCuN in 1989 for the processing of grid-like topological data (images and?time series data) [7], [16]. The popularity of CNN is largely due to its hierarchical feature extraction ability. Hierarchical organization of CNN emulates the deep and layered learning process of the Neocortex in the human brain, which automatically extract features from the underlying data [17]. The staging of learning process in CNN shows quite resemblance with primate’s ventral pathway of visual cortex (V1-V2-V4-IT/VTC) [18]. The visual cortex of primates first receives input from the retinotopic area, where multi-scale highpass filtering and contrast normalization is performed by the lateral geniculate nucleus. After this, detection is performed by different regions of the visual cortex categorized as V1, V2, V3, and V4. In fact, V1 and V2 portion of visual cortex are similar to convolutional, and subsampling layers, whereas inferior temporal region resembles the higher layers of CNN, which makes inference about the image [19]. During training, CNN learns through backpropagation algorithm, by regulating the change in weights with respect to the input. Minimization of a cost function by CNN using backpropagation algorithm is similar to the response based learning of human brain. CNN has the ability to extract low, mid, and high-level features. High level features (more abstract features) are a combination of lower and mid-level features. With the automatic feature extraction ability, CNN reduces the need for synthesizing a separate feature extractor [20]. Thus, CNN can learn good internal representation from raw pixels with diminutive processing. | CNN的架構(gòu)設(shè)計(jì)靈感來(lái)自于Hubel和Wiesel的工作,因此很大程度上遵循了靈長(zhǎng)類(lèi)動(dòng)物視覺(jué)皮層的基本結(jié)構(gòu)[14],[15]。CNN最早是在1989年通過(guò)LeCuN的工作引起了人們的注意,它處理了網(wǎng)格狀的拓?fù)鋽?shù)據(jù)(圖像和時(shí)間序列數(shù)據(jù))[7],[16]。CNN的流行很大程度上是由于它的層次特征提取能力。CNN的分層組織模擬人腦皮層的深層和分層學(xué)習(xí)過(guò)程,它自動(dòng)從底層數(shù)據(jù)中提取特征[17]。CNN中學(xué)習(xí)過(guò)程的分期與靈長(zhǎng)類(lèi)視覺(jué)皮層腹側(cè)通路(V1-V2-V4-IT/VTC)非常相似[18]。靈長(zhǎng)類(lèi)動(dòng)物的視覺(jué)皮層首先接收來(lái)自視黃醇區(qū)的輸入,在視黃醇區(qū),外側(cè)膝狀體核進(jìn)行多尺度高通濾波和對(duì)比度歸一化。之后,由視覺(jué)皮層的不同區(qū)域進(jìn)行檢測(cè),這些區(qū)域分為V1、V2、V3和V4。事實(shí)上,視覺(jué)皮層的V1和V2部分與卷積層和亞采樣層相似,而顳下區(qū)與CNN的高層相似,后者對(duì)圖像進(jìn)行推斷[19]。在訓(xùn)練過(guò)程中,CNN通過(guò)反向傳播算法學(xué)習(xí),通過(guò)調(diào)節(jié)輸入權(quán)重的變化。使用反向傳播算法的CNN最小化代價(jià)函數(shù)類(lèi)似于基于響應(yīng)的人腦學(xué)習(xí)。CNN能夠提取低、中、高級(jí)特征。高級(jí)特征(更抽象的特征)是低級(jí)和中級(jí)特征的組合。具有自動(dòng)特征提取功能,CNN減少了合成單獨(dú)特征提取器的需要[20]。因此,CNN可以通過(guò)較小的處理從原始像素中學(xué)習(xí)良好的內(nèi)部表示。 |
| The main boom in the use of CNN for image classification and segmentation occurred after it was observed that the representational capacity of a CNN can be enhanced by increasing its depth [21]. Deep architectures have an advantage over shallow architectures, when dealing with complex learning problems. Stacking of multiple linear and non-linear processing units in a layer wise fashion provides deep networks the ability to learn complex representations at different levels of abstraction. In addition, advancements in hardware and thus the availability of high computing resources is also one of the main reasons of the recent success of deep CNNs. Deep CNN architectures have shown significant performance of improvements over shallow and conventional vision based models. Apart from its use in supervised learning, deep CNNs have potential to learn useful representation from large scale of unlabeled data. Use of the multiple mapping functions by CNN enables it to improve the extraction of invariant representations and consequently, makes it capable to handle recognition tasks of hundreds of categories. Recently, it is shown that different level of features including both low and high-level can be transferred to a?generic recognition task by exploiting the concept of Transfer Learning (TL) [22]–[24]. Important attributes of CNN are hierarchical learning, automatic feature extraction, multi-tasking, and weight sharing [25]–[27]. | CNN用于圖像分類(lèi)和分割的主要興起發(fā)生在觀察到CNN的表示能力可以通過(guò)增加其深度來(lái)增強(qiáng)之后[21]。在處理復(fù)雜的學(xué)習(xí)問(wèn)題時(shí),深度架構(gòu)比淺層架構(gòu)具有優(yōu)勢(shì)。以分層方式堆疊多個(gè)線性和非線性處理單元,使深層網(wǎng)絡(luò)能夠在不同抽象級(jí)別學(xué)習(xí)復(fù)雜表示。此外,硬件的進(jìn)步以及高計(jì)算資源的可用性也是deep CNNs最近成功的主要原因之一。深度CNN架構(gòu)已經(jīng)顯示出比淺層和傳統(tǒng)的基于視覺(jué)的模型有顯著改進(jìn)的性能。除了在監(jiān)督學(xué)習(xí)中的應(yīng)用外,深度CNN還具有從大規(guī)模未標(biāo)記數(shù)據(jù)中學(xué)習(xí)有用表示的潛力。利用CNN的多重映射函數(shù),提高了不變量表示的提取效率,使其能夠處理數(shù)百個(gè)類(lèi)別的識(shí)別任務(wù)。近年來(lái),研究表明,利用遷移學(xué)習(xí)(TL)[22]-[24]的概念,可以將包括低層和高層特征在內(nèi)的不同層次的特征,轉(zhuǎn)化為一般的識(shí)別任務(wù)。CNN的重要特性是分層學(xué)習(xí)、自動(dòng)特征提取、多任務(wù)處理和權(quán)重共享[25]-[27]。 |
| ? ? ? ??Various improvements in CNN learning strategy and architecture were performed to make CNN scalable to large and complex problems. These innovations can be categorized as parameter optimization, regularization, structural reformulation, etc. However, it is observed that CNN based applications became prevalent after the exemplary performance of AlexNet on ImageNet dataset [21]. Thus major innovations in CNN have been proposed since 2012 and were mainly due to restructuring of processing units and designing of new blocks. Similarly, Zeiler and Fergus [28] introduced the concept of layer-wise visualization of features, which shifted the trend towards extraction of features at low spatial resolution in deep architecture such as VGG [29]. Nowadays, most of the new architectures are built upon the principle of simple and homogenous topology introduced by VGG. On the other hand, Google group introduced an interesting idea of split, transform, and merge, and the corresponding block is known as inception block. The inception block for the very first time gave the concept of branching within a layer, which allows abstraction of features at different spatial scales [30]. In 2015, the concept of skip connections introduced by ResNet [31] for the training of deep CNNs got famous, and afterwards, this concept was used by most of the succeeding Nets, such as Inception-ResNet, WideResNet, ResNext, etc [32]–[34]. | 在CNN學(xué)習(xí)策略和體系結(jié)構(gòu)方面進(jìn)行了各種改進(jìn),使CNN能夠擴(kuò)展到大型復(fù)雜問(wèn)題。這些創(chuàng)新可分為參數(shù)優(yōu)化、正則化、結(jié)構(gòu)重構(gòu)等。然而,據(jù)觀察,在AlexNet在ImageNet數(shù)據(jù)集上的示范性能之后,基于CNN的應(yīng)用變得普遍[21]。因此,自2012年以來(lái),CNN提出了重大創(chuàng)新,主要?dú)w功于處理單元的重組和新區(qū)塊的設(shè)計(jì)。類(lèi)似地,Zeiler和Fergus[28]引入了特征分層可視化的概念,這改變了深度架構(gòu)(如VGG[29])中以低空間分辨率提取特征的趨勢(shì)。目前,大多數(shù)新的體系結(jié)構(gòu)都是基于VGG提出的簡(jiǎn)單、同質(zhì)的拓?fù)浣Y(jié)構(gòu)原理。另一方面,Google group引入了一個(gè)有趣的拆分、轉(zhuǎn)換和合并的概念,相應(yīng)的塊稱(chēng)為inception塊。inception塊第一次給出了層內(nèi)分支的概念,允許在不同的空間尺度上抽象特征[30]。2015年,ResNet[31]提出的用于訓(xùn)練深層CNNs的skip連接的概念很出名,之后,這個(gè)概念被大多數(shù)后續(xù)網(wǎng)絡(luò)使用,如Inception ResNet、WideResNet、ResNext等[32]-[34]。 |
| ? ? ? ? In order to improve the learning capacity of a CNN, different architectural designs such as WideResNet, Pyramidal Net, Xception etc. explored the effect of multilevel transformations in terms of an additional cardinality and increase in width [32], [34], [35]. Therefore, the focus of research shifted from parameter optimization and connections readjustment towards improved architectural design (layer structure) of the network. This shift resulted in many new architectural ideas such as channel boosting, spatial and channel wise exploitation and attention based information processing etc. [36]–[38]. | 為了提高CNN的學(xué)習(xí)能力,不同的結(jié)構(gòu)設(shè)計(jì),如WideResNet、金字塔網(wǎng)、exception等,從增加基數(shù)和增加寬度的角度探討了多級(jí)轉(zhuǎn)換的效果[32]、[34]、[35]。因此,研究的重點(diǎn)從網(wǎng)絡(luò)的參數(shù)優(yōu)化和連接調(diào)整轉(zhuǎn)向網(wǎng)絡(luò)的改進(jìn)結(jié)構(gòu)設(shè)計(jì)(層結(jié)構(gòu))。這種轉(zhuǎn)變產(chǎn)生了許多新的架構(gòu)思想,如信道增強(qiáng)、空間和信道利用以及基于注意力的信息處理等[36]-[38]。 |
| In the past few years, different interesting surveys are conducted on deep CNNs that elaborate the basic components of CNN and their alternatives. The survey reported by [39] has reviewed the famous architectures from 2012-2015 along with their components. Similarly, in the?literature, there are prominent surveys that discuss different algorithms of CNN and focus on applications of CNN [20], [26], [27], [40], [41]. Likewise, the survey presented in [42] discussed taxonomy of CNNs based on acceleration techniques. On the other hand, in this survey, we discuss the intrinsic taxonomy present in the recent and prominent CNN architectures. The various CNN architectures discussed in this survey can be broadly classified into seven main categories namely; spatial exploitation, depth, multi-path, width, feature map exploitation, channel boosting, and attention based CNNs. The rest of the paper is organized in the following order (shown in Fig. 1): Section 1 summarizes the underlying basics of CNN, its resemblance with primate’s visual cortex, as well as its contribution in MV. In this regard, Section 2 provides the overview on basic CNN components and Section 3 discusses the architectural evolution of deep CNNs. Whereas, Section 4, discusses the recent innovations in CNN architectures and categorizes CNNs into seven broad classes. Section 5 and 6 shed light on applications of CNNs and current challenges, whereas section 7 discusses future work and last section draws conclusion. | 在過(guò)去的幾年里,對(duì)深度CNN進(jìn)行了不同有趣的調(diào)查,闡述了CNN的基本組成部分及其替代方案。[39]報(bào)告的調(diào)查回顧了2012-2015年著名架構(gòu)及其組成部分。類(lèi)似地,在文獻(xiàn)中,有一些著名的調(diào)查討論了CNN的不同算法,并著重于CNN的應(yīng)用[20]、[26]、[27]、[40]、[41]。同樣,在[42]中提出的調(diào)查討論了基于加速技術(shù)的CNNs分類(lèi)。另一方面,在這項(xiàng)調(diào)查中,我們討論了在最近和著名的CNN架構(gòu)中存在的內(nèi)在分類(lèi)法。本次調(diào)查中討論的各種CNN架構(gòu)大致可分為七大類(lèi),即:空間開(kāi)發(fā)、深度、多徑、寬度、特征地圖開(kāi)發(fā)、信道增強(qiáng)和基于注意力的CNN。論文的其余部分按以下順序組織(如圖1所示):第1節(jié)總結(jié)了CNN的基本原理,它與靈長(zhǎng)類(lèi)視覺(jué)皮層的相似性,以及它在MV中的貢獻(xiàn)。在這方面,第2節(jié)概述了基本CNN組件,第3節(jié)討論了deep CNNs的體系結(jié)構(gòu)演變。第4節(jié)討論了CNN體系結(jié)構(gòu)的最新創(chuàng)新,并將CNN分為七大類(lèi)。第5節(jié)和第6節(jié)闡述了CNNs的應(yīng)用和當(dāng)前面臨的挑戰(zhàn),第7節(jié)討論了未來(lái)的工作,最后一節(jié)得出結(jié)論。 |
Fig. 1: Organization of the survey paper.
2 Basic CNN Components
| ? ? ? ??Nowadays, the CNN is considered the most widely used ML technique, especially in vision-related applications. CNNs have recently shown state-of-the-art results in various ML applications. A typical block diagram of an ML system is shown in Fig. 2. Since a CNN possesses both good feature extraction and strong discrimination ability, it is mostly used for feature extraction and classification in an ML system. | ? ?目前,CNN被认为是应用最广泛的ML技术,尤其是在视觉相关应用中。CNN最近在各种ML应用中都展示了最先进的结果。ML系统的典型框图如图2所示。由于CNN兼具良好的特征提取能力和较强的判别能力,因此在ML系统中主要用于特征提取和分类。 |
| A typical CNN architecture generally comprises alternating layers of convolution and pooling, followed by one or more fully connected layers at the end. In some cases, the fully connected layer is replaced with a global average pooling layer. In addition to the various learning stages, different regulatory units such as batch normalization and dropout are also incorporated to optimize CNN performance [43]. The arrangement of CNN components plays a fundamental role in designing new architectures and thus achieving enhanced performance. This section briefly discusses the role of these components in CNN architecture. | 典型的CNN体系结构通常由交替的卷积层和池化层组成,最后是一个或多个全连接层。在某些情况下,全连接层会被全局平均池化层取代。除了各个学习阶段之外,还会加入批归一化(batch normalization)和dropout等调节单元来优化CNN的性能[43]。CNN各组件的排布方式对设计新架构并取得更好性能起着基础性作用。本节简要讨论这些组件在CNN架构中的作用。 |
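As a concrete illustration of the layout described above (alternating convolution and pooling, regulatory units such as batch normalization and dropout, and a global-average-pooling plus fully connected head), here is a minimal PyTorch sketch. All layer sizes are arbitrary choices for demonstration and are not taken from the survey.

```python
import torch
import torch.nn as nn

# Illustrative only: a small CNN following the typical layout described above.
class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer
            nn.BatchNorm2d(16),                           # batch normalization
            nn.ReLU(inplace=True),                        # non-linear processing unit
            nn.MaxPool2d(2),                              # subsampling (pooling)
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)                # global average pooling head
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),                            # dropout for regularization
            nn.Linear(32, num_classes),                   # fully connected layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.gap(self.features(x)))

# Example: a batch of four 32x32 RGB images produces four class-score vectors.
logits = SmallCNN()(torch.randn(4, 3, 32, 32))
print(logits.shape)  # torch.Size([4, 10])
```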
2.1 Convolutional Layer
| ? ? ? ??The convolutional layer is composed of a set of convolutional kernels (each neuron acts as a kernel). These kernels are associated with a small area of the image known as a receptive field. The layer works by dividing the image into small blocks (receptive fields) and convolving them with a specific set of weights (multiplying elements of the filter with the corresponding receptive field elements) [43]. The convolution operation can be expressed as follows: | ? ? ? ?卷积层由一组卷积核组成(每个神经元充当一个核)。这些核与图像中被称为感受野的一小块区域相关联。卷积层的工作方式是将图像分割成小块(感受野),并用一组特定的权重与其做卷积(将滤波器的元素与对应感受野的元素相乘)[43]。卷积运算可以表示如下: |
| $F_l^k = I_{x,y} * K_l^k$ ?(1). Here the input image is represented by $I_{x,y}$, $x, y$ denote spatial locality, and $K_l^k$ represents the $l$th convolutional kernel of the $k$th layer. Division of the image into small blocks helps in extracting locally correlated pixel values. This locally aggregated information is also known as a feature motif. Different sets of features within the image are extracted by sliding the convolutional kernel over the whole image with the same set of weights. This weight-sharing property of the convolution operation makes CNN parameters more efficient compared to fully connected networks. Convolution operations can further be categorized into different types based on the type and size of filters, the type of padding, and the direction of convolution [44]. Additionally, if the kernel is symmetric, the convolution operation becomes a correlation operation [16]. | $F_l^k = I_{x,y} * K_l^k$ ?(1)。其中输入图像用 $I_{x,y}$ 表示,$x, y$ 表示空间位置,$K_l^k$ 表示第 $k$ 层的第 $l$ 个卷积核。将图像分割成小块有助于提取局部相关的像素值,这种局部聚集的信息也被称为特征模体。在相同的权值集下,通过在整幅图像上滑动卷积核来提取图像中不同的特征集。与全连接网络相比,卷积运算的这种权值共享特性使CNN的参数更为高效。卷积操作还可以根据滤波器的类型和大小、填充的类型以及卷积的方向分为不同的类型[44]。另外,如果核是对称的,卷积操作就变成了相关操作[16]。 |
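The following NumPy sketch implements the sliding-window operation of equation (1) for a single channel and a single kernel (stride 1, no padding). It is a didactic illustration only; it also shows why, for a symmetric kernel, convolution and correlation coincide, since the only difference between the two is the kernel flip.

```python
import numpy as np

def conv2d_single(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid 2-D convolution of one channel with one kernel (no padding, stride 1).

    The kernel is flipped, as in the mathematical definition; without the flip
    the same loop computes cross-correlation, which is what most deep learning
    frameworks actually implement.
    """
    k = np.flipud(np.fliplr(kernel))
    kh, kw = k.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for y in range(oh):                                  # slide the receptive field
        for x in range(ow):
            receptive_field = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(receptive_field * k)      # weighted sum over the block
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)           # simple vertical-edge detector
print(conv2d_single(image, edge_kernel).shape)           # (3, 3)
```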
2.2 Pooling Layer
| ? ? ? ??Feature motifs, which result as an output of the convolution operation, can occur at different locations in the image. Once features are extracted, their exact location becomes less important as long as their approximate position relative to others is preserved. Pooling, or down-sampling, is like convolution an interesting local operation. It sums up similar information in the neighborhood of the receptive field and outputs the dominant response within this local region [45]. | ? ? ? ?作为卷积运算输出的特征模体可能出现在图像的不同位置。特征一旦被提取出来,只要其相对于其他特征的大致位置保持不变,其精确位置就变得不那么重要了。池化(或下采样)与卷积一样,是一种有趣的局部运算。它汇总感受野邻域内的相似信息,并输出该局部区域内的主导响应[45]。 |
| Equation (2) shows the pooling operation: $Z_l = f_p(F_{x,y}^l)$ ?(2), in which $Z_l$ represents the $l$th output feature map, $F_{x,y}^l$ shows the $l$th input feature map, whereas $f_p(\cdot)$ defines the type of pooling operation. The use of the pooling operation helps to extract a combination of features that are invariant to translational shifts and small distortions [13], [46]. Reduction of the feature map to an invariant feature set not only regulates the complexity of the network but also helps in increasing generalization by reducing overfitting. Different types of pooling formulations such as max, average, L2, overlapping, and spatial pyramid pooling are used in CNNs [47]–[49]. | ? ? ? ? 式(2)给出了池化操作:$Z_l = f_p(F_{x,y}^l)$ ?(2),其中 $Z_l$ 表示第 $l$ 个输出特征图,$F_{x,y}^l$ 表示第 $l$ 个输入特征图,$f_p(\cdot)$ 定义池化操作的类型。池化操作有助于提取对平移和小幅畸变保持不变的特征组合[13],[46]。将特征图缩减为不变特征集,不仅可以控制网络的复杂度,还能通过减少过拟合来提高泛化能力。CNN中使用了多种池化形式,如最大池化、平均池化、L2池化、重叠池化、空间金字塔池化等[47]–[49]。 |
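A small NumPy sketch of the pooling operator $f_p(\cdot)$ in equation (2), supporting max and average pooling over non-overlapping windows. The window size and stride are illustrative defaults, not values prescribed by the survey.

```python
import numpy as np

def pool2d(feature_map: np.ndarray, size: int = 2, stride: int = 2,
           mode: str = "max") -> np.ndarray:
    """Max or average pooling over one feature map (the f_p of Eq. 2)."""
    h, w = feature_map.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            window = feature_map[y * stride:y * stride + size,
                                 x * stride:x * stride + size]
            out[y, x] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.array([[1., 3., 2., 0.],
                 [4., 6., 1., 2.],
                 [0., 2., 5., 7.],
                 [1., 1., 3., 4.]])
print(pool2d(fmap))   # [[6. 2.] [2. 7.]] -- the dominant response in each window
```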
2.3 Activation Function
| ? ? ? ??Activation function serves as a decision function and helps in learning a complex pattern. Selection of an appropriate activation function can accelerate the learning process. Activation function for a convolved feature map is defined in equation (3). | ? ? ? ?激活函數(shù)作為一個(gè)決策函數(shù),有助于學(xué)習(xí)一個(gè)復(fù)雜的模式。選擇合適的激活函數(shù)可以加速學(xué)習(xí)過(guò)程。卷積特征映射的激活函數(shù)在方程(3)中定義。 |
| $T_l^k = f_A(F_l^k)$ ?(3). In the above equation, $F_l^k$ is the output of a convolution operation, which is assigned to the activation function $f_A(\cdot)$ that adds non-linearity and returns a transformed output $T_l^k$ for the $k$th layer. In the literature, different activation functions such as sigmoid, tanh, maxout, ReLU, and variants of ReLU such as leaky ReLU, ELU, and PReLU [39], [48], [50], [51] are used to inculcate non-linear combinations of features. However, ReLU and its variants are preferred over other activations, as they help in overcoming the vanishing gradient problem [52], [53]. | ? ? ?$T_l^k = f_A(F_l^k)$ ?(3)。式中,$F_l^k$ 是卷积运算的输出,它被送入激活函数 $f_A(\cdot)$;该函数引入非线性,并为第 $k$ 层返回变换后的输出 $T_l^k$。文献中使用了不同的激活函数,如sigmoid、tanh、maxout、ReLU,以及ReLU的变体(如leaky ReLU、ELU和PReLU)[39]、[48]、[50]、[51],来构建特征的非线性组合。不过,ReLU及其变体比其他激活函数更受青睐,因为它们有助于克服梯度消失问题[52]、[53]。 |
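For reference, here are minimal NumPy versions of several of the activation functions $f_A(\cdot)$ named above. These are standard textbook definitions, shown only to make the element-wise nature of equation (3) explicit.

```python
import numpy as np

# Common activation functions applied element-wise to a convolved feature map.
def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

feature_map = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(feature_map))         # roughly [0. 0. 0. 1.5]
print(leaky_relu(feature_map))   # roughly [-0.02 -0.005 0. 1.5]
```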
| ? ? ? ? ?Fig. 2: Basic layout of a typical ML system. In ML related tasks, initially data is preprocessed and then assigned to a classification system. A typical ML problem follows three steps: stage 1 is related to data gathering and generation, stage 2 performs preprocessing and feature selection, whereas stage 3 is based on model selection, parameter tuning, and analysis. CNN has a good feature extraction and strong discrimination ability, therefore in a ML system; it can be used for feature extraction and classification. | ? ?圖2:典型ML系統(tǒng)的基本布局。在與ML相關(guān)的任務(wù)中,首先對(duì)數(shù)據(jù)進(jìn)行預(yù)處理,然后將其分配給分類(lèi)系統(tǒng)。一個(gè)典型的ML問(wèn)題有三個(gè)步驟:階段1與數(shù)據(jù)收集和生成相關(guān),階段2執(zhí)行預(yù)處理和特征選擇,而階段3基于模型選擇、參數(shù)調(diào)整和分析。CNN具有很好的特征提取能力和較強(qiáng)的識(shí)別能力,因此在ML系統(tǒng)中可以用于特征提取和分類(lèi)。 |
2.4 Batch Normalization
Note: in the blogger's experience, this part is a frequent exam point!
| ? ? ? ??Batch normalization is used to address the issues related to the internal covariate shift within feature maps. The internal covariate shift is a change in the distribution of hidden units' values, which slows down convergence (by forcing the learning rate to a small value) and requires careful initialization of parameters. Batch normalization for a transformed feature map $T_l^k$ is shown in equation (4). | ? ? ? ?批归一化(batch normalization)用于解决特征图内部协变量偏移(internal covariate shift)带来的问题。内部协变量偏移是指隐藏单元取值分布的变化,它会减慢收敛速度(迫使学习率取较小的值),并且要求谨慎地初始化参数。针对变换后的特征图 $T_l^k$ 的批归一化如式(4)所示。 |
| $N_l^k = \frac{F_l^k - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}$ ?(4). In equation (4), $N_l^k$ represents the normalized feature map, $F_l^k$ is the input feature map, and $\mu_B$ and $\sigma_B^2$ depict the mean and variance of a feature map for a mini-batch, respectively ($\varepsilon$ is a small constant added for numerical stability). Batch normalization unifies the distribution of feature map values by bringing them to zero mean and unit variance [54]. Furthermore, it smoothens the flow of the gradient and acts as a regulating factor, which thus helps in improving the generalization of the network. | ? ? ?$N_l^k = \frac{F_l^k - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}$ ?(4)。式中,$N_l^k$ 表示归一化后的特征图,$F_l^k$ 是输入特征图,$\mu_B$ 和 $\sigma_B^2$ 分别表示一个mini-batch内特征图的均值和方差($\varepsilon$ 为保证数值稳定的小常数)。批归一化将特征图的取值变换为零均值、单位方差,从而统一其分布[54]。此外,它还能平滑梯度的流动,起到调节因子的作用,从而有助于提高网络的泛化能力。 |
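A NumPy sketch of equation (4) for a mini-batch, including the learnable scale (gamma) and shift (beta) that frameworks apply after normalization. The epsilon term is the usual numerical-stability constant, and the values shown are illustrative only.

```python
import numpy as np

def batch_norm(batch: np.ndarray, gamma: float = 1.0, beta: float = 0.0,
               eps: float = 1e-5) -> np.ndarray:
    """Normalize a mini-batch of feature-map values to zero mean and unit
    variance (Eq. 4), then apply the learnable scale (gamma) and shift (beta)."""
    mu = batch.mean(axis=0)        # mini-batch mean
    var = batch.var(axis=0)        # mini-batch variance
    normalized = (batch - mu) / np.sqrt(var + eps)
    return gamma * normalized + beta

activations = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
out = batch_norm(activations)
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))  # ~[0. 0.] and ~[1. 1.]
```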
2.5 Dropout
| ? ? ? ??Dropout introduces regularization within the network, which ultimately improves generalization by randomly skipping some units or connections with a certain probability. In NNs, multiple connections that learn a non-linear relation are sometimes co-adapted, which causes overfitting [55]. This random dropping of some connections or units produces several thinned network architectures, and finally one representative network is selected with small weights. This selected architecture is then considered as an approximation of all of the proposed networks [56]. | ? ? ? ?Dropout在網(wǎng)絡(luò)中引入正則化,通過(guò)隨機(jī)跳過(guò)某些具有一定概率的單元或連接,最終提高泛化能力。在NNs中,學(xué)習(xí)非線性關(guān)系的多個(gè)連接有時(shí)是協(xié)同適應(yīng)的,這會(huì)導(dǎo)致過(guò)度擬合[55]。一些連接或單元的隨機(jī)丟棄產(chǎn)生了幾種細(xì)化的網(wǎng)絡(luò)結(jié)構(gòu),最后選擇了一種具有代表性的網(wǎng)絡(luò)結(jié)構(gòu)。然后將所選擇的體系結(jié)構(gòu)看作是所提出的所有網(wǎng)絡(luò)的近似〔56〕。 |
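The sketch below shows "inverted" dropout, the variant used by most modern frameworks: units are dropped with probability p during training and the survivors are rescaled, so no change is needed at test time. The drop probability of 0.5 is only an example.

```python
import numpy as np

def dropout(x: np.ndarray, p: float = 0.5, training: bool = True) -> np.ndarray:
    """Inverted dropout: randomly zero units with probability p during training
    and rescale the survivors so the expected activation stays unchanged."""
    if not training or p == 0.0:
        return x
    mask = (np.random.rand(*x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)

units = np.ones((2, 8))
print(dropout(units, p=0.5))  # roughly half the entries are 0, the rest are 2.0
```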
2.6 Fully Connected Layer
| ? ? ? ?The fully connected layer is mostly used at the end of the network for classification purposes. Unlike pooling and convolution, it is a global operation. It takes input from the previous layer and globally analyses the output of all the preceding layers [57]. This makes a non-linear combination of selected features, which is then used for the classification of data [58]. | ? ? ? ?全连接层大多用在网络的末端,用于分类。与池化和卷积不同,它是一种全局操作。它接收前一层的输入,并对前面所有层的输出进行全局分析[57],由此形成所选特征的非线性组合,用于数据的分类[58]。 |
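To contrast the global nature of the fully connected layer with the local convolution and pooling operations above, here is a minimal NumPy sketch; the feature-map and class counts are hypothetical.

```python
import numpy as np

def fully_connected(features: np.ndarray, weights: np.ndarray,
                    bias: np.ndarray) -> np.ndarray:
    """A fully connected (dense) layer: every output unit sees every input feature."""
    flat = features.reshape(features.shape[0], -1)  # flatten all preceding feature maps
    return flat @ weights + bias                    # global, non-local combination

# Hypothetical sizes: 32 feature maps of 4x4 flattened to 512 inputs, 10 classes.
batch = np.random.randn(8, 32, 4, 4)
w = np.random.randn(32 * 4 * 4, 10) * 0.01
b = np.zeros(10)
print(fully_connected(batch, w, b).shape)           # (8, 10)
```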
Fig. 3: Evolutionary history of deep CNNs
3 Architectural Evolution of Deep CNN
| ? ? ? ?Nowadays, CNNs are considered as the most widely used algorithms among biologically inspired AI techniques. CNN history begins from the neurobiological experiments conducted by Hubel and Wiesel (1959, 1962) [14], [59]. Their work provided a platform for many cognitive models, almost all of which were latterly replaced by CNN. Over the decades, different efforts have been carried out to improve the performance of CNNs. This history is pictorially represented in Fig. 3. These improvements can be categorized into five different eras and are discussed below. | ? ? ? ?目前,CNNs被認(rèn)為是生物人工智能技術(shù)中應(yīng)用最廣泛的算法。CNN的歷史始于Hubel和Wiesel(19591962)[14],[59]進(jìn)行的神經(jīng)生物學(xué)實(shí)驗(yàn)。他們的工作為許多認(rèn)知模型提供了一個(gè)平臺(tái),幾乎所有的認(rèn)知模型都被CNN所取代。幾十年來(lái),人們一直在努力提高CNNs的性能。這段歷史在圖3中用圖形表示這些改進(jìn)可以分為五個(gè)不同的時(shí)代,并在下面討論。 |
3.1 Late 1980s-1999: Origin of CNN
| ? ? ? ?CNNs have been applied to visual tasks since the late 1980s. In 1989, LeCuN et al. proposed the first multilayered CNN named as ConvNet, whose origin rooted in Fukushima’s Neocognitron [60], [61]. LeCuN proposed supervised training of ConvNet, using Backpropagation algorithm [7], [62] in comparison to the unsupervised reinforcement learning scheme used by its predecessor Neocognitron. LeCuN’s work thus made a foundation for the modern 2D CNNs. Supervised training in CNN provides the automatic feature learning ability from raw input, rather than designing of handcrafted features, used by traditional ML methods. This ConvNet showed successful results for handwritten digit and zip code recognition related problems [63]. In 1998, ConvNet was improved by LeCuN and used for classifying characters in a document recognition application [64]. This modified architecture was named as LeNet-5, which was an improvement over the initial CNN as it can extract feature representation in a hierarchical way from raw pixels [65]. Reliance of LeNet-5 on fewer parameters along with consideration of spatial topology of images enabled CNN to recognize rotational variants of the image [65]. Due to the good performance of CNN in optical character recognition, its commercial use in ATM and Banks started in 1993 and 1996, respectively. Though, many successful milestones were achieved by LeNet-5, yet the main concern associated with it was that its discrimination power was not scaled to classification tasks other than hand recognition. | ? ? ? ?自20世紀(jì)80年代末以來(lái),CNNs已經(jīng)被應(yīng)用于視覺(jué)任務(wù)中。提出了第一個(gè)叫做ConvNet的多層CNN,其起源于Fukushima’s 的Neocognitron[60],[61]。LeCuN提出了ConvNet的有監(jiān)督訓(xùn)練,使用了Backpropagation算法[7],[62],與其前身Neocognitron使用的無(wú)監(jiān)督強(qiáng)化學(xué)習(xí)方案相比。他的作品為現(xiàn)代2D CNN奠定了基礎(chǔ)。CNN中的監(jiān)督訓(xùn)練提供了從原始輸入中自動(dòng)學(xué)習(xí)特征的能力,而不是傳統(tǒng)ML方法所使用的手工特征的設(shè)計(jì)。這個(gè)ConvNet顯示了手寫(xiě)數(shù)字和郵政編碼識(shí)別相關(guān)問(wèn)題的成功結(jié)果[63]。1998年,LeCuN改進(jìn)了ConvNet,并將其用于文檔識(shí)別應(yīng)用程序中的字符分類(lèi)[64]。這種改進(jìn)的結(jié)構(gòu)被命名為L(zhǎng)eNet-5,這是對(duì)初始CNN的改進(jìn),因?yàn)樗梢詮脑枷袼刂幸苑謱拥姆绞教崛√卣鞅硎綶65]。LeNet-5對(duì)較少參數(shù)的依賴(lài)以及對(duì)圖像空間拓?fù)涞目紤]使得CNN能夠識(shí)別圖像的旋轉(zhuǎn)變體[65]。由于CNN在光學(xué)字符識(shí)別方面的良好性能,其在ATM和銀行的商業(yè)應(yīng)用分別始于1993年和1996年。盡管LeNet-5取得了許多成功的里程碑,但與之相關(guān)的主要問(wèn)題是它的辨別能力并沒(méi)有擴(kuò)展到除手識(shí)別以外的分類(lèi)任務(wù)。 |
3.2 Early 2000: Stagnation of CNN
| ? ? ? ?In the late 1990s and early 2000s, interest in NNs reduced and less attention was given to explore the role of CNNs in different applications such as object detection, video surveillance, etc. Use of CNN in ML related tasks became dormant due to the insignificant improvement in performance at the cost of high computational time. At that time, other statistical methods and, in particular, SVM became more popular than CNN due to its relatively high performance [66]–[68]. It was widely presumed in early 2000 that the backpropagation algorithm used for training of CNN was not effective in converging to optimal points and therefore unable to learn useful features in supervised fashion as compared to handcrafted features [69]. Meanwhile, different researchers kept working on CNN and tried to optimize its performance. In 2003, Simard et al. improved CNN architecture and showed good results as compared to SVM on a hand digit benchmark dataset; MNIST [64], [68], [70]–[72]. This performance improvement expedited the research in CNN by extending its application in optical character recognition (OCR) to other script’s character recognition [72]–[74], deployment in image sensors for face detection in video conferencing, and regulation of street crimes, etc. Likewise, CNN based systems were industrialized in markets for tracking customers [75]–[77]. Moreover, CNN’s potential in other applications such as medical image segmentation, anomaly detection, and robot vision was also explored [78]–[80]. | ? ? ? ?在20世紀(jì)90年代末和21世紀(jì)初,人們對(duì)神經(jīng)網(wǎng)絡(luò)的興趣逐漸減少,對(duì)神經(jīng)網(wǎng)絡(luò)在目標(biāo)檢測(cè)、視頻監(jiān)控等不同應(yīng)用中的作用的研究也越來(lái)越少。由于性能上的顯著提高,在ML相關(guān)任務(wù)中使用神經(jīng)網(wǎng)絡(luò)以犧牲較高的計(jì)算時(shí)間而變得不活躍。當(dāng)時(shí),其他統(tǒng)計(jì)方法,特別是支持向量機(jī),由于其相對(duì)較高的性能而變得比CNN更受歡迎[66]-[68]。2000年初,人們普遍認(rèn)為,用于CNN訓(xùn)練的反向傳播算法在收斂到最優(yōu)點(diǎn)方面并不有效,因此與手工制作的特征相比,無(wú)法以監(jiān)督方式學(xué)習(xí)有用的特征[69]。與此同時(shí),不同的研究人員繼續(xù)研究CNN,并試圖優(yōu)化其性能。2003年,Simard等人。改進(jìn)了CNN的體系結(jié)構(gòu),與支持向量機(jī)相比,在一個(gè)手寫(xiě)數(shù)字基準(zhǔn)數(shù)據(jù)集上顯示了良好的結(jié)果;MNIST[64],[68],[70]–[72]。這種性能的提高加速了CNN的研究,將其在光學(xué)字符識(shí)別(OCR)中的應(yīng)用擴(kuò)展到其他腳本的字符識(shí)別[72]-[74],在視頻會(huì)議中部署用于面部檢測(cè)的圖像傳感器,以及對(duì)街頭犯罪的監(jiān)管等。同樣,基于CNN的系統(tǒng)也在市場(chǎng)上實(shí)現(xiàn)了工業(yè)化用于跟蹤客戶(hù)[75]–[77]。此外,CNN在醫(yī)學(xué)圖像分割、異常檢測(cè)和機(jī)器人視覺(jué)等其他應(yīng)用領(lǐng)域的潛力也得到了探索[78]-[80]。 |
3.3 2006-2011: Revival of CNN
| ? ? ? ?Deep NNs have generally complex architecture and time intensive training phase that sometimes spanned over weeks and even months. In early 2000, there were only a few techniques for the training of deep Networks. Additionally, it was considered that CNN is not able to scale for complex problems. These challenges halted the use of CNN in ML related tasks. | ? ? ? ?深度NNs通常具有復(fù)雜的結(jié)構(gòu)和時(shí)間密集型訓(xùn)練階段,有時(shí)跨越數(shù)周甚至數(shù)月。在2000年初,只有少數(shù)技術(shù)用于訓(xùn)練深層網(wǎng)絡(luò)。此外,有人認(rèn)為CNN無(wú)法擴(kuò)展到復(fù)雜的問(wèn)題。這些挑戰(zhàn)阻止了CNN在ML相關(guān)任務(wù)中的應(yīng)用。 ? ? ? ? ? ?? |
| ? ? ? To address these problems, in 2006 many interesting methods were reported to overcome the difficulties encountered in the training of deep CNNs and learning of invariant features. Hinton proposed greedy layer-wise pre-training approach in 2006, for deep architectures, which revived and reinstated the importance of deep learning [81], [82]. The revival of a deep learning [83], [84] was one of the factors, which brought deep CNNs into the limelight. Huang et al. (2006) used max pooling instead of subsampling, which showed good results by learning of invariant features [46], [85].? ? ? ? ? ? ? ? ? ? | 為了解決這些問(wèn)題,2006年報(bào)道了許多有趣的方法來(lái)克服在訓(xùn)練深層CNNs和學(xué)習(xí)不變特征方面遇到的困難。Hinton在2006年提出了貪婪的分層預(yù)訓(xùn)練方法,用于深層架構(gòu),這恢復(fù)了深層學(xué)習(xí)的重要性[81],[82]。深度學(xué)習(xí)的復(fù)興[83],[84]是其中的一個(gè)因素,這使深度cnn成為了焦點(diǎn)。Huang等人。(2006)使用最大值池代替子采樣,通過(guò)學(xué)習(xí)不變特征顯示了良好的結(jié)果[46],[85] |
| ? ? ? ? In late 2006, researchers started using graphics processing units (GPUs) [86], [87] to accelerate training of deep NN and CNN architectures [88], [89]. In 2007, NVIDIA launched the CUDA programming platform [90], [91], which allows exploitation of parallel processing capabilities of GPU with a much greater degree [92]. In essence, the use of GPUs for NN training [88], [93] and other hardware improvements were the main factor, which revived the research in CNN. In 2010, Fei-Fei Li’s group at Stanford, established a large database of images known as ImageNet, containing millions of labeled images [94]. This database was coupled with the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) competitions, where the performances of various models have been evaluated and scored [95]. Consequently, ILSVRC and NIPS have been very active in strengthening research and increasing the use of CNN and thus making it popular. This was a turning point in improving the performance and increasing the use of CNN. | 2006年末,研究人員開(kāi)始使用圖形處理單元(GPU)[86],[87]來(lái)加速深度神經(jīng)網(wǎng)絡(luò)和CNN架構(gòu)的訓(xùn)練[88],[89]。2007年,NVIDIA推出了CUDA編程平臺(tái)[90],[91],它允許在更大程度上利用GPU的并行處理能力[92]。從本質(zhì)上講,GPUs在神經(jīng)網(wǎng)絡(luò)訓(xùn)練中的應(yīng)用[88]、[93]和其他硬件的改進(jìn)是主要因素,這使CNN的研究重新活躍起來(lái)。2010年,李飛飛在斯坦福大學(xué)的團(tuán)隊(duì)建立了一個(gè)名為ImageNet的大型圖像數(shù)據(jù)庫(kù),其中包含數(shù)百萬(wàn)個(gè)標(biāo)記圖像[94]。該數(shù)據(jù)庫(kù)與年度ImageNet大型視覺(jué)識(shí)別挑戰(zhàn)賽(ILSVRC)相結(jié)合,對(duì)各種模型的性能進(jìn)行了評(píng)估和評(píng)分[95]。因此,ILSVRC和NIPS在加強(qiáng)研究和增加CNN的使用方面非常積極,從而使其流行起來(lái)。這是一個(gè)轉(zhuǎn)折點(diǎn),在提高性能和增加使用有線電視新聞網(wǎng)。 |
3.4 2012-2014: Rise of CNN
| ? ? ? ?Availability of big training data, hardware advancements, and computational resources contributed to advancement in CNN algorithms. Renaissance of CNN in object detection, image classification, and segmentation related tasks had been observed in this period [9], [96]. However, the success of CNN in image classification tasks was not only due to the result of aforementioned factors but largely contributed by the architectural modifications, parameter optimization, incorporation of regulatory units, and reformulation and readjustment of connections within the network [39], [42], [97]. | ? ? ? ?大訓(xùn)練數(shù)據(jù)的可用性、硬件的先進(jìn)性和計(jì)算資源有助于CNN算法的進(jìn)步。CNN在目標(biāo)檢測(cè)、圖像分類(lèi)和與分割相關(guān)的任務(wù)方面的復(fù)興在這一時(shí)期已經(jīng)被觀察到了[9],[96]。然而,CNN在圖像分類(lèi)任務(wù)中的成功不僅是由于上述因素的結(jié)果,而且在很大程度上是由于結(jié)構(gòu)的修改、參數(shù)的優(yōu)化、調(diào)節(jié)單元的合并以及網(wǎng)絡(luò)內(nèi)連接的重新制定和調(diào)整[39]、[42]、[97]。γi |
| ? ? ? ?The main breakthrough in CNN performance was brought by AlexNet [21]. AlexNet won the 2012-ILSVRC competition, which has been one of the most difficult challenges in image detection and classification. AlexNet improved performance by exploiting depth (incorporating multiple levels of transformation) and introduced regularization term in CNN. The exemplary performance of AlexNet [21] compared to conventional ML techniques in 2012-ILSVRC (AlexNet reduced error rate from 25.8 to 16.4) suggested that the main reason of the saturation in CNN performance before 2006 was largely due to the unavailability of enough training data and?computational resources. In summary, before 2006, these resource deficiencies made it hard to train a high-capacity CNN without deterioration of performance [98].? ? ? ? ? ? ? | CNN的主要突破是由AlexNet帶來(lái)的[21]。AlexNet贏得了2012-ILSVRC比賽,這是圖像檢測(cè)和分類(lèi)領(lǐng)域最困難的挑戰(zhàn)之一。AlexNet利用深度(包含多個(gè)層次的轉(zhuǎn)換)提高了性能,并在CNN中引入了正則化項(xiàng)。與2012-ILSVRC(AlexNet將錯(cuò)誤率從25.8降低到16.4)中的傳統(tǒng)ML技術(shù)相比,AlexNet的示例性性能[21]表明,2006年之前CNN性能飽和的主要原因是缺乏足夠的訓(xùn)練數(shù)據(jù)和計(jì)算資源。總之,在2006年之前,這些資源不足使得在不降低性能的情況下難以訓(xùn)練高容量CNN[98] |
| ? ? ? With CNN becoming more of a commodity in the computer vision (CV) field, a number of attempts have been made to improve the performance of CNN with reduced computational cost. Therefore, each new architecture try to overcome the shortcomings of previously proposed architecture in combination with new structural reformulations. In year 2013 and 2014, researchers mainly focused on parameter optimization to accelerate CNN performance in a range of applications with a small increase in computational complexity. In 2013, Zeiler and Fergus [28] defined a mechanism to visualize learned filters of each CNN layer. Visualization approach was used to improve the feature extraction stage by reducing the size of the filters. Similarly, VGG architecture [29] proposed by the Oxford group, which was runner-up at the 2014-ILSVRC competition, made the receptive field much smaller in comparison to that of AlexNet but, with increased volume. In VGG, depth was increased from 9 layers to 16, by making the volume of features maps double at each layer. In the same year, GoogleNet [99] that won 2014-ILSVRC competition, not only exerted its efforts to reduce computational cost by changing layer design, but also widened the width in compliance with depth to improve CNN performance. GoogleNet introduced the concept of split, transform, and merge based blocks, within which multiscale and multilevel transformation is incorporated to capture both local and global information [33], [99], [100]. The use of multilevel transformations helps CNN in tackling details of images at various levels. In the year 2012-14, the main improvement in the learning capacity of CNN was achieved by increasing its depth and parameter optimization strategies. This suggested that the depth of a CNN helps in improving the performance of a classifier. | 隨著CNN在計(jì)算機(jī)視覺(jué)(CV)領(lǐng)域的應(yīng)用越來(lái)越廣泛,人們?cè)诮档陀?jì)算成本的前提下,對(duì)CNN的性能進(jìn)行了許多嘗試。因此,每一個(gè)新的架構(gòu)都試圖結(jié)合新的結(jié)構(gòu)重組來(lái)克服先前提出的建筑的缺點(diǎn)。在第2013和2014年,研究人員主要集中在參數(shù)優(yōu)化,以加速CNN在一系列應(yīng)用中的性能,計(jì)算復(fù)雜性的增加很小。2013年,Zeiler和Fergus[28]定義了一種機(jī)制,可以可視化每個(gè)CNN層的學(xué)習(xí)過(guò)濾器。采用可視化的方法,通過(guò)減小濾波器的尺寸來(lái)改善特征提取階段。同樣,在2014-ILSVRC競(jìng)賽中獲得亞軍的Oxford group提出的VGG架構(gòu)[29]也使得接受場(chǎng)比AlexNet小得多,但隨著體積的增加。在VGG中,深度從9層增加到16層,使每層的特征地圖體積加倍。同年,贏得2014-ILSVRC競(jìng)賽的GoogleNet[99]不僅努力通過(guò)改變層設(shè)計(jì)來(lái)降低計(jì)算成本,還根據(jù)深度拓寬了寬度以提高CNN性能。GoogleNet引入了基于分割、變換和合并的塊的概念,其中結(jié)合了多尺度和多級(jí)變換來(lái)捕獲局部和全局信息[33]、[99]、[100]。多級(jí)轉(zhuǎn)換的使用有助于CNN處理不同層次的圖像細(xì)節(jié)。2012-2014年,CNN的學(xué)習(xí)能力主要通過(guò)提高其深度和參數(shù)優(yōu)化策略來(lái)實(shí)現(xiàn)。這表明CNN的深度有助于提高分類(lèi)器的性能。 |
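The split, transform, and merge idea credited to GoogleNet above can be sketched as a block whose parallel branches use different kernel sizes and whose outputs are merged by concatenation. The PyTorch sketch below is only in the spirit of an inception block; the branch widths are arbitrary and do not reproduce the actual GoogleNet configuration.

```python
import torch
import torch.nn as nn

# Illustrative split-transform-merge block: the input is processed by parallel
# branches at different spatial scales, then the branch outputs are concatenated.
class SplitTransformMerge(nn.Module):
    def __init__(self, in_ch: int):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, 16, kernel_size=1),
                                     nn.Conv2d(16, 24, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(nn.Conv2d(in_ch, 8, kernel_size=1),
                                     nn.Conv2d(8, 8, kernel_size=5, padding=2))
        self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(in_ch, 8, kernel_size=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        branches = [self.branch1(x), self.branch3(x), self.branch5(x), self.pool(x)]
        return torch.cat(branches, dim=1)   # merge: 16 + 24 + 8 + 8 = 56 channels

print(SplitTransformMerge(32)(torch.randn(1, 32, 28, 28)).shape)  # [1, 56, 28, 28]
```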
3.5 2015-Present: Rapid increase in Architectural Innovations and Applications of CNN
| ? ? ? ?It is generally observed the major improvements in CNN performance occurred from 2015-2019. The research in CNN is still on going and has a significant potential of improvement. Representational capacity of CNN depends on its depth and in a sense can help in learning complex problems by defining diverse level of features ranging from simple to complex. Multiple levels of transformation make learning easy by chopping complex problems into?15 smaller modules. However, the main challenge faced by deep architectures is the problem of negative learning, which occurs due to diminishing gradient at lower layers of the network. To handle this problem, different research groups worked on readjustment of layers connections and design of new modules. In earlier 2015, Srivastava et al. used the concept of cross-channel connectivity and information gating mechanism to solve the vanishing gradient problem and to improve the network representational capacity [101]–[103]. This idea got famous in late 2015 and a similar concept of residual blocks or skip connections was coined [31]. Residual blocks are a variant of cross-channel connectivity, which smoothen learning by regularizing the flow of information across blocks [104]–[106]. This idea was used in ResNet architecture for the training of 150 layers deep network [31]. The idea of cross-channel connectivity is further extended to multilayer connectivity by Deluge, DenseNet, etc. to improve representation [107], [108]. | ? ? ? ?一般觀察到,CNN在2015-2019年的表現(xiàn)出現(xiàn)了重大改善。CNN的研究仍在進(jìn)行中,有很大的改進(jìn)潛力。CNN的表征能力取決于它的深度,在某種意義上可以通過(guò)定義從簡(jiǎn)單到復(fù)雜的不同層次的特征來(lái)幫助學(xué)習(xí)復(fù)雜的問(wèn)題。通過(guò)將復(fù)雜的問(wèn)題分解成15個(gè)較小的模塊,多層次的轉(zhuǎn)換使學(xué)習(xí)變得容易。然而,深度架構(gòu)面臨的主要挑戰(zhàn)是負(fù)學(xué)習(xí)問(wèn)題,這是由于網(wǎng)絡(luò)較低層的梯度減小而產(chǎn)生的。為了解決這個(gè)問(wèn)題,不同的研究小組致力于重新調(diào)整層連接和設(shè)計(jì)新的模塊。2015年初,Srivastava等人。利用跨通道連接和信息選通機(jī)制的概念解決了消失梯度問(wèn)題,提高了網(wǎng)絡(luò)的表示能力[101]–[103]。這一想法在2015年末變得很有名,并創(chuàng)造了類(lèi)似的剩余塊或跳過(guò)連接的概念[31]。剩余塊是跨信道連接的一種變體,它通過(guò)調(diào)整跨塊的信息流來(lái)平滑學(xué)習(xí)[104]–[106]。該思想被用于ResNet體系結(jié)構(gòu)中,用于150層深度網(wǎng)絡(luò)的訓(xùn)練[31]。為了改進(jìn)表示[107]、[108],通過(guò)Deluge、DenseNet等將跨信道連接的思想進(jìn)一步擴(kuò)展到多層連接。γi |
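A minimal PyTorch sketch of the residual (skip-connection) idea described above: the block input is added back to its transformed output, so gradients have a short path to lower layers. Real ResNet blocks additionally use striding and projection shortcuts when dimensions change; this simplified version assumes matching shapes.

```python
import torch
import torch.nn as nn

# Simplified residual block: output = ReLU(body(x) + x).
class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.body(x) + x)   # identity skip connection

print(ResidualBlock(16)(torch.randn(2, 16, 8, 8)).shape)  # torch.Size([2, 16, 8, 8])
```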
| ? ? ? ? In the year 2016, the width of the network was also explored in connection with depth to improve feature learning [34], [35]. Apart from this, no new architectural modification became prominent but instead, different researchers used hybrid of the already proposed architectures to improve deep CNN performance [33], [104]–[106], [109], [110]. This fact gave the intuition that there might be other factors more important as compared to the appropriate assembly of the network units that can effectively regulate CNN performance. In this regard, Hu et al. (2017) identified that the network representation has a role in learning of deep CNNs [111]. Hu et al. introduced the idea of feature map exploitation and pinpointed that less informative and domain extraneous features may affect the performance of the network to a larger extent. He exploited the aforementioned idea and proposed new architecture named as Squeeze and Excitation Network (SE-Network) [111]. It exploits feature map (commonly known as channel in literature) information by designing a specialized SE-block. This block assigns weight to each feature map depending upon its contribution in class discrimination. This idea was further investigated by different researchers, which assign attention to important regions by exploiting both spatial and feature map (channel) information [37], [38], [112]. In 2018, a new idea of channel boosting was introduced by Khan et al [36]. The motivation behind the training of network with boosted channel representation was to use an enriched representation. This idea effectively boost the performance of a CNN by learning diverse features as well as exploiting the already learnt features through the concept of TL. | 2016年,還結(jié)合深度探索了網(wǎng)絡(luò)的寬度,以改進(jìn)特征學(xué)習(xí)[34],[35]。除此之外,沒(méi)有新的架構(gòu)修改變得突出,但相反,不同的研究人員使用已經(jīng)提出的架構(gòu)的混合來(lái)改進(jìn)深層CNN性能[33]、[104]–[106]、[109]、[110]。這一事實(shí)給人的直覺(jué)是,與能夠有效調(diào)節(jié)CNN性能的網(wǎng)絡(luò)單元的適當(dāng)組裝相比,可能還有其他因素更重要。在這方面,胡等人。(2017)確定了網(wǎng)絡(luò)代表在學(xué)習(xí)深層CNN方面的作用[111]。Hu等人。介紹了特征圖的開(kāi)發(fā)思想,指出信息量小、領(lǐng)域無(wú)關(guān)的特征對(duì)網(wǎng)絡(luò)性能的影響較大。他利用了上述思想,提出了一種新的結(jié)構(gòu),稱(chēng)為擠壓激勵(lì)網(wǎng)絡(luò)(SE網(wǎng)絡(luò))[111]。它通過(guò)設(shè)計(jì)一個(gè)專(zhuān)門(mén)的SE塊來(lái)開(kāi)發(fā)特征映射(在文獻(xiàn)中通常稱(chēng)為通道)信息。此塊根據(jù)其在類(lèi)別識(shí)別中的貢獻(xiàn)為每個(gè)特征映射分配權(quán)重。不同的研究者對(duì)此進(jìn)行了進(jìn)一步的研究,他們利用空間和特征地圖(通道)信息將注意力分配到重要區(qū)域[37]、[38]、[112]。2018年,Khan等人[36]提出了一種新的渠道提升理念。提高渠道表征的網(wǎng)絡(luò)訓(xùn)練背后的動(dòng)機(jī)是使用豐富的表征。這一思想通過(guò)學(xué)習(xí)不同的特征以及通過(guò)TL的概念利用已經(jīng)學(xué)習(xí)的特征,有效地提高了CNN的性能 |
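The feature-map (channel) exploitation idea of the SE-block can be sketched as follows: global average pooling "squeezes" each feature map to a single descriptor, a small two-layer network "excites" a per-channel weight, and the feature maps are rescaled accordingly. This is a sketch only; the reduction ratio of 4 is an illustrative choice, not the value from the SE-Network paper.

```python
import torch
import torch.nn as nn

# Squeeze-and-excitation style block: re-weight feature maps by their importance.
class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                    # one weight in (0, 1) per feature map
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        squeeze = x.mean(dim=(2, 3))         # global average pooling per channel
        weights = self.excite(squeeze).view(b, c, 1, 1)
        return x * weights                   # rescale feature maps by learned weights

print(SEBlock(32)(torch.randn(2, 32, 14, 14)).shape)  # torch.Size([2, 32, 14, 14])
```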
| ? ? ? ?From 2012 up till now, a lot of improvements have been reported in CNN architecture. As regards the architectural advancement of CNNs, recently the focus of research has been on designing of new blocks that can boost network representation by exploiting both feature maps and spatial information or by adding artificial channels. | 從2012年到現(xiàn)在,CNN的架構(gòu)有很多改進(jìn)。關(guān)于CNNs的體系結(jié)構(gòu)進(jìn)展,近年來(lái)的研究重點(diǎn)是設(shè)計(jì)新的塊,通過(guò)利用特征圖和空間信息或添加人工通道來(lái)增強(qiáng)網(wǎng)絡(luò)表示。 |
Summary