The Architecture and Implementation of AlexNet
In my last blog, I gave a detailed explanation of the LeNet-5 architecture. In this blog, we’ll explore its enhanced successor, AlexNet.
AlexNet, submitted by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, was the winner of the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2012), beating its nearest contender by more than 10 percentage points in error rate. These visual recognition challenges help researchers monitor the progress of computer vision research across the globe.
Before we proceed further, let’s discuss the data in the ImageNet dataset. It contains images of dogs, horses, cars, etc., organized into 1000 classes, each with thousands of images. In total, there are approximately 1.2 million high-resolution images in this dataset, used by researchers for training, validating, and testing the models they design.
Let’s dive into the AlexNet Architecture
The AlexNet neural network architecture consists of 8 learned layers: 5 convolution layers and 3 fully connected layers, with max-pooling layers in between and a 1000-channel softmax output layer. The pooling used here is max pooling.
Why is a 1000-channel softmax layer used?
This is because the Imagenet dataset contains 1000 different classes of images, so at the final output layer we have one node for each of these 1000 categories and the output layer is the softmax output layer.
The basic architecture of AlexNet is as shown below:
(Image from Anh H Reynolds’ blog.)

The input to the AlexNet network is a 227 x 227 RGB image, so it has 3 different channels: red, green, and blue.
Then we have the First Convolution Layer, which has 96 different kernels, each of size 11 x 11, applied with a stride of 4. The output of this layer therefore has 96 channels or feature maps (one per kernel), each of size 55 x 55.
Calculations:
Size of input: N = 227 x 227
Size of convolution kernels: f = 11 x 11
Number of kernels: 96
Stride: S = 4
Padding: P = 0
Size of each feature map = [(N - f + 2P)/S] + 1 = (227 - 11 + 0)/4 + 1 = 55
So every feature map after the first convolution layer is of the size 55 x 55.
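The same formula governs every convolution and pooling stage that follows, so all the sizes in this post can be checked with a small Python helper (a sketch, not part of the original post):

```python
def output_size(n, f, s, p):
    """Spatial size after a convolution or pooling layer: [(N - f + 2P)/S] + 1."""
    return (n - f + 2 * p) // s + 1

size = output_size(227, 11, 4, 0)  # conv1 (96 kernels, 11x11, stride 4) -> 55
size = output_size(size, 3, 2, 0)  # max pool (3x3, stride 2)            -> 27
size = output_size(size, 5, 1, 2)  # conv2 (256 kernels, 5x5, pad 2)     -> 27
size = output_size(size, 3, 2, 0)  # max pool                            -> 13
size = output_size(size, 3, 1, 1)  # conv3 (384 kernels, 3x3, pad 1)     -> 13
size = output_size(size, 3, 1, 1)  # conv4 (384 kernels)                 -> 13
size = output_size(size, 3, 1, 1)  # conv5 (256 kernels)                 -> 13
size = output_size(size, 3, 2, 0)  # max pool                            -> 6
print(size)  # 6
```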
After this convolution we have an Overlapping Max Pool Layer, where max pooling is done over a 3 x 3 window with a stride of 2. Because the window (3 x 3) is larger than the stride (2), the pooling is done over overlapping windows. After this pooling, each feature map is reduced to 27 x 27 and the number of channels remains 96.
Calculations:
Size of input: N = 55 x 55
Size of pooling window: f = 3 x 3
Stride: S = 2
Padding: P = 0
Size of each feature map = (55 - 3 + 0)/2 + 1 = 27
So every feature map after this pooling is of the size 27 x 27.
Then we have the Second Convolution Layer, with a kernel size of 5 x 5 and padding of 2, so that the output of the layer keeps the same spatial size as its input. It uses 256 kernels, so the output has 256 channels or feature maps, each of size 27 x 27.
Calculations:
Size of input: N = 27 x 27
Size of convolution kernels: f = 5 x 5
Number of kernels: 256
Stride: S = 1
Padding: P = 2
Size of each feature map = (27 - 5 + 4)/1 + 1 = 27
So every feature map after the second convolution layer is of the size 27 x 27.
Now again we have an Overlapping Max Pool Layer, with max pooling over a 3 x 3 window and a stride of 2, so pooling is again done over overlapping windows. Its output is 256 feature maps of size 13 x 13.
Calculations:
Size of input: N = 27 x 27
Size of pooling window: f = 3 x 3
Stride: S = 2
Padding: P = 0
Size of each feature map = (27 - 3 + 0)/2 + 1 = 13
So every feature map after this pooling is of size 13 x 13.
Then we have Three Consecutive Convolution Layers. The first of these has a kernel size of 3 x 3 with padding of 1 and 384 kernels, giving 384 feature maps of size 13 x 13, which are passed to the next convolution layer.
Calculations:
Size of input: N = 13 x 13
Size of convolution kernels: f = 3 x 3
Number of kernels: 384
Stride: S = 1
Padding: P = 1
Size of each feature map = (13 - 3 + 2)/1 + 1 = 13
In the second of these convolutions, the kernel size is 3 x 3 with padding of 1 and 384 kernels, so the output again has 384 channels or feature maps, each of size 13 x 13. Because the padding of 1 matches the 3 x 3 kernel size, the feature maps at the output of this layer keep the same size as those at its input.
Calculations:
Size of input: N = 13 x 13
Size of convolution kernels: f = 3 x 3
Number of kernels: 384
Stride: S = 1
Padding: P = 1
Size of each feature map = (13 - 3 + 2)/1 + 1 = 13
The output of the second convolution is passed through one more convolution layer, again with a 3 x 3 kernel and padding of 1, so the feature maps keep their 13 x 13 size. In this case, however, AlexNet uses 256 kernels: the 384 input channels are converted to 256 output channels, i.e. 256 feature maps of size 13 x 13 are generated at the end of this convolution.
Calculations:
Size of input: N = 13 x 13
Size of convolution kernels: f = 3 x 3
Number of kernels: 256
Stride: S = 1
Padding: P = 1
Size of each feature map = (13 - 3 + 2)/1 + 1 = 13
This is followed by the next Overlapping Max Pool Layer, with max pooling again over a 3 x 3 window and a stride of 2. The number of channels stays at 256, and each feature map is reduced to 6 x 6.
Calculations:
Size of input: N = 13 x 13
Size of pooling window: f = 3 x 3
Stride: S = 2
Padding: P = 0
Size of each feature map = (13 - 3 + 0)/2 + 1 = 6
Now we have the fully connected layers, which work like a multi-layer perceptron. The first two fully connected layers have 4096 nodes each. After the last max pooling described above, we have a total of 6 * 6 * 256 = 9216 nodes or features, and each of them is connected to every node of the first fully connected layer. So the number of connections in this case is 9216 * 4096. Every node of that layer then provides input to every node of the second fully connected layer, which also has 4096 nodes, giving a further 4096 * 4096 connections.
And then, in the end, we have an output layer with 1000 softmax channels. Thus the number of connections between the second fully connected layer and the output layer is 4096*1000.
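These connection counts can be verified with a few lines of arithmetic (a quick sketch; the figures count weight connections only, ignoring bias terms):

```python
flattened = 6 * 6 * 256   # nodes after the last max pool
fc1 = flattened * 4096    # connections into the first fully connected layer
fc2 = 4096 * 4096         # connections into the second fully connected layer
out = 4096 * 1000         # connections into the 1000-way softmax output

print(flattened)          # 9216
print(fc1)                # 37748736
print(fc1 + fc2 + out)    # 58621952
```

Note that these fully connected connections alone account for the bulk of AlexNet’s roughly 60 million parameters.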
Training on multiple GPUs
(Original image published in [AlexNet-2012].)

As the figure shows, AlexNet was implemented in two parallel streams because the network, trained on 1.2 million examples, was too big to fit on one GPU. Half of the network is placed in one stream and the other half in the other, which made it possible to train the network on two different GPU cards. The GPUs used were GTX 580 3GB cards, and training took five to six days. Cross-GPU parallelization (one GPU communicating with the other) happens only at certain layers: the kernels of layer 3 take input from all kernel maps in layer 2, whereas the kernels in layer 4 take input only from those kernel maps in layer 3 which stay on the same GPU.
Vanishing Gradient Problem
If we use a saturating non-linear activation function such as the sigmoid or hyperbolic tangent (tanh), we run the risk of vanishing gradients: when training the network with gradient descent, the gradient of the error function can become so small that the updates it produces for the network parameters are almost negligible. That is the vanishing gradient problem.
Why does the Vanishing Gradient Problem arise?
As the graph of the sigmoid function shows, when the input value is very high the output saturates to 1, and when it is very low the output saturates to 0. At these points the gradient is almost 0. The same is true if our non-linear activation function is tanh.
How to prevent the Vanishing Gradient Problem?
To prevent the vanishing gradient problem we use the ReLU (Rectified Linear Unit) activation function. Since ReLU is max(x, 0), the gradient is constant for x > 0; this is the advantage of using ReLU as the non-linear activation. Training with gradient descent is also much faster with a non-saturating nonlinearity like ReLU than with saturating nonlinearities like tanh and sigmoid. This can be seen in the diagram below, where a four-layer convolutional network with ReLUs (solid line) reached a 25% training error rate on the CIFAR-10 dataset six times faster than the same network run with tanh (dashed line) as the activation function. Thus, a network with ReLU activations learns almost six times faster than one with saturating activations.
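A quick numerical check (an illustrative sketch, not from the original post) makes the contrast concrete: the sigmoid and tanh gradients collapse for large inputs, while the ReLU gradient stays at 1 for any positive input:

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)            # derivative of sigmoid: s(x) * (1 - s(x))

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2  # derivative of tanh: 1 - tanh(x)^2

def relu_grad(x):
    return 1.0 if x > 0 else 0.0    # derivative of max(x, 0), for x != 0

print(sigmoid_grad(0.0))   # 0.25 -- a healthy gradient near zero
print(sigmoid_grad(20.0))  # ~2e-9 -- effectively vanished
print(tanh_grad(20.0))     # ~0 -- effectively vanished
print(relu_grad(20.0))     # 1.0 -- constant, no vanishing
```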
(Original image published in [AlexNet-2012].)

Problem with using ReLU as an activation function
Unlike the sigmoid and tanh activation functions, whose outputs are limited and bounded, the output of ReLU is unbounded: as x increases, the output of ReLU increases with it. To avoid this problem, AlexNet normalizes the output of the convolution layer before applying ReLU, through a process known as Local Response Normalization (LRN).
本地響應(yīng)規(guī)范化 (Local Response Normalization)
Local Response Normalization is a type of normalization in which excited neurons are amplified while the surrounding neurons in a local neighborhood are damped at the same time. This operation is inspired by a phenomenon in neurobiology known as lateral inhibition, which refers to the capacity of a neuron to reduce the activity of its neighbors. So the output of the convolution is first normalized before the non-linear activation is applied, to limit the values produced by the unbounded ReLU. It is a non-trainable layer in the network.
This is how both the unbounded-output problem and the vanishing gradient problem are prevented.
Local Response Normalization can be done across channels or within a channel. Normalization across channels is known as Inter-channel normalization, and normalization between features of the same channel is known as Intra-channel normalization. Inter-channel normalization is the variant used in the AlexNet network. The two types of LRN, which differ in their neighborhood, are shown in the figure below:
(Image from the blog by Aqeel Anwar.)

Inter-channel normalization: here normalization is performed across the channels, so the neighborhood runs along the channel (depth) dimension. The normalized output at position (x, y) is given by the following formula:
Here, b[i, (x,y)] is the output at location (x,y) in the i-th channel and a[i, (x,y)] is the original value at location (x,y) in the i-th channel. We normalize a[i, (x,y)] by a factor given by k plus alpha times the sum of the squares of a[j, (x,y)], where j ranges over the neighboring channels: from max(0, i - n/2) to min(N - 1, i + n/2), i.e. n/2 channels before and n/2 channels after channel i. The 0 and N - 1 bounds take care of the first and last channels. After this normalization the output is bounded, as the subsequent figure shows.
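As a sketch, inter-channel LRN can be written in a few lines of NumPy. The hyperparameter values below (k = 2, n = 5, alpha = 1e-4, beta = 0.75) are those reported in the AlexNet paper, and `lrn_inter_channel` is a made-up name for illustration:

```python
import numpy as np

def lrn_inter_channel(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Inter-channel LRN: divide a[i, x, y] by
    (k + alpha * sum of squared activations over n neighboring channels) ** beta.
    `a` has shape (channels, height, width)."""
    channels = a.shape[0]
    b = np.empty_like(a, dtype=float)
    for i in range(channels):
        lo = max(0, i - n // 2)             # clamp at the first channel
        hi = min(channels - 1, i + n // 2)  # clamp at the last channel
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

a = np.ones((8, 4, 4))
a[3] = 100.0                 # one strongly excited channel
b = lrn_inter_channel(a)
# The excited channel is damped relative to its raw activation.
```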
(Image from the blog by Aqeel Anwar.)

Intra-channel normalization: here normalization occurs between neighboring neurons across the surface of the same channel.
Here, b[k, (x,y)] is the output at location (x,y) in the k-th channel and a[k, (x,y)] is the original value at location (x,y) in the k-th channel. We normalize a[k, (x,y)] by a factor given by k plus alpha times the sum of the squares of the feature values within the neighborhood of x and y. The min and max bounds take care of the features at the boundary of the feature maps. After this normalization the output is bounded, as the subsequent figure shows.
(Image from the blog by Aqeel Anwar.)

NOTE: ‘k’ appears in the normalization factor of both variants to avoid division by zero; here ‘k’ and ‘alpha’ are hyperparameters.
Problem of Overfitting
With 60 million parameters to be trained, the network is prone to overfitting: it can learn or memorize the training data very well but fail to encode the general properties or features of unseen inputs, so its performance outside the training set may not be acceptable. To reduce overfitting, additional augmented data was generated from the existing data; the augmentation (i.e. the generation of new images from the original images through variations like horizontal flipping, vertical flipping, zooming, etc.) was done by mirroring and by taking random crops of the input data. Another method used to take care of overfitting is dropout regularization.
什么是輟學(xué)正規(guī)化? (What is Dropout Regularization??)
(Image from Packt Subscription.)

In dropout, randomly selected neurons or nodes, each chosen with a probability of 0.5, are temporarily dropped from the network, so the probability of a node being removed and the probability of it being retained are both 0.5. A dropped-out node does not pass its output to the nodes in the subsequent layers downstream, and during backward propagation no update takes place for it, since removing a node also removes its connections. Dropout increases the number of iterations required to train the model, but it makes the model less vulnerable to overfitting and thus generalizes it.
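The mechanics can be sketched in a few lines of NumPy (an illustrative sketch; it uses the modern “inverted dropout” convention, which scales the surviving activations during training, whereas the original paper instead halved the activations at test time):

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(x, p_drop=0.5, training=True):
    """Zero each activation with probability p_drop; scale the survivors by
    1 / (1 - p_drop) so the expected total activation is unchanged."""
    if not training:
        return x  # at test time every node is kept
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)

x = np.ones((4, 4096))  # activations of a fully connected layer
y = dropout(x)          # roughly half the entries become 0, the rest become 2.0
```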
有關(guān)Alexnet架構(gòu)的事實和數(shù)據(jù) (Facts and figures regarding Alexnet architecture)
The weight (w) update rule was as follows:
The initial weights of all layers were drawn from a Gaussian distribution with mean 0 and standard deviation 0.01. The initial bias was set to 1 for the second, fourth, and fifth convolution layers and for all fully connected layers, and to 0 for all other layers.
AlexNet code using Keras
Here we are going to use the oxflower17 dataset prepared by Oxford, which is available through the tflearn library.
One important thing to note here is that the images in the tflearn dataset are 224 x 224, so we’ll use a 224 x 224 input instead of the 227 x 227 used in the original architecture.
導(dǎo)入所需的庫 (Importing the required libraries)
加載數(shù)據(jù)集 (Loading the dataset)
檢查X和Y的形狀 (Checking the shape of X and Y)
建立模型 (Creating the model)
模型總結(jié) (Summary of the model)
There are approximately 46 million trainable parameters here, as the model summary shows.
Compiling the model
Training the model
📌 To get the complete code of AlexNet or any other network visit my GitHub repository.
[1] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton — ImageNet Classification with Deep Convolutional Neural Networks (2012)
Thanks for reading. Hope this blog would have helped you with both the coding and understanding of the architecture. 😃
翻譯自: https://medium.com/analytics-vidhya/the-architecture-implementation-of-alexnet-135810a3370
alexnet 結(jié)構(gòu)
總結(jié)
以上是生活随笔為你收集整理的alexnet 结构_AlexNet的体系结构和实现的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 有什么隐藏应用的软件
- 下一篇: DeepR —训练TensorFlow模