Batch Normalization: A Detailed Reading of the Original Paper

Published: 2023/12/20

This post has two parts:
one is the rough description of BN (short for Batch Normalization) given in the book [3];
the other is the complete description in the original paper [1].

#################### First, the book [3] ############################################
Batch Normalization was first proposed in [1]; the original paper gives no concrete figures.
Let us first summarize the gist of Batch Normalization as presented in [3]:

The figure above is from [3].
[1] and [2] discuss whether Batch Normalization should be inserted before or after the activation function.
[2] even contains a passage complaining that [1] never makes the insertion point clear. It is quoted here:

5.2.1 WHERE TO PUT BN – BEFORE OR AFTER NON-LINEARITY?
It is not clear from the paper Ioffe & Szegedy (2015) where to put the batch-normalization layer: before the input of each layer, as stated in Section 3.1, or before the non-linearity, as stated in Section 3.2. So we have conducted an experiment with FitNet4 on CIFAR-10 to clarify this. Results are shown in Table 5.
Exact numbers vary from run to run, but in most cases batch normalization put after the non-linearity performs better.
In the next experiment we compare BN-FitNet4, initialized with Xavier, and LSUV-initialized FitNet4. Batch normalization reduces training time in terms of the needed number of iterations, but each iteration becomes slower because of extra computations. The accuracy versus wall-clock-time graphs are shown in Figure 3.
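To make the two placements concrete, here is a toy sketch (my own, using a stripped-down parameter-free normalization; this is not the FitNet4 setup from [2]):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bn(x, eps=1e-5):
    # Simplified BN: per-feature standardization over the batch (gamma=1, beta=0).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
pre = rng.normal(size=(64, 10)) @ rng.normal(size=(10, 10))  # pre-activations Wu

bn_before = relu(bn(pre))  # "BN before non-linearity", the reading of [1] Sec. 3.2
bn_after = bn(relu(pre))   # "BN after non-linearity", often better in [2]'s runs
```

Note that in the second variant the normalized values themselves feed the next layer, so each feature is exactly zero-mean over the batch, while in the first variant the ReLU re-skews the distribution.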

Note:
the companion code of [3] does not implement the Batch Normalization transform for convolutional layers described in [1].

Batch Normalization, algorithm one (so called because it is Algorithm 1 in [1]):

In plain terms, what does it mean?
The Batch Norm layer in the figure above takes the $m$ examples of a mini-batch as input and outputs $y_i$.
The mapping from the mini-batch to $y_i$ is realized by Algorithm 1 above.
Note that the book [3] does not discuss backpropagation through BN, its code has no corresponding implementation, and it does not implement the BN transform for convolutional layers either.
For a complete Batch Normalization implementation you still need to read the TensorFlow source code.
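That said, the forward transform of Algorithm 1 fits in a few lines. A sketch of mine in NumPy (the function name and cache layout are my own, not from [1] or [3]):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Algorithm 1 of [1]: normalize a mini-batch, then scale and shift.

    x: (m, d) mini-batch; gamma, beta: (d,) learned parameters.
    Returns y and a cache of intermediates for the backward pass.
    """
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # biased mini-batch variance (divides by m)
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    y = gamma * x_hat + beta               # scale and shift
    cache = (x, x_hat, mu, var, gamma, eps)
    return y, cache
```

After this transform each feature of $\hat{x}$ has (up to $\epsilon$) zero mean and unit variance over the batch, and $\gamma$, $\beta$ let the network recover any other mean/scale if that is what minimizes the loss.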

####################### Next, the structure of the original paper [1] #############################
Paper structure:

$$\text{Batch Normalization}=\begin{cases}\text{Abstract}\\ \text{1. Introduction}\\ \text{2. Towards Reducing Internal Covariate Shift}\\ \text{3. Normalization via Mini-Batch Statistics}\\ \text{4. Experiments}\end{cases}$$

∴ the core of the paper is Sections 2 and 3.

######### First, what does the Introduction say? ####################
Deep learning has dramatically advanced the state of the art in vision, speech, and many other areas. Stochastic gradient descent (SGD) has proved to be an effective way of training deep networks, and SGD variants such as momentum (Sutskever et al., 2013) and Adagrad (Duchi et al., 2011) have been used to achieve state of the art performance. SGD optimizes the parameters $\Theta$ of the network, so as to minimize the loss

$$\Theta=\arg\min_{\Theta}\frac{1}{N}\sum_{i=1}^{N}\ell(x_i,\Theta)$$

where $x_{1\ldots N}$ is the training data set. With SGD, the training proceeds in steps, and at each step we consider a mini-batch $x_{1\ldots m}$ of size $m$. The mini-batch is used to approximate the gradient of the loss function with respect to the parameters, by computing

$$\frac{1}{m}\frac{\partial \ell(x_i,\Theta)}{\partial \Theta}$$

(Here $\Theta$ means the weights of the neural network. Most of the passage above is boilerplate.)

Using mini-batches of examples, as opposed to one example at a time, is helpful in several ways. First, the gradient of the loss over a mini-batch is an estimate of the gradient over the training set, whose quality improves as the batch size increases. Second, computation over a batch can be much more efficient than m computations for individual examples, due to the parallelism afforded by modern computing platforms. (In other words: the gradient over the whole data set is the better estimate, but batches are much faster to compute. This paragraph is mostly a review of basics.)
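The first point is easy to check numerically. A toy example of my own (a one-parameter least-squares model; all names are made up): the mini-batch gradient's average error against the full-dataset gradient shrinks as the batch size $m$ grows.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 10_000
x = rng.normal(size=N)
y = 3.0 * x + rng.normal(scale=0.1, size=N)
theta = 0.0  # a single scalar weight; per-example loss l = (theta*x - y)^2

def grad(xs, ys):
    # gradient of the mean loss over the given examples w.r.t. theta
    return np.mean(2.0 * (theta * xs - ys) * xs)

full_grad = grad(x, y)  # gradient over the whole training set
mean_errors = []
for m in (10, 100, 1000):
    draws = [abs(grad(x[idx], y[idx]) - full_grad)
             for idx in (rng.choice(N, size=m, replace=False) for _ in range(200))]
    mean_errors.append(np.mean(draws))
# mean_errors shrinks as the batch size m grows (roughly like 1/sqrt(m))
```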

While stochastic gradient is simple and effective, it requires careful tuning of the model hyper-parameters, specifically the learning rate used in optimization, as well as the initial values for the model parameters. The training is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers – so that small changes to the network parameters amplify as the network becomes deeper.
(The last sentence is the key point: small changes in the shallow layers cause large changes in the deep layers.)

The change in the distributions of layers' inputs presents a problem because the layers need to continuously adapt to the new distribution. When the input distribution to a learning system changes, it is said to experience covariate shift (Shimodaira, 2000). This is typically handled via domain adaptation (Jiang, 2008). However, the notion of covariate shift (this is the paper's first mention of covariate shift) can be extended beyond the learning system as a whole, to apply to its parts, such as a sub-network or a layer. Consider a network computing

$$\ell=F_2(F_1(u,\Theta_1),\Theta_2)$$

where $F_1$ and $F_2$ are arbitrary transformations (here, the activation functions), and the parameters $\Theta_1,\Theta_2$ are to be learned so as to minimize the loss $\ell$. Learning $\Theta_2$ can be viewed as if the inputs $x=F_1(u,\Theta_1)$ were fed into the sub-network

$$\ell=F_2(x,\Theta_2)$$

For example, a gradient descent step

$$\Theta_2 \leftarrow \Theta_2-\frac{\alpha}{m}\sum_{i=1}^{m}\frac{\partial F_2(x_i,\Theta_2)}{\partial \Theta_2}$$

(for batch size $m$ and learning rate $\alpha$) is exactly equivalent to that for a stand-alone network $F_2$ with input $x$. Therefore, the input distribution properties that make training more efficient – such as having the same distribution between the training and test data – apply to training the sub-network as well. As such (strictly speaking), it is advantageous for the distribution of $x$ to remain fixed over time. Then, $\Theta_2$ does not have to readjust to compensate for the change in the distribution of $x$.
(The above is still a review of basics – how the weights are updated; $F_1$ and $F_2$ here both denote activation functions.)

Fixed distribution of inputs to a sub-network would have positive consequences for the layers outside the sub-network, as well. Consider a layer with a sigmoid activation function $z = g(Wu+b)$ where $u$ is the layer input, the weight matrix $W$ and bias vector $b$ are the layer parameters to be learned, and $g(x)=\frac{1}{1+\exp(-x)}$. As $|x|$ increases, $g'(x)$ tends to zero. This means that for all dimensions of $x = Wu+b$ except those with small absolute values, the gradient flowing down to $u$ will vanish and the model will train slowly. However, since $x$ is affected by $W$, $b$ and the parameters of all the layers below, changes to those parameters during training will likely move many dimensions of $x$ into the saturated regime of the nonlinearity and slow down the convergence. This effect is amplified as the network depth increases. In practice, the saturation problem and the resulting vanishing gradients are usually addressed by using Rectified Linear Units (Nair & Hinton, 2010), $ReLU(x) = \max(x,0)$, careful initialization (Bengio & Glorot, 2010; Saxe et al., 2013), and small learning rates. If, however, we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate.
The point of this passage: during training, keep activations out of the saturated region of the activation function; once they are there, escaping is hard, which prolongs training.
In other words, stable weights can mean two different things: 1. convergence, or 2. vanishing gradients.
For the concepts of exploding and vanishing gradients, [7] explains them very well.
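The saturation claim is easy to verify: the sigmoid's derivative $g'(x)=g(x)(1-g(x))$ peaks at $x=0$ and collapses toward zero for large $|x|$. A quick check (my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # g'(x) = g(x) * (1 - g(x))

# g'(0) = 0.25 is the maximum; far from zero the gradient is essentially gone,
# so upstream layers that push x into |x| >> 0 receive almost no learning signal.
peak = sigmoid_grad(0.0)        # exactly 0.25
saturated = sigmoid_grad(10.0)  # tiny: on the order of 1e-5
```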

We refer to the change in the distributions of internal nodes of a deep network, in the course of training, as Internal Covariate Shift. (This is the paper's first mention of internal covariate shift.) Eliminating it offers a promise of faster training. We propose a new mechanism, which we call Batch Normalization, that takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs. Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows us to use much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for Dropout (Srivastava et al., 2014). Finally, Batch Normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.
Note: the difference between covariate shift and internal covariate shift is that the former concerns the learning system as a whole, while the latter applies to its parts (i.e., the individual layers – and each layer type is handled differently).
This paragraph lists a pile of benefits, but most of them the authors discovered by accident; they were not the original goal.

In Sec. 4.2, we apply Batch Normalization to the best-performing ImageNet classification network, and show that we can match its performance using only 7% of the training steps, and can further exceed its accuracy by a substantial margin. Using an ensemble of such networks trained with Batch Normalization, we achieve a top-5 error rate that improves upon the best known results on ImageNet classification.
The key point here: with BN, only 7% of the original number of training steps is needed.

################### End of Introduction ######################################

################# Section 2 ####################

This section describes the authors' exploration: is it a good idea to apply the normalization after backpropagation, i.e., with gradient descent ignoring it? The authors argue it is not, because the normalization cancels the bias update. The analysis:
For example, consider a layer with the input $u$ that adds the learned bias $b$, and normalizes the result by subtracting the mean of the activation computed over the training data:

$$\hat{x}=x-E[x]$$

where $x=u+b$, $\mathcal{X}=\{x_{1\ldots N}\}$ is the set of values of $x$ over the training set, and $E[x]=\frac{1}{N}\sum_{i=1}^{N}x_i$.
If a gradient descent step ignores the dependence of $E[x]$ on $b$, then it will update $b \leftarrow b+\Delta b$, where $\Delta b \propto -\partial\ell/\partial\hat{x}$. Then

$$u+(b+\Delta b)-E[u+(b+\Delta b)]=u+b-E[u+b]$$

Thus, the combination of the update to $b$ and the subsequent change in normalization led to no change in the output of the layer nor, consequently, the loss.

What does this mean? If the normalization sits outside of backprop, then after each BP step every $b$ grows by $\Delta b$, and $E[u+b]$ also grows by $\Delta b$; once the normalization subtracts the mean, the two cancel, the $\Delta b$ disappears, and the BP update was wasted effort.
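The cancellation can be seen with a few lines of arithmetic (my own toy numbers):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])  # layer input over a tiny "training set"
b = 0.5                        # learned bias
delta_b = 0.1                  # update proposed by a gradient step

def normalized_output(bias):
    x = u + bias
    return x - x.mean()  # x_hat = x - E[x], the mean subtraction of Sec. 2

before = normalized_output(b)
after = normalized_output(b + delta_b)
# The bias update is absorbed by the mean subtraction: the output is unchanged,
# so the loss is unchanged too, and b can grow without bound.
```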

A brief terminological note on this paper:
inference step: the forward pass – at inference time the network must be evaluated from input to output to produce a prediction.
gradient step: the backpropagation phase.

A flaw of this paper is that some of its terminology is not used consistently throughout.

#################### End of Section 2 #################

################### Section 3 ################################
3. proposes Algorithm 1 above, plus how BN is handled during backpropagation
3.1. proposes Algorithm 2, essentially a description of the workflow of a whole network once Algorithm 1 is plugged in
3.2. BN for convolutional layers (mentioned only in passing, in plain text)
3.3. and 3.4. describe BN's benefits

Section 3.1 (Algorithm 2) is covered further below; here let us focus on how Section 3 handles backpropagation:

$$\frac{\partial \ell}{\partial \hat{x}_i}=\frac{\partial \ell}{\partial y_i}\cdot\gamma$$

$$\frac{\partial \ell}{\partial \sigma_\mathcal{B}^2}=\sum_{i=1}^{m}\frac{\partial \ell}{\partial \hat{x}_i}\cdot\left(x_i-\mu_\mathcal{B}\right)\cdot\frac{-1}{2}\left(\sigma_\mathcal{B}^2+\epsilon\right)^{-3/2}$$

$$\frac{\partial \ell}{\partial \mu_\mathcal{B}}=\left(\sum_{i=1}^{m}\frac{\partial \ell}{\partial \hat{x}_i}\cdot\frac{-1}{\sqrt{\sigma_\mathcal{B}^2+\epsilon}}\right)+\frac{\partial \ell}{\partial \sigma_\mathcal{B}^2}\cdot\frac{\sum_{i=1}^{m}-2\left(x_i-\mu_\mathcal{B}\right)}{m}$$

$$\frac{\partial \ell}{\partial x_i}=\frac{\partial \ell}{\partial \hat{x}_i}\cdot\frac{1}{\sqrt{\sigma_\mathcal{B}^2+\epsilon}}+\frac{\partial \ell}{\partial \sigma_\mathcal{B}^2}\cdot\frac{2\left(x_i-\mu_\mathcal{B}\right)}{m}+\frac{\partial \ell}{\partial \mu_\mathcal{B}}\cdot\frac{1}{m}$$

$$\frac{\partial \ell}{\partial \gamma}=\sum_{i=1}^{m}\frac{\partial \ell}{\partial y_i}\cdot\hat{x}_i$$

$$\frac{\partial \ell}{\partial \beta}=\sum_{i=1}^{m}\frac{\partial \ell}{\partial y_i}$$

Although the paper gives the derivatives of the loss $\ell$ with respect to $\gamma$ and $\beta$, it never states how $\gamma$ and $\beta$ are updated. Since the paper is from Google and TensorFlow is Google's framework, the concrete details are best found in the TensorFlow code.

The derivation above follows a pattern: each line treats the partial derivatives already obtained on the previous lines as known constants and reuses them in its own chain-rule step.
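These six equations translate almost line for line into code. A sketch of mine (the function name is made up; the forward intermediates are recomputed here to keep the example self-contained):

```python
import numpy as np

def batchnorm_backward(dy, x, gamma, eps=1e-5):
    """Backward pass of BN, following the chain-rule equations of [1] Sec. 3.

    dy: upstream gradient dl/dy of shape (m, d). Returns (dx, dgamma, dbeta).
    """
    m = x.shape[0]
    mu = x.mean(axis=0)
    var = x.var(axis=0)                       # biased variance, as in Algorithm 1
    x_hat = (x - mu) / np.sqrt(var + eps)

    dx_hat = dy * gamma                                                  # dl/dx_hat
    dvar = np.sum(dx_hat * (x - mu) * -0.5 * (var + eps) ** -1.5, axis=0)
    dmu = (np.sum(dx_hat * -1.0 / np.sqrt(var + eps), axis=0)
           + dvar * np.sum(-2.0 * (x - mu), axis=0) / m)
    dx = (dx_hat / np.sqrt(var + eps)
          + dvar * 2.0 * (x - mu) / m
          + dmu / m)
    dgamma = np.sum(dy * x_hat, axis=0)
    dbeta = np.sum(dy, axis=0)
    return dx, dgamma, dbeta
```

One way to gain confidence in such a derivation is to compare `dx` against a finite-difference gradient of the forward pass; the two should agree to several decimal places.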

################### End of Section 3 ################################
What follows in the paper is the experiments (omitted here).

############################################################

The paper's idea is to bring the whitening operation of [5][6] inside the layers of a neural network.
The paper proposes two algorithms. Algorithm 1 is the one above; Algorithm 2 is the one below:

First, where does the factor $\frac{m}{m-1}$ above come from?
It is the unbiased variance estimate.
From probability theory and mathematical statistics:
Algorithm 1 computes the biased mini-batch variance

$$\sigma_\mathcal{B}^2=\frac{1}{m}\sum_{i=1}^m\left(x_i-\bar{x}\right)^2$$

where $m$ is the number of examples in one batch.
Its expectation over batches satisfies $E_\mathcal{B}[\sigma_\mathcal{B}^2]=\frac{m-1}{m}\,\mathrm{Var}[x]$, i.e., it underestimates the population variance.
At inference time we want the population variance, not the biased batch statistic, so Algorithm 2 uses the unbiased estimate

$$\mathrm{Var}[x]=\frac{m}{m-1}\,E_\mathcal{B}\left[\sigma_\mathcal{B}^2\right]$$
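A quick Monte-Carlo check of my own makes the bias visible: with batches of $m=4$ standard-normal samples (true variance 1), the average biased batch variance comes out near $\frac{m-1}{m}=0.75$, and multiplying by $\frac{m}{m-1}$ recovers the population value.

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials = 4, 100_000  # many small batches of standard-normal samples

batch_vars = rng.normal(size=(trials, m)).var(axis=1)  # biased: divides by m
biased = batch_vars.mean()           # ≈ (m-1)/m * 1 = 0.75, underestimates
corrected = (m / (m - 1)) * biased   # ≈ 1.0 after the m/(m-1) correction
```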

Note:
This paper does not make clear how $\gamma$ and $\beta$ in Algorithm 2 are updated.
Algorithm 2 calls Algorithm 1.
Algorithm 2's workflow corresponds to the network structure diagram above.
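A minimal sketch of Algorithm 2's inference-time behavior (a simplification of mine: I take a plain average of the batch statistics; the paper leaves the averaging details open, and real frameworks such as TensorFlow typically keep an exponential moving average instead):

```python
import numpy as np

def bn_freeze(batches, gamma, beta, eps=1e-5):
    """Inference-time BN in the spirit of Algorithm 2 of [1]: replace batch
    statistics by population estimates gathered over the training batches."""
    m = batches[0].shape[0]
    E_x = np.mean([b.mean(axis=0) for b in batches], axis=0)
    # Unbiased population variance: Var[x] = m/(m-1) * E_B[sigma_B^2]
    Var_x = (m / (m - 1)) * np.mean([b.var(axis=0) for b in batches], axis=0)

    def infer(x):
        # A fixed affine transform: no dependence on any test-time batch.
        return gamma * (x - E_x) / np.sqrt(Var_x + eps) + beta
    return infer
```

Because inference is a fixed affine map, a single example can be normalized on its own, deterministically, with no mini-batch required.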

######################################################################
Finally, for interview preparation – what are BN's benefits? From [8]:
① It greatly speeds up training; convergence is much faster.
② It also improves classification performance. One explanation is that it acts as a Dropout-like regularizer against overfitting, so comparable results can be reached without Dropout.
③ It also makes hyperparameter tuning much easier: the demands on initialization are lower, and larger learning rates can be used.
######################################################################

Reference:
[1] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift – Sergey Ioffe and Christian Szegedy
[2] All you need is a good init – Dmytro Mishkin and Jiri Matas
[3] 深度學習入門-基于Python的理論與實現 (Deep Learning from Scratch)
[4] Understanding the backward pass through Batch Normalization Layer
[5] Efficient BackProp – Yann A. LeCun, Léon Bottou, Genevieve B. Orr, et al.
[6] A convergence analysis of log-linear training – Simon Wiesler and Hermann Ney
[7] 深度學習中 Batch Normalization為什么效果好? (Why does Batch Normalization work so well in deep learning?)
[8] 【深度學習】深入理解Batch Normalization批標準化 (An in-depth understanding of Batch Normalization)
