Paper: Translation and Commentary on "Understanding the difficulty of training deep feedforward neural networks" (Xavier initialization)
Contents
Understanding the difficulty of training deep feedforward neural networks
Abstract
5 Error Curves and Conclusions
Related articles
Paper: Translation and Commentary on "Understanding the difficulty of training deep feedforward neural networks" (Xavier initialization)
Paper: Translation and Commentary on "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" (He initialization)
DNN optimization techniques: An introduction to parameter initialization in DNNs (LeCun, He, and Xavier initialization) and a detailed usage guide
Understanding the difficulty of training deep feedforward neural networks
Original paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
http://proceedings.mlr.press/v9/glorot10a.html
Authors: Xavier Glorot, Yoshua Bengio; DIRO, Université de Montréal, Montréal, Québec, Canada
Citation: [1] Xavier Glorot, Yoshua Bengio; Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR 9:249-256, 2010.
Abstract
Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, and explaining the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.
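The "new initialization scheme" the abstract refers to is the normalized initialization derived later in the paper, now widely known as Xavier (or Glorot) initialization. A minimal NumPy sketch of the uniform variant (the function name is mine):

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Normalized ("Xavier") initialization:
    W ~ U[-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out)],
    which gives Var(W) = 2 / (n_in + n_out)."""
    rng = np.random.default_rng(0) if rng is None else rng
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

# One 256 -> 128 layer; the empirical std should be near sqrt(2 / 384) ~= 0.072.
W = xavier_uniform(256, 128)
print(W.std())
```

The choice of variance balances two constraints from the paper's analysis: keeping activation variance roughly constant in the forward pass and gradient variance roughly constant in the backward pass.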
5 Error Curves and Conclusions
The final consideration that we care for is the success of training with different strategies, and this is best illustrated with error curves showing the evolution of test error as training progresses and asymptotes. Figure 11 shows such curves with online training on Shapeset-3 × 2, while Table 1 gives final test error for all the datasets studied (Shapeset-3 × 2, MNIST, CIFAR-10, and SmallImageNet). As a baseline, we optimized RBF SVM models on one hundred thousand Shapeset examples and obtained 59.47% test error, while on the same set we obtained 50.47% with a depth five hyperbolic tangent network with normalized initialization.
Figure 8: Weight gradient normalized histograms with hyperbolic tangent activation just after initialization, with standard initialization (top) and normalized initialization (bottom), for different layers. Even though with standard initialization the back-propagated gradients get smaller, the weight gradients do not!

Figure 9: Standard deviation intervals of the weight gradients with hyperbolic tangents with standard initialization (top) and normalized initialization (bottom) during training. We see that the normalization allows keeping the same variance of the weight gradients across layers during training (top: smaller variance for higher layers).

Table 1: Test error with different activation functions and initialization schemes for deep networks with 5 hidden layers. N after the activation function name indicates the use of normalized initialization. Results in bold are statistically different from non-bold ones under the null hypothesis test with p = 0.005.

Figure 10: 98th percentile (markers alone) and standard deviation (solid lines with markers) of the distribution of activation values for hyperbolic tangent with normalized initialization during learning.
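The phenomenon behind Figures 8-9 (back-propagated gradients shrinking with depth under the standard initialization but staying roughly even under the normalized one) can be reproduced with a small NumPy experiment. The sizes and names below are my own illustrative choices, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def init(n_in, n_out, normalized):
    # standard init:   U[-1/sqrt(n_in), 1/sqrt(n_in)]
    # normalized init: U[-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out))]
    limit = np.sqrt(6.0 / (n_in + n_out)) if normalized else 1.0 / np.sqrt(n_in)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def backprop_grad_stds(normalized, depth=5, width=300, batch=256):
    """Std of the back-propagated gradient at each layer of a deep tanh
    network, just after initialization (bottom layer first)."""
    Ws = [init(width, width, normalized) for _ in range(depth)]
    h = rng.uniform(-1.0, 1.0, size=(batch, width))
    acts = []
    for W in Ws:
        h = np.tanh(h @ W)
        acts.append(h)
    delta = rng.normal(size=h.shape)  # stand-in for the top-level error signal
    stds = []
    for W, a in zip(reversed(Ws), reversed(acts)):
        delta = (delta * (1.0 - a ** 2)) @ W.T  # back through tanh, then weights
        stds.append(float(delta.std()))
    return stds[::-1]

print("standard:  ", backprop_grad_stds(False))
print("normalized:", backprop_grad_stds(True))
```

With the standard scheme each backward step multiplies the gradient variance by roughly n·Var(W) = 1/3 (times a tanh-derivative factor below 1), so the bottom-layer gradient is far smaller than the top-layer one; the normalized scheme makes that product 1.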
These results illustrate the effect of the choice of activation and initialization. As a reference we include in Figure 11 the error curve for the supervised fine-tuning from the initialization obtained after unsupervised pre-training with denoising auto-encoders (Vincent et al., 2008). For each network the learning rate is separately chosen to minimize error on the validation set. We can remark that on Shapeset-3 × 2, because of the task difficulty, we observe important saturations during learning; this might explain why the normalized initialization or the softsign effects are more visible.
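The "softsign" mentioned here is the x/(1+|x|) non-linearity studied earlier in the paper: it has the same (-1, 1) range as tanh but approaches its asymptotes polynomially rather than exponentially, so units saturate more gently. A one-liner for reference:

```python
import numpy as np

def softsign(x):
    # x / (1 + |x|): same range (-1, 1) as tanh, but the tails flatten
    # out polynomially, so pre-activations must grow much larger before
    # the unit is effectively saturated
    return x / (1.0 + np.abs(x))

# At x = 4, tanh is already essentially pinned at 1, softsign is not:
print(np.tanh(4.0), softsign(4.0))
```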
Several conclusions can be drawn from these error curves:

Other methods can alleviate discrepancies between layers during learning, e.g., exploiting second-order information to set the learning rate separately for each parameter. For example, we can exploit the diagonal of the Hessian (LeCun et al., 1998b) or a gradient variance estimate. Both of those methods have been applied for Shapeset-3 × 2 with hyperbolic tangent and standard initialization. We observed a gain in performance but not reaching the result obtained from normalized initialization. In addition, we observed further gains by combining normalized initialization with second-order methods: the estimated Hessian might then focus on discrepancies between units, not having to correct important initial discrepancies between layers.
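The paragraph above mentions setting a learning rate per parameter from a gradient variance estimate, but the paper does not spell out a formula. Below is my own minimal sketch of that idea, using a running second-moment estimate to scale each parameter's step (all names are hypothetical):

```python
import numpy as np

def variance_scaled_step(w, grad, second_moment, lr=0.01, decay=0.9, eps=1e-8):
    """One update that divides the step by a running estimate of the
    gradient's root-mean-square, so parameters with large or noisy
    gradients get a smaller effective learning rate."""
    second_moment = decay * second_moment + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(second_moment) + eps)
    return w, second_moment

# Toy usage on the loss sum(w**2), whose gradient is 2*w:
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
for _ in range(3):
    w, m = variance_scaled_step(w, 2.0 * w, m)
print(w)
```

This is the same mechanism later popularized by adaptive optimizers such as RMSProp; the paper's point is that normalized initialization removes much of the between-layer discrepancy such schemes would otherwise have to correct.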
Figure 11: Test error during online training on the Shapeset-3×2 dataset, for various activation functions and initialization schemes (ordered from top to bottom in decreasing final error). N after the activation function name indicates the use of normalized initialization.

Figure 12: Test error curves during training on MNIST and CIFAR10, for various activation functions and initialization schemes (ordered from top to bottom in decreasing final error). N after the activation function name indicates the use of normalized initialization.
In all reported experiments we have used the same number of units per layer. However, we verified that we obtain the same gains when the layer size increases (or decreases) with layer number. The other conclusions from this study are the following: