Paper: Translation and Commentary on "Understanding the difficulty of training deep feedforward neural networks" (Xavier initialization)
Contents
Understanding the difficulty of training deep feedforward neural networks
Abstract
5 Error Curves and Conclusions
Related articles
Paper: Translation and Commentary on "Understanding the difficulty of training deep feedforward neural networks" (Xavier initialization)
Paper: Translation and Commentary on "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" (He initialization)
DNN optimization techniques: An introduction to parameter initialization in DNNs (LeCun, He, and Xavier initialization) and a detailed usage guide
Understanding the difficulty of training deep feedforward neural networks
Original paper: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
http://proceedings.mlr.press/v9/glorot10a.html
Authors: Xavier Glorot, Yoshua Bengio; DIRO, Université de Montréal, Montréal, Québec, Canada
Citation: [1] Xavier Glorot, Yoshua Bengio; Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR 9:249-256, 2010.
Abstract
Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, and explaining the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.
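The "new initialization scheme" the abstract refers to is the normalized initialization derived later in the paper, now widely known as Xavier (or Glorot) initialization. A minimal NumPy sketch of the uniform variant (the function name is mine):

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Normalized ("Xavier") initialization:
    W ~ U[-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out)],
    which gives Var(W) = 2 / (n_in + n_out)."""
    rng = np.random.default_rng(0) if rng is None else rng
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

# One 256 -> 128 layer; the empirical std should be near sqrt(2 / 384) ~= 0.072.
W = xavier_uniform(256, 128)
print(W.std())
```

The choice of variance balances two constraints from the paper's analysis: keeping activation variance roughly constant in the forward pass and gradient variance roughly constant in the backward pass.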
5 Error Curves and Conclusions
The final consideration that we care for is the success of training with different strategies, and this is best illustrated with error curves showing the evolution of test error as training progresses and asymptotes. Figure 11 shows such curves with online training on Shapeset-3 × 2, while Table 1 gives final test error for all the datasets studied (Shapeset-3 × 2, MNIST, CIFAR-10, and SmallImageNet). As a baseline, we optimized RBF SVM models on one hundred thousand Shapeset examples and obtained 59.47% test error, while on the same set we obtained 50.47% with a depth five hyperbolic tangent network with normalized initialization.
Figure 8: Weight gradient normalized histograms with hyperbolic tangent activation just after initialization, with standard initialization (top) and normalized initialization (bottom), for different layers. Even though with standard initialization the back-propagated gradients get smaller, the weight gradients do not!

Figure 9: Standard deviation intervals of the weight gradients with hyperbolic tangents with standard initialization (top) and normalized initialization (bottom) during training. We see that the normalization allows keeping the same variance of the weight gradients across layers during training (top: smaller variance for higher layers).

Table 1: Test error with different activation functions and initialization schemes for deep networks with 5 hidden layers. N after the activation function name indicates the use of normalized initialization. Results in bold are statistically different from non-bold ones under the null hypothesis test with p = 0.005.

Figure 10: 98th percentile (markers alone) and standard deviation (solid lines with markers) of the distribution of activation values for hyperbolic tangent with normalized initialization during learning.
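The phenomenon behind Figures 8-9 (back-propagated gradients shrinking with depth under the standard initialization but staying roughly even under the normalized one) can be reproduced with a small NumPy experiment. The sizes and names below are my own illustrative choices, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def init(n_in, n_out, normalized):
    # standard init:   U[-1/sqrt(n_in), 1/sqrt(n_in)]
    # normalized init: U[-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out))]
    limit = np.sqrt(6.0 / (n_in + n_out)) if normalized else 1.0 / np.sqrt(n_in)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def backprop_grad_stds(normalized, depth=5, width=300, batch=256):
    """Std of the back-propagated gradient at each layer of a deep tanh
    network, just after initialization (bottom layer first)."""
    Ws = [init(width, width, normalized) for _ in range(depth)]
    h = rng.uniform(-1.0, 1.0, size=(batch, width))
    acts = []
    for W in Ws:
        h = np.tanh(h @ W)
        acts.append(h)
    delta = rng.normal(size=h.shape)  # stand-in for the top-level error signal
    stds = []
    for W, a in zip(reversed(Ws), reversed(acts)):
        delta = (delta * (1.0 - a ** 2)) @ W.T  # back through tanh, then weights
        stds.append(float(delta.std()))
    return stds[::-1]

print("standard:  ", backprop_grad_stds(False))
print("normalized:", backprop_grad_stds(True))
```

With the standard scheme each backward step multiplies the gradient variance by roughly n·Var(W) = 1/3 (times a tanh-derivative factor below 1), so the bottom-layer gradient is far smaller than the top-layer one; the normalized scheme makes that product 1.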
These results illustrate the effect of the choice of activation and initialization. As a reference we include in Figure 11 the error curve for the supervised fine-tuning from the initialization obtained after unsupervised pre-training with denoising auto-encoders (Vincent et al., 2008). For each network the learning rate is separately chosen to minimize error on the validation set. We can remark that on Shapeset-3 × 2, because of the task difficulty, we observe important saturations during learning; this might explain why the normalized initialization or the softsign effects are more visible.
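The "softsign" mentioned here is the x/(1+|x|) non-linearity studied earlier in the paper: it has the same (-1, 1) range as tanh but approaches its asymptotes polynomially rather than exponentially, so units saturate more gently. A one-liner for reference:

```python
import numpy as np

def softsign(x):
    # x / (1 + |x|): same range (-1, 1) as tanh, but the tails flatten
    # out polynomially, so pre-activations must grow much larger before
    # the unit is effectively saturated
    return x / (1.0 + np.abs(x))

# At x = 4, tanh is already essentially pinned at 1, softsign is not:
print(np.tanh(4.0), softsign(4.0))
```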
Several conclusions can be drawn from these error curves:

Other methods can alleviate discrepancies between layers during learning, e.g., exploiting second-order information to set the learning rate separately for each parameter. For example, we can exploit the diagonal of the Hessian (LeCun et al., 1998b) or a gradient variance estimate. Both of those methods have been applied for Shapeset-3 × 2 with hyperbolic tangent and standard initialization. We observed a gain in performance but not reaching the result obtained from normalized initialization. In addition, we observed further gains by combining normalized initialization with second-order methods: the estimated Hessian might then focus on discrepancies between units, not having to correct important initial discrepancies between layers.
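The paragraph above mentions setting a learning rate per parameter from a gradient variance estimate, but the paper does not spell out a formula. Below is my own minimal sketch of that idea, using a running second-moment estimate to scale each parameter's step (all names are hypothetical):

```python
import numpy as np

def variance_scaled_step(w, grad, second_moment, lr=0.01, decay=0.9, eps=1e-8):
    """One update that divides the step by a running estimate of the
    gradient's root-mean-square, so parameters with large or noisy
    gradients get a smaller effective learning rate."""
    second_moment = decay * second_moment + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(second_moment) + eps)
    return w, second_moment

# Toy usage on the loss sum(w**2), whose gradient is 2*w:
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
for _ in range(3):
    w, m = variance_scaled_step(w, 2.0 * w, m)
print(w)
```

This is the same mechanism later popularized by adaptive optimizers such as RMSProp; the paper's point is that normalized initialization removes much of the between-layer discrepancy such schemes would otherwise have to correct.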
Figure 11: Test error during online training on the Shapeset-3×2 dataset, for various activation functions and initialization schemes (ordered from top to bottom in decreasing final error). N after the activation function name indicates the use of normalized initialization.

Figure 12: Test error curves during training on MNIST and CIFAR10, for various activation functions and initialization schemes (ordered from top to bottom in decreasing final error). N after the activation function name indicates the use of normalized initialization.
In all reported experiments we have used the same number of units per layer. However, we verified that we obtain the same gains when the layer size increases (or decreases) with layer number. The other conclusions from this study are the following: