Neural Network Gradient Descent: The Effect of Gradient Descent Optimizers on Neural Network Training

Published: 2023/12/15

co-authored with Apurva Pathak

Experimenting with Gradient Descent Optimizers

Welcome to another instalment in our Deep Learning Experiments series, where we run experiments to evaluate commonly-held assumptions about training neural networks. Our goal is to better understand the different design choices that affect model training and evaluation. To do so, we come up with questions about each design choice and then run experiments to answer them.

In this article, we seek to better understand the impact of using different optimizers:

How do different optimizers perform in practice?
How sensitive is each optimizer to parameter choices such as learning rate or momentum?
How quickly does each optimizer converge?
How much of a performance difference does choosing a good optimizer make?

To answer these questions, we evaluate the following optimizers:

Stochastic gradient descent (SGD)
SGD with momentum
SGD with Nesterov momentum
RMSprop
Adam
Adagrad
Cyclic Learning Rate

How are the experiments set up?

We train a neural net using different optimizers and compare their performance. The code for these experiments can be found on GitHub.

Dataset: we use the Cats and Dogs dataset, which consists of 23,262 images of cats and dogs, split about 50/50 between the two classes. Since the images are differently-sized, we resize them all to the same size. We use 20% of the dataset as validation data (dev set) and the rest as training data.
Evaluation metric: we use the binary cross-entropy loss on the validation data as our primary metric to measure model performance.

Figure 1: Sample images from Cats and Dogs dataset

Base model: we also define a base model that is inspired by VGG16, where we apply (convolution -> max-pool -> ReLU -> batch-norm -> dropout) operations repeatedly. Then, we flatten the output volume and feed it into two fully-connected layers (dense -> ReLU -> batch-norm) with 256 units each, and dropout after the first FC layer. Finally, we feed the result into a one-neuron layer with a sigmoid activation, resulting in an output between 0 and 1 that tells us whether the model predicts a cat (0) or dog (1).

Figure 2: Base model (created with NN SVG)

Training: we use a batch size of 32 and the default weight initialization (Glorot uniform). The default optimizer is SGD with a learning rate of 0.01. We train until the validation loss fails to improve over 50 iterations.
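
The stopping rule above can be sketched in a few lines. This is only a minimal illustration, assuming that "iterations" means epochs and that training exposes a per-epoch validation loss; `run_epoch` and `train_with_early_stopping` are hypothetical names, not taken from the experiment code.

```python
def train_with_early_stopping(run_epoch, patience=50, max_epochs=10_000):
    """Train until the validation loss has not improved for `patience` epochs.

    `run_epoch` is a hypothetical callable: it trains for one epoch and
    returns that epoch's validation loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        val_loss = run_epoch(epoch)
        if val_loss < best_loss:
            # New best: remember it and reset the patience window.
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop.
            break
    return best_loss, best_epoch


# Toy usage: a made-up validation loss that bottoms out at epoch 3, then rises.
fake_losses = [0.9, 0.7, 0.6, 0.5, 0.55, 0.6, 0.65, 0.7]
best_loss, best_epoch = train_with_early_stopping(
    lambda epoch: fake_losses[epoch], patience=3, max_epochs=len(fake_losses)
)
```

The toy run uses a patience of 3 only to keep the example short; the experiments described here use 50.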

Stochastic Gradient Descent

We start with vanilla stochastic gradient descent. This is defined by the following update equation:

Figure 3: SGD update equation

where w is the weight vector and dw is the gradient of the loss function with respect to the weights. This update rule takes a step in the direction of greatest decrease in the loss function, helping us find a set of weights that minimizes the loss. Note that in pure SGD, the update is applied per example, but more commonly it is computed on a batch of examples (called a mini-batch).
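
As a rough sketch of this update rule, assuming the standard form w := w - lr * dw (the figure itself is not reproduced here), the following applies it to a toy quadratic loss. The function name `sgd_step` and the toy loss are illustrative only.

```python
import numpy as np


def sgd_step(w, dw, lr=0.01):
    """One vanilla SGD update: step against the gradient, scaled by the learning rate."""
    return w - lr * dw


# Toy problem (an assumption for illustration): loss = 0.5 * ||w||^2, so the gradient is w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sgd_step(w, dw=w, lr=0.1)
```

Each step shrinks `w` by a factor of 0.9 here, so the weights move steadily towards the minimum at the origin.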

How does learning rate affect SGD?

First, we explore how learning rate affects SGD. It is well known that choosing a learning rate that is too low will cause the model to converge slowly, whereas a learning rate that is too high may cause it to not converge at all.

Figure 4: Effect of the learning rate on convergence (source: Jeremy Jordan’s website)

To verify this experimentally, we vary the learning rate along a log scale between 0.001 and 0.1. Let’s first plot the training losses.
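
A log-scale sweep like this can be generated with NumPy. The number of points (five, i.e. half-decade steps) is an assumption; it is chosen because it reproduces the 0.0316 value discussed in the results.

```python
import numpy as np

# Five learning rates evenly spaced on a log scale from 1e-3 to 1e-1.
# The count of 5 is an assumption, not a detail stated in the article.
learning_rates = np.logspace(-3, -1, num=5)
```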

Figure 5: Training loss curves for SGD with different learning rates

We indeed observe that performance is optimal when the learning rate is neither too small nor too large (the red line). Initially, increasing the learning rate speeds up convergence, but beyond a learning rate of 0.0316, convergence actually becomes slower. This may be because a larger step can overshoot the minimum loss, as illustrated in Figure 4, resulting in a higher loss.

Let’s now plot the validation losses.

Figure 6: Validation loss curves for SGD with different learning rates

We observe that validation performance suffers when we pick a learning rate that is either too small or too big. Too small (e.g. 0.001) and the validation loss does not decrease at all, or does so very slowly. Too large (e.g. 0.1) and the validation loss does not attain as low a minimum as it could with a smaller learning rate.

Let’s now plot the best training and validation loss attained by each learning rate*:

Figure 7: Minimum training and validation losses for SGD at different learning rates

The data above confirm the ‘Goldilocks’ theory of picking a learning rate that is neither too small nor too large, since the best learning rate (3.2e-2) is in the middle of the range of values we tried.

*Typically, we would expect the validation loss to be higher than the training loss, since the model has not seen the validation data before. However, we see above that the validation loss is surprisingly sometimes lower than the training loss. This could be due to dropout, since neurons are dropped only at training time and not during evaluation, resulting in better performance during evaluation than during training. The effect may be particularly pronounced when the dropout rate is high, as it is in our model (0.6 dropout on FC layers).

Best SGD validation loss

Best validation loss: 0.1899
Associated training loss: 0.1945
Epochs to converge to minimum: 535
Params: learning rate 0.032

SGD takeaways

Choosing a good learning rate (not too big, not too small) is critical for ensuring optimal performance on SGD.

Stochastic Gradient Descent with Momentum

Overview

SGD with momentum is a variant of SGD that typically converges more quickly than vanilla SGD. It is typically defined as follows:

Figure 8: Update equations for SGD with momentum

Deep Learning by Goodfellow et al. explains the physical intuition behind the algorithm [0]:

Formally, the momentum algorithm introduces a variable v that plays the role of velocity — it is the direction and speed at which the parameters move through parameter space. The velocity is set to an exponentially decaying average of the negative gradient.

In other words, the parameters move through the parameter space at a velocity that changes over time. The change in velocity is dictated by two terms:

𝛼, the learning rate, which determines to what degree the gradient acts upon the velocity
𝛽, the rate at which the velocity decays over time

Thus, the velocity is an exponential average of the gradients, which incorporates new gradients and naturally decays old gradients over time.

One can imagine a ball rolling down a hill, gathering velocity as it descends. Gravity exerts force on the ball, causing it to accelerate or decelerate, as represented by the gradient term 𝛼 * dw. The ball also encounters viscous drag, causing its velocity to decay, as represented by 𝛽.

One effect of momentum is to accelerate updates along dimensions where the gradient direction is consistent. For example, consider the effect of momentum when the gradient is a constant c:

Figure 9: Change in velocity over time when gradient is a constant c.

Whereas vanilla SGD would make an update of -𝛼c each time, SGD with momentum accelerates over time, eventually reaching a terminal velocity that is 1/(1-𝛽) times the vanilla update (derived using the formula for the sum of an infinite geometric series). For example, if we set the momentum to 𝛽=0.9, then the update eventually becomes 10 times as large as the vanilla update.
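
The terminal velocity can be checked numerically. The sketch below assumes the velocity update v := 𝛽 * v - 𝛼 * dw (consistent with the gradient term 𝛼 * dw described above) and a constant gradient c; the function name and the specific values are illustrative.

```python
def momentum_velocities(c, lr, beta, steps):
    """Velocity after each step when the gradient is always the constant c."""
    v = 0.0
    history = []
    for _ in range(steps):
        v = beta * v - lr * c  # decay the old velocity, then add the new gradient step
        history.append(v)
    return history


velocities = momentum_velocities(c=1.0, lr=0.01, beta=0.9, steps=200)
vanilla_update = -0.01 * 1.0            # what plain SGD would apply at every step
ratio = velocities[-1] / vanilla_update  # approaches 1 / (1 - beta) = 10
```

The first velocity equals the vanilla update, and the ratio then grows geometrically towards 1/(1-𝛽).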

Another effect of momentum is that it dampens oscillations. For example, consider a case when the gradient zigzags and changes direction often along a certain dimension:

Figure 10 (source: More on Optimization Techniques by Ekaba Bisong)

The momentum term dampens the oscillations because the oscillating terms cancel out when we add them into the velocity. This allows the update to be dominated by dimensions where the gradient points consistently in the same direction.

Experiments

Let’s look at the effect of momentum at learning rate 0.01. We try out momentum values [0, 0.5, 0.9, 0.95, 0.99].

Figure 11: Effect of momentum on training loss (left) and validation loss (right) at learning rate 0.01.

Above, we can see that increasing momentum up to 0.9 helps model training converge more quickly, since training and validation loss decrease at a faster rate. However, once we go past 0.9, we observe that training loss and validation loss actually suffer, with model training entirely failing to converge at momentum 0.99. Why does this happen? This could be because excessively large momentum prevents the model from adapting to new directions in the gradient updates. Another potential reason is that the weight updates become so large that they overshoot the minimum. However, this remains an area for future investigation.

Do we observe the decrease in oscillation that is touted as a benefit of momentum? To measure this, we can compute an oscillation proportion for each update step — i.e. what proportion of parameter updates in the current update have the opposite sign compared to the previous update. Indeed, increasing the momentum decreases the proportion of parameters that oscillate:

Figure 12: Effect of momentum on oscillation
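
One way to compute the oscillation proportion described above is to compare the signs of two consecutive update vectors. This is a sketch under that reading of the definition; the function name and the choice to ignore zero-valued updates are assumptions.

```python
import numpy as np


def oscillation_proportion(prev_update, curr_update):
    """Fraction of parameters whose update changed sign since the previous step.

    Parameters with a zero update in either step are counted as not oscillating.
    """
    flipped = np.sign(prev_update) * np.sign(curr_update) < 0
    return float(np.mean(flipped))


# Made-up updates for four parameters: the middle two change sign.
prev_update = np.array([0.5, -0.2, 0.1, -0.4])
curr_update = np.array([0.3, 0.1, -0.2, -0.1])
proportion = oscillation_proportion(prev_update, curr_update)
```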

What about the size of the updates — does the acceleration property of momentum increase the average size of the updates? Interestingly, the higher the momentum, the larger the initial updates but the smaller the later updates:

Figure 13: Effect of momentum on average update size

Thus, increasing the momentum results in taking larger initial steps but smaller later steps. Why would this be the case? This is likely because momentum initially benefits from acceleration, causing the initial steps to be larger. Later, the momentum causes oscillations to cancel out, which could make the later steps smaller.

One data point that supports this interpretation is the distance traversed per epoch (defined as the Euclidean distance between the weights at the beginning of the epoch and the weights at the end of the epoch). We see that even though larger momentum values take smaller later steps, they actually traverse more distance:

Figure 14: Distance traversed per epoch for each momentum value.

This indicates that even though increasing the momentum values causes the later update steps to become smaller, the distance traversed is actually greater because the steps are more efficient — they do not cancel each other out as often.
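
The difference between step size and distance traversed can be illustrated with made-up update sequences. In this sketch, the distance traversed over an epoch is the norm of the summed updates, which equals the Euclidean distance between the start and end weights; the helper name and the data are hypothetical.

```python
import numpy as np


def epoch_stats(updates):
    """Return (mean step size, distance traversed) for one epoch.

    `updates` has shape (num_steps, num_params), one row per update step.
    """
    mean_step_size = float(np.mean(np.linalg.norm(updates, axis=1)))
    distance = float(np.linalg.norm(updates.sum(axis=0)))
    return mean_step_size, distance


# Large steps that zigzag and cancel out, versus smaller steps in a consistent direction.
zigzag = np.array([[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
steady = np.array([[0.5, 0.0], [0.5, 0.0], [0.5, 0.0], [0.5, 0.0]])
zigzag_stats = epoch_stats(zigzag)
steady_stats = epoch_stats(steady)
```

The zigzag sequence takes steps twice as large but ends where it started, while the steady sequence covers a net distance of 2.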

Now, let’s look at the effect of momentum on a small learning rate (0.001).

Figure 15: Effect of momentum on training loss (left) and validation loss (right) at learning rate 0.001.

Surprisingly, increasing momentum at a small learning rate helps training converge, when it did not before! Now, let’s look at a large learning rate.

Figure 16: Effect of momentum on training loss (left) and validation loss (right) at learning rate 0.1.

When the learning rate is large, increasing the momentum degrades performance, and can even result in the model failing to converge (see flat lines above corresponding to momentum 0.9 and 0.95).

Now, to generalize our observations, let’s look at the minimum training loss and validation loss across all learning rates and momentums:

Figure 17: Minimum training loss (left) and validation loss (right) at different learning rates and momentums. Minimum value in each row is highlighted in green.

We see that the learning rate and the momentum are closely linked: the higher the learning rate, the lower the range of ‘acceptable’ momentum values (i.e. values that don’t cause the model training to diverge). Conversely, the higher the momentum, the lower the range of acceptable learning rates.

Altogether, the behavior across all the learning rates suggests that increasing momentum has an effect akin to increasing the learning rate. It helps smaller learning rates converge (Figure 15) but may cause larger ones to diverge (Figure 16). This makes sense if we consider the terminal velocity interpretation from Figure 9: adding momentum can cause the updates to reach a terminal velocity much greater than the vanilla updates themselves.

Note, however, that this does not mean that increasing momentum is the same as increasing the learning rate — there are simply some similarities in terms of convergence/divergence behavior between increasing momentum and increasing the learning rate. More concretely, as we can see in Figures 12 and 13, momentum also decreases oscillations, and front-loads the large updates at the beginning of training — we would not observe the same behaviors if we simply increased the learning rate.

Alternative formulation of momentum

There is another way to define momentum, expressed as follows:

Figure 18: Alternative definition of momentum

Andrew Ng uses this definition of momentum in his Deep Learning Specialization on Coursera. In this formulation, the velocity term is an exponential moving average of the gradients, controlled by the parameter beta. The update is applied to the weights, with the size of the update controlled by the learning rate alpha. Note that, when expanded, this formulation is mathematically the same as the first one, except that all the terms are multiplied by (1 - beta).

How does this formulation of momentum work in practice?

Figure 19: Effect of momentum (alternative formulation) on training loss (left) and validation loss (right)

Surprisingly, using this alternative formulation, it looks like increasing the momentum actually slows down convergence!

Why would this be the case? This formulation of momentum, while dampening oscillations, does not enjoy the same benefit of acceleration that the other formulation does. If we consider a toy example where the gradient is always a constant c, we see that the velocity never accelerates:

Figure 20: Change in velocity over time with repeated gradients of constant c
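
The same toy example can be run numerically. This sketch assumes the exponential moving average form v := beta * v + (1 - beta) * dw for the alternative formulation; the function name and values are illustrative.

```python
def ema_velocities(c, beta, steps):
    """Velocity after each step under the moving-average formulation, for a constant gradient c."""
    v = 0.0
    history = []
    for _ in range(steps):
        v = beta * v + (1 - beta) * c  # weighted average of old velocity and new gradient
        history.append(v)
    return history


history = ema_velocities(c=1.0, beta=0.9, steps=200)
```

The velocity climbs from (1 - beta) * c towards c but never exceeds it, so the update never grows beyond the size of a vanilla update.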

Indeed, Andrew Ng suggests that the main benefit of this formulation of momentum is not acceleration, but the fact that it dampens oscillations, allowing you to use a larger learning rate and therefore converge more quickly. Based on our experiments, increasing momentum by itself (in this formulation) without increasing the learning rate is not enough to guarantee faster convergence.

Best validation loss on SGD with momentum

Best validation loss: 0.2046
Associated training loss: 0.2252
Epochs to converge to minimum: 402
Params: learning rate 0.01, momentum 0.5

SGD with momentum takeaways

Momentum causes model training to converge more quickly, but is not guaranteed to improve the final training or validation loss, based on the parameters we tested.
The higher the learning rate, the lower the range of acceptable momentum values (ones where model training converges).

Stochastic Gradient Descent with Nesterov Momentum

One issue with momentum is that while the gradient always points in the direction of greatest loss decrease, the momentum may not. To correct for this, Nesterov momentum computes the gradient at a lookahead point (w + velocity) instead of w. This gives the gradient a chance to correct for the momentum term.

Figure 21: Nesterov update. Left: illustration. Right: equations.

To illustrate how Nesterov can help training converge more quickly, let’s look at a dummy example where the optimizer tries to descend a bowl-shaped loss surface, with the minimum at the center of the bowl.

Figure 22. Left: regular momentum. Right: Nesterov momentum.

As the illustrations show, Nesterov converges more quickly because it computes the gradient at a lookahead point, thus ensuring that the update approaches the minimizer more quickly.
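
A bowl-shaped toy problem like this is easy to simulate. The sketch below is an assumption-laden illustration: the bowl's curvature values are made up, and the lookahead point is taken as w + beta * v (one common way of writing Nesterov momentum), which may differ in detail from the equations in Figure 21.

```python
import numpy as np

# Hypothetical bowl: loss = 0.5 * sum(CURVATURE * w**2), with its minimum at the origin.
CURVATURE = np.array([1.0, 10.0])


def loss(w):
    return 0.5 * float(np.sum(CURVATURE * w ** 2))


def grad(w):
    return CURVATURE * w


def run(nesterov, lr=0.1, beta=0.9, steps=200):
    """Descend the bowl with regular or Nesterov momentum and return the final loss."""
    w = np.array([1.0, 1.0])
    v = np.zeros_like(w)
    for _ in range(steps):
        # Nesterov evaluates the gradient at a lookahead point instead of at w.
        point = w + beta * v if nesterov else w
        v = beta * v - lr * grad(point)
        w = w + v
    return loss(w)


momentum_loss = run(nesterov=False)
nesterov_loss = run(nesterov=True)
```

Both variants reach the bottom of this bowl; printing the loss at intermediate steps is a simple way to compare how quickly each one gets there for a given learning rate and momentum.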

Let’s try out Nesterov on a subset of the learning rates and momentums we used for regular momentum, and see if it speeds up convergence. Let’s take a look at learning rate 0.001 and momentum 0.95:

Figure 23: Effect of Nesterov momentum on lr 0.001 and momentum 0.95.

Here, Nesterov does indeed seem to speed up convergence considerably! How about if we increase the momentum to 0.99?

Figure 24: Effect of Nesterov momentum on lr 0.001 and momentum 0.99.

Now, Nesterov actually converges more slowly on the training loss, and though it initially converges more quickly on validation loss, it slows down and is overtaken by momentum after around 50 epochs.

How should we measure speed of convergence over all the training runs? Let’s take the loss that regular momentum achieves after 50 epochs, then determine how many epochs Nesterov takes to reach that same loss. We define the convergence ratio as this number of epochs divided by 50. If it is less than one, then Nesterov converges more quickly than regular momentum; conversely, if it is greater than one, then Nesterov converges more slowly.
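
The convergence ratio can be computed directly from two per-epoch loss histories. The function below is a sketch of that definition; its name, the `None` return for runs that never reach the target, and the toy loss curves are all assumptions for illustration.

```python
def convergence_ratio(momentum_losses, nesterov_losses, reference_epochs=50):
    """Epochs Nesterov needs to match momentum's loss at `reference_epochs`, divided by `reference_epochs`.

    Both arguments are per-epoch loss lists; index 0 holds the loss after epoch 1.
    """
    target = momentum_losses[reference_epochs - 1]
    for epoch, value in enumerate(nesterov_losses, start=1):
        if value <= target:
            return epoch / reference_epochs
    return None  # Nesterov never reached the target loss


# Made-up loss curves in which Nesterov decreases twice as fast.
momentum_losses = [1.0 / e for e in range(1, 101)]
nesterov_losses = [1.0 / (2 * e) for e in range(1, 101)]
ratio = convergence_ratio(momentum_losses, nesterov_losses)
```

In this toy case Nesterov matches the 50-epoch momentum loss at epoch 25, giving a ratio of 0.5.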

Figure 25. Ratio of epochs for Nesterov’s loss to converge to the regular momentum’s loss after 50 epochs. Training runs where Nesterov was faster are highlighted in green; slower in red; and runs where neither Nesterov nor regular momentum converged in yellow.

We see that in most cases (10/14) adding Nesterov causes the training loss to decrease more quickly, as seen in Table 5. The same applies to a lesser extent (8/12) for the validation loss, in Table 6.

There does not seem to be a clear relationship between the speedup from adding Nesterov and the other parameters (learning rate and momentum), though this can be an area for future investigation.

Best validation loss on SGD with Nesterov momentum

Best validation loss: 0.2020
Associated training loss: 0.1945
Epochs to converge to minimum: 414
Params: learning rate 0.003, momentum 0.95

Figure 26. Minimum training and validation losses achieved by each training run. Minimum in each row is highlighted in green.

SGD with Nesterov momentum takeaways

Nesterov momentum computes the gradient at a lookahead point in order to account for the effect of momentum.
Nesterov generally converges more quickly compared to regular momentum.

RMSprop

The main idea of RMSprop is to divide the gradient by an exponential average of its recent magnitude. The update equations are as follows:

Figure 27: RMSprop update equations (taken from the Deep Learning Specialization by Andrew Ng)

RMSprop tries to normalize the size of the updates across different weights — in other words, reducing the update size when the gradient is large, and increasing it when the gradient is small. As an example, consider a weight parameter where the gradients are [5, 5, 5] (and assume that 𝛼=1). The denominator in the second equation is then 5, so the updates applied would be -[1, 1, 1]. Now, consider a weight parameter where the gradients are [0.5, 0.5, 0.5]; the denominator would be 0.5, giving the same updates -[1, 1, 1] as the previous case! In other words, RMSprop cares more about the direction (+ or -) of each weight than the magnitude, and tries to normalize the size of the update step for each of these weights.

This is different from vanilla SGD, which applies larger updates for weight parameters with larger gradients. Considering the above example where the gradient is [5, 5, 5], we can see that the resulting updates would be -[5, 5, 5], whereas for the [0.5, 0.5, 0.5] case the updates would be -[0.5, 0.5, 0.5]. Vanilla SGD thus is different from RMSprop in that the larger the gradient, the larger the update.
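
The normalizing effect can be reproduced with a small simulation. This sketch assumes the moving-average form s := rho * s + (1 - rho) * dw^2 with the update -lr * dw / (sqrt(s) + eps); the `eps` term and the function name are assumptions, and the running average starts at zero.

```python
import numpy as np


def rmsprop_updates(grads, lr=1.0, rho=0.9, eps=1e-8):
    """Updates applied to a single weight that receives the given sequence of gradients."""
    s = 0.0
    updates = []
    for g in grads:
        s = rho * s + (1 - rho) * g ** 2            # running average of squared gradients
        updates.append(-lr * g / (np.sqrt(s) + eps))  # gradient scaled by its recent magnitude
    return updates


large = rmsprop_updates([5.0] * 200)
small = rmsprop_updates([0.5] * 200)
```

Because the running average starts at zero, the very first updates are larger than 1 in magnitude; once the average has warmed up, both weights receive updates of about -1, regardless of whether the gradient is 5 or 0.5.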

How do learning rate and rho affect RMSprop?

Let’s try out RMSprop while varying the learning rate 𝛼 (default 0.001) and the coefficient 𝜌 (default 0.9). Let’s first try setting 𝜌 = 0 and vary the learning rate:

Figure 28: RMSprop training loss at different learning rates, with rho = 0.

First lesson learned: don’t use RMSprop with 𝜌=0! This results in the update being as follows:

Figure 29: RMSprop when rho = 0
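
To see why 𝜌=0 is a poor choice, note that the running average then collapses to the current squared gradient, so each update is just the learning rate times the sign of the gradient. The sketch below assumes the same moving-average form of RMSprop with a small `eps` in the denominator; the function name is hypothetical.

```python
import numpy as np


def rmsprop_step_rho_zero(dw, lr=0.001, eps=1e-8):
    """RMSprop update with rho = 0: the running average is just the current squared gradient."""
    s = dw ** 2
    return -lr * dw / (np.sqrt(s) + eps)


# Gradients of very different sizes all produce an update of magnitude ~lr.
dw = np.array([5.0, -0.5, 0.01])
update = rmsprop_step_rho_zero(dw)
```

All gradient magnitude information is discarded, which is consistent with the poor training behavior seen at rho = 0.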

Let’s try again over nonzero rho values. We first plot the training and validation losses for a small learning rate (1e-3).

Figure 30: RMSprop at different rho values, with learning rate 1e-3.

Increasing rho seems to reduce both the training loss and validation loss, but with diminishing returns — the validation loss ceases to improve when increasing rho from 0.95 to 0.99.

Let’s now take a look at what happens when we use a larger learning rate.

Figure 31: RMSprop at different rho values, with learning rate 3e-2.

Here, the training and validation losses entirely fail to converge!

Let’s take a look at the minimum training and validation losses across all parameters:

Figure 32: Minimum training loss (left) and minimum validation loss (right) on RMSprop across different learning rates and rho values. Minimum value in each row is highlighted in green.

From the plots above, we find that once the learning rate reaches 0.01 or higher, RMSprop fails to converge. Thus, the optimal learning rate found here is around ten times smaller than the optimal learning rate on SGD! One hypothesis is that the denominator term is much smaller than one, so it effectively scales up the update. Thus, we need to adjust the learning rate downward to compensate.

Regarding 𝜌, we can see from the graphs above that RMSprop performs best on our data with high 𝜌 values (0.9 to 1). Even though the Keras docs recommend using the default value of 𝜌=0.9, it’s worth exploring other values as well: when we increased rho from 0.9 to 0.95, the best attained validation loss improved substantially, from 0.2226 to 0.2061.

Best validation loss on RMSprop

Best validation loss: 0.2061
Associated training loss: 0.2408
Epochs to converge to minimum: 338
Params: learning rate 0.001, rho 0.95

RMSprop takeaways

RMSprop seems to work at much smaller learning rates than vanilla SGD (about ten times smaller). This is likely because we divide the original update (dw) by the square root of the averaged squared gradient.
Additionally, it seems to pay off to explore different values of 𝜌, contrary to the Keras docs’ recommendation to use the default value.

Adam

Adam is sometimes regarded as the optimizer of choice, as it has been shown to converge more quickly than SGD and other optimization methods [1]. It is essentially a combination of SGD with momentum and RMSprop. It uses the following update equations:

Figure 33: Adam update equations

Essentially, we keep a velocity term similar to the one in momentum: an exponential average of the gradients. We also keep a squared term, which is an exponential average of the squares of the gradients, similar to RMSprop. We then bias-correct both terms by dividing each by (1 - beta^t), where t is the step number; otherwise, the exponential averages would start off too low at the beginning, since there are no previous terms to average over. Finally, we divide the corrected velocity by the square root of the corrected squared term, and use that (scaled by the learning rate) as our update.
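
A single Adam step following this description can be sketched as below. It uses the standard bias correction, dividing by (1 - beta^t) with t counted from 1; the function name, the `eps` term, and the example gradients are assumptions for illustration.

```python
import numpy as np


def adam_step(w, dw, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. `t` is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * dw       # velocity: moving average of gradients
    v = beta2 * v + (1 - beta2) * dw ** 2  # squared term: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias-corrected velocity
    v_hat = v / (1 - beta2 ** t)           # bias-corrected squared term
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v


w = np.array([1.0, -1.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
dw = np.array([0.3, -20.0])  # made-up gradients of very different sizes
w_new, m, v = adam_step(w, dw, m, v, t=1)
```

On the first step, the bias correction makes the update approximately lr * sign(dw) for every weight, so both weights move by about 0.001 despite the very different gradient magnitudes.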

How does learning rate affect Adam?

It has been suggested that the learning rate is more important than the β1 and β2 parameters, so let’s try varying the learning rate first, on a log scale from 1e-4 to 1:

Figure 34: Training loss (left) and validation loss (right) on Adam across learning rates.

We did not plot learning rates above 0.03, since they failed to converge. We see that as we increase the learning rate, the training and validation loss decrease more quickly — but only up to a certain point. Once we increase the learning rate beyond 0.001, the training and validation loss both start to become worse. This could be due to the ‘overshooting’ behavior illustrated in Figure 4.

So, which of the learning rates is the best? Let’s find out by plotting the best validation loss of each one.

Figure 35: Minimum training and validation loss on Adam across different learning rates.

We see that the validation loss on learning rate 0.001 (which happens to be the default learning rate) seems to be the best, at 0.2059. The corresponding training loss is 0.2077. However, this is still worse than the best SGD run, which achieved a validation loss of 0.1899 and training loss of 0.1945. Can we somehow beat that? Let’s try varying β1 and β2 and see.

How do β1 and β2 affect Adam?

We try the following values for β1 and β2:

beta_1_values = [0.5, 0.9, 0.95]
beta_2_values = [0.9, 0.99, 0.999]

Figure 36: Training loss (left) and validation loss (right) across different values for beta_1 and beta_2.

Figure 37: Minimum training losses (left) and minimum validation losses (right). Minimum value in each row highlighted in green.

The best run is β1=0.5 and β2=0.999, which achieves a training loss of 0.2071 and validation loss of 0.2021. We can compare this against the default Keras params for Adam (β1=0.9 and β2=0.999), which achieves 0.2077 and 0.2059, respectively. Thus, it pays off slightly to experiment with different values of beta_1 and beta_2, contrary to the recommendation in the Keras docs — but the improvement is not large.
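
To make the role of β1 and β2 concrete, here is a small sketch of the standard Adam update applied to a toy one-dimensional problem, swept over the same grid of beta values. The toy objective f(x) = x², the starting point, the step count, and the helper name `adam_minimize` are all illustrative assumptions; this is not the training code used for the experiments.

```python
import itertools
import math

beta_1_values = [0.5, 0.9, 0.95]
beta_2_values = [0.9, 0.99, 0.999]

def adam_minimize(grad_fn, x0, lr=0.001, beta_1=0.9, beta_2=0.999,
                  eps=1e-7, steps=100):
    """Run the standard Adam update on a scalar parameter (illustrative helper)."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta_1 * m + (1 - beta_1) * g        # moving average of gradients
        v = beta_2 * v + (1 - beta_2) * g * g    # moving average of squared gradients
        m_hat = m / (1 - beta_1 ** t)            # bias correction
        v_hat = v / (1 - beta_2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy objective f(x) = x^2, whose gradient is 2x (an assumption for illustration).
grid = list(itertools.product(beta_1_values, beta_2_values))
results = {
    (b1, b2): adam_minimize(lambda x: 2 * x, 5.0, beta_1=b1, beta_2=b2)
    for b1, b2 in grid
}
for (b1, b2), x_final in results.items():
    print(f"beta_1={b1}, beta_2={b2}: x after 100 steps = {x_final:.4f}")
```

β1 controls how much past gradients are averaged into the update direction, while β2 controls how quickly the per-parameter scaling term forgets old squared gradients.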

Surprisingly, we were not able to beat the best SGD performance! It turns out that others have noticed that Adam sometimes works worse than SGD with momentum or other optimization algorithms [2]. While the reasons for this are beyond the scope of this article, it suggests that it pays off to experiment with different optimizers to find the one that works the best for your data.

Best Adam validation loss

Best validation loss: 0.2021
Associated training loss: 0.2071
Epochs to converge to minimum: 255
Params: learning rate 0.001, β1=0.5, and β2=0.999

Adam takeaways

Adam is not guaranteed to achieve the best training and validation performance compared to other optimizers, as we found that SGD outperforms Adam.
Trying out non-default values for β1 and β2 can slightly improve the model’s performance.

Adagrad

Adagrad accumulates the squares of gradients, and divides the update by the square root of this accumulator term.

Figure 38: Adagrad update equation [3]

This is similar to RMSprop, but the difference is that it simply accumulates the squares of the gradients, without using an exponential average. This should result in the size of the updates decaying over time.
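
The difference can be sketched in a few lines of Python. The functions below follow the commonly used forms of the two updates for a single scalar parameter; the epsilon value, the `rho` default, and the constant gradient of 1.0 are assumptions chosen for illustration only.

```python
import math

def adagrad_step(x, g, acc, lr, eps=1e-7):
    """Adagrad: accumulate the raw sum of squared gradients."""
    acc = acc + g * g
    return x - lr * g / (math.sqrt(acc) + eps), acc

def rmsprop_step(x, g, avg, lr, rho=0.9, eps=1e-7):
    """RMSprop: keep an exponential moving average of squared gradients."""
    avg = rho * avg + (1 - rho) * g * g
    return x - lr * g / (math.sqrt(avg) + eps), avg

# Feed both optimizers a constant gradient of 1.0 and record the step sizes.
x_a = x_r = 0.0
acc = avg = 0.0
adagrad_steps, rmsprop_steps = [], []
for _ in range(100):
    new_x, acc = adagrad_step(x_a, 1.0, acc, lr=0.1)
    adagrad_steps.append(abs(new_x - x_a))
    x_a = new_x
    new_x, avg = rmsprop_step(x_r, 1.0, avg, lr=0.1)
    rmsprop_steps.append(abs(new_x - x_r))
    x_r = new_x

print("Adagrad first/last step:", adagrad_steps[0], adagrad_steps[-1])
print("RMSprop first/last step:", rmsprop_steps[0], rmsprop_steps[-1])
```

With a constant gradient, the Adagrad step shrinks in proportion to 1/sqrt(t), while the RMSprop step settles near the learning rate because old squared gradients are forgotten.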

Let’s try Adagrad at different learning rates, from 0.001 to 1.

Figure 39: Adagrad at different learning rates. Left: training loss. Right: validation loss.

The best training and validation losses are 0.2057 and 0.2310, using a learning rate of 3e-1 (more precisely, 0.316). Interestingly, if we compare with SGD using the same learning rates, we notice that Adagrad keeps pace with SGD initially but starts to fall behind in later epochs.

Figure 40: Adagrad vs SGD at same learning rate. Left: training loss. Right: validation loss.

This is likely because Adagrad initially is dividing by a small number, since the gradient accumulator term has not accumulated many gradients yet. This makes the update comparable to that of SGD in the initial epochs. However, as the accumulator term accumulates more gradient, the size of the Adagrad updates decreases, and so the loss begins to flatten or even rise as it becomes more difficult to reach the minimizer.

Surprisingly, we observe the opposite effect when we use a large learning rate (3e-1):

Figure 41: Adagrad vs SGD at large learning rate (0.316). Left: training loss. Right: validation loss.

At large learning rates, Adagrad actually converges more quickly than SGD! One possible explanation is that while large learning rates cause SGD to take excessively large update steps, Adagrad divides the updates by the accumulator terms, essentially making the updates smaller and more ‘optimal.’

Let’s look at the minimum training and validation losses across all params:

Figure 42: Minimum training and validation losses for Adagrad.

We can see that the best learning rate for Adagrad, 0.316, is significantly larger than that for SGD, which was 0.03. As mentioned above, this is most likely because Adagrad divides by the accumulator terms, causing the effective size of the updates to be smaller.

Best validation loss on Adagrad

Best validation loss: 0.2310
Associated training loss: 0.2057
Epochs to converge to minimum: 406
Params: learning rate 0.316

Adagrad takeaways

Adagrad accumulates the squares of gradients, then divides the update by the square root of the accumulator term.
The size of Adagrad updates decreases over time.
The optimal learning rate for Adagrad is larger than for SGD (at least 10x in our case).

Cyclic Learning Rate

Cyclic Learning Rate is a method that lets the learning rate vary cyclically between a min and max value [4]. It claims to eliminate the need to tune the learning rate, and can help the model training converge more quickly.

Figure 43: Cyclic learning rate using a triangular cycle

We try the cyclic learning rate with reasonable learning rate bounds (base_lr=0.1, max_lr=0.4), and a step size equal to 4 epochs, which is within the 4–8 range suggested by the author.
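
A triangular schedule of this kind can be sketched as follows, using the formula from the CLR paper [4]. The function name `triangular_clr` is hypothetical, and for simplicity the sketch counts in epochs; in practice the step size is usually expressed in iterations (batches), so a step size of 4 epochs would be 4 times the number of batches per epoch.

```python
import math

def triangular_clr(iteration, step_size, base_lr=0.1, max_lr=0.4):
    """Triangular cyclic learning rate (hypothetical helper).

    The rate rises linearly from base_lr to max_lr over step_size
    iterations, then falls back to base_lr over the next step_size.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

# One full cycle with a step size of 4 (counted in epochs here for brevity).
schedule = [triangular_clr(epoch, step_size=4) for epoch in range(9)]
print([round(lr, 3) for lr in schedule])
```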

Figure 44: Cyclic learning rate. Left: training loss. Right: validation loss.

We observe cyclic oscillations in the training loss, due to the cyclic changes in the learning rate. We also see these oscillations, to a lesser extent, in the validation loss.

Best CLR training and validation loss

Best validation loss: 0.2318
Associated training loss: 0.2267
Epochs to converge to minimum: 280
Params: Used the settings mentioned above. However, we may be able to obtain better performance by tuning the cycle policy (e.g. by allowing the max and min bounds to decay) or by tuning the max and min bounds themselves. Note that this tuning may offset the time savings that CLR purports to offer.

CLR takeaways

CLR varies the learning rate cyclically between a min and max bound.
CLR may potentially eliminate the need to tune the learning rate while attaining similar performance. However, we did not attain similar performance.

Comparison

So, after all the experiments above, which optimizer ended up working the best? Let’s take the best run from each optimizer, i.e. the one with the lowest validation loss:

Figure 45: Best validation loss achieved by each optimizer.

Surprisingly, SGD achieves the best validation loss, and by a significant margin. Then, we have SGD with Nesterov momentum, Adam, SGD with momentum, and RMSprop, which all perform similarly to one another. Finally, Adagrad and CLR come in last, with losses significantly higher than the others.

What about training loss? Let’s plot the training loss for the runs selected above:

Figure 46: Training loss achieved by each optimizer for best runs selected above.

Here, we see some correlation with the validation loss, but Adagrad and CLR perform better than their validation losses would imply.

What about convergence? Let’s first take a look at how many epochs it takes each optimizer to converge to its minimum validation loss:

Figure 47: Number of epochs to converge to the minimum validation loss.

Adam is clearly the fastest, while SGD is the slowest.

However, this may not be a fair comparison, since the minimum validation loss for each optimizer is different. How about measuring how many epochs it takes each optimizer to reach a fixed validation loss? Let’s take the worst minimum validation loss of 0.2318 (the one achieved by CLR), and compute how many epochs it takes each optimizer to reach that loss.
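
This fixed-threshold measurement is straightforward to compute from per-epoch validation losses. In the sketch below, the helper name `epochs_to_reach` and the two loss histories are made up for illustration; they are not results from our experiments.

```python
def epochs_to_reach(val_losses, target):
    """Return the first (1-indexed) epoch whose validation loss is at or
    below target, or None if the target is never reached."""
    for epoch, loss in enumerate(val_losses, start=1):
        if loss <= target:
            return epoch
    return None

# Made-up validation-loss histories, purely for illustration.
histories = {
    "optimizer_a": [0.60, 0.30, 0.22, 0.21],
    "optimizer_b": [0.65, 0.45, 0.30, 0.23, 0.20],
}
target = 0.2318  # worst minimum validation loss across optimizers (CLR)

for name, losses in histories.items():
    print(name, epochs_to_reach(losses, target))
```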

Figure 48: Number of epochs to converge to worst minimum validation loss (0.2318, achieved by CLR).

Again, we can see that Adam does converge more quickly to the given loss than any other optimizer, which is one of its purported advantages. Surprisingly, SGD with momentum seems to converge more slowly than vanilla SGD! This is because the learning rate used by the best SGD with momentum run is lower than that used by the best vanilla SGD run. If we hold the learning rate constant, we see that momentum does in fact speed up convergence:

Figure 49: Comparing SGD and SGD with momentum.

As seen above, the best vanilla SGD run (blue) converges more quickly than the best SGD with momentum run (orange), since its learning rate is higher, at 0.03 compared to the latter’s 0.01. However, when we hold the learning rate constant by comparing with vanilla SGD at learning rate 0.01 (green), we see that adding momentum does indeed speed up convergence.
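
The same effect shows up on a toy problem. The sketch below runs SGD with and without momentum at the same learning rate on f(x) = 0.5·x²; the objective, the momentum value of 0.9, the tolerance, and the helper name `steps_to_converge` are assumptions for illustration, not the settings of the runs plotted above.

```python
def steps_to_converge(lr, momentum=0.0, x0=1.0, tol=0.1, max_steps=10_000):
    """Count SGD steps until |x| first drops below tol on f(x) = 0.5 * x**2."""
    x, velocity = x0, 0.0
    for step in range(1, max_steps + 1):
        grad = x                                   # gradient of 0.5 * x**2
        velocity = momentum * velocity - lr * grad  # momentum=0 gives vanilla SGD
        x += velocity
        if abs(x) < tol:
            return step
    return None

vanilla = steps_to_converge(lr=0.01)
with_momentum = steps_to_converge(lr=0.01, momentum=0.9)
print("vanilla SGD steps:", vanilla)
print("SGD with momentum steps:", with_momentum)
```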

Why does Adam fail to beat vanilla SGD?

As mentioned in the Adam section, others have also noticed that Adam sometimes works worse than SGD with momentum or other optimization algorithms [2]. To quote Vitaly Bushaev’s article on Adam, “after a while people started noticing that despite superior training time, Adam in some areas does not converge to an optimal solution, so for some tasks (such as image classification on popular CIFAR datasets) state-of-the-art results are still only achieved by applying SGD with momentum.” [2] Though the exact reasons are beyond the scope of this article, others have shown that Adam may converge to sub-optimal solutions, even on convex functions.

Conclusions

Overall, we can conclude that:

You should tune your learning rate: it makes a large difference in your model’s performance, even more so than the choice of optimizer.
On our data, vanilla SGD performed the best, but Adam achieved performance that was almost as good, while converging more quickly.
It is worth trying out different values for rho in RMSprop and the beta values in Adam, even though Keras recommends using the default params.

Translated from: https://towardsdatascience.com/effect-of-gradient-descent-optimizers-on-neural-net-training-d44678d27060