Gradient Descent
Introduction
Gradient Descent is a first-order iterative optimization algorithm. In Machine Learning, optimization usually means minimizing a cost function J(w) parameterized by the predictive model's parameters. By "first-order" we mean that Gradient Descent takes into account only the first derivative when performing updates of the parameters.
No matter what, at some point in your Machine Learning or Deep Learning journey you will hear about something called Gradient Descent. It is a vital piece of the puzzle for many Machine Learning algorithms, and I highly recommend that practitioners not treat it as a black box.
In order to minimize the cost function, we aim to find the global minimum, which is quite feasible when the objective function is convex. In many scenarios, however, such as deep learning tasks, the objective function tends to be non-convex; there, finding a sufficiently low value of the objective function (a good local minimum) is generally regarded as an acceptable solution.
Figure 1: Convex function and non-convex function examples.

To find a local minimum of the function, we take steps proportional to the negative of the gradient of the function at the current point (Source: Wikipedia). Put simply, we start from a random point on the objective function and move in the negative gradient direction towards the global/local minimum.
Figure 2: (Source: MLextend)

There are many different adaptations that can be made to Gradient Descent to make it run more efficiently in different scenarios. Each adaptation has its own pros and cons, as we will share below:
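Before looking at the variants, the core update rule w ← w − lr · ∇J(w) can be sketched in a few lines of Python. The toy cost J(w) = (w − 3)², the learning rate, and the iteration count below are my own illustrative choices, not from the original article:

```python
import numpy as np

def gradient_descent(grad, w0, lr=0.1, n_iters=100):
    """Generic first-order update: w <- w - lr * grad(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        w = w - lr * grad(w)
    return w

# Toy convex cost J(w) = (w - 3)^2 with gradient 2 * (w - 3);
# starting from w = 0, the iterates move towards the minimizer w = 3.
w_star = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_star)  # approaches 3
```

All three variants below are this same update; they differ only in how many observations contribute to each evaluation of the gradient.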
Batch Gradient Descent
Batch Gradient Descent sums over all observations on each iteration. In other words, it calculates the error for each observation in the batch (remember, this is the full training data) and updates the predictive model only after all observations have been evaluated. A more technical way to say this is that Batch Gradient Descent performs parameter updates at the end of each epoch (one epoch is one pass through the entire training data).
Figure 3: 2D representation of Batch Gradient Descent approaching and converging at the global minimum.

Pros
- More stable convergence and error gradient than Stochastic Gradient Descent
- Embraces the benefits of vectorization
- Takes a more direct path towards the minimum
- Computationally efficient, since parameter updates happen only once per epoch
Cons
- Can converge to local minima and saddle points
- Slower learning, since an update is performed only after all observations have been processed
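As a rough sketch of the epoch-level update described above, here is Batch Gradient Descent for a least-squares cost. The synthetic data, learning rate, and epoch count are illustrative assumptions of mine, not from the original article:

```python
import numpy as np

def batch_gd(X, y, lr=0.1, epochs=1000):
    """One parameter update per epoch, using the gradient over ALL m observations."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        grad = (2.0 / m) * X.T @ (X @ w - y)  # full-batch gradient of the MSE cost
        w -= lr * grad                        # update only after seeing every observation
    return w

# Noiseless synthetic regression problem with known weights [2, -1].
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = X @ np.array([2.0, -1.0])
w = batch_gd(X, y)
print(w)  # approaches [2, -1]
```

Note how the whole data matrix participates in a single vectorized gradient computation per epoch, which is where the vectorization benefit listed above comes from.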
Mini-Batch Gradient Descent
Whereas Batch Gradient Descent sums over all observations on each iteration, Mini-Batch Gradient Descent sums over a smaller number of samples (a mini-batch of the samples) on each iteration. This variant reduces the variance of the gradient, since we sum over a designated number of samples (depending on the mini-batch size) on each update.
Note: This variant of Gradient Descent is often the recommended technique among Deep Learning practitioners, but we must keep in mind that it introduces an extra hyperparameter, the batch size.
Figure 4: 2D representation of Mini-Batch Gradient Descent approaching the minimum (Source: https://engmrk.com/mini-batch-gd/).

Pros
- Convergence is more stable than Stochastic Gradient Descent
- Computationally efficient
- Faster learning, since we perform more frequent updates
Cons
- We have to configure the mini-batch size hyperparameter
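A minimal sketch of the mini-batch variant for the same kind of least-squares problem. The batch size, learning rate, epoch count, and synthetic data are illustrative assumptions:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.05, epochs=200, batch_size=32, seed=0):
    """One parameter update per mini-batch of `batch_size` observations."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        idx = rng.permutation(m)              # reshuffle the data once per epoch
        for start in range(0, m, batch_size):
            b = idx[start:start + batch_size]
            grad = (2.0 / len(b)) * X[b].T @ (X[b] @ w - y[b])
            w -= lr * grad                    # update after each mini-batch, not each epoch
    return w

# Noiseless synthetic regression problem with known weights [2, -1].
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = X @ np.array([2.0, -1.0])
w = minibatch_gd(X, y)
print(w)  # approaches [2, -1]
```

With m = 200 and batch_size = 32 this performs seven updates per epoch instead of one, which is the source of the faster learning listed above.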
Stochastic Gradient Descent
Stochastic Gradient Descent computes the error of an individual observation and performs an update to the model on each observation. This is the same as setting the number of mini-batches equal to m, where m is the number of observations (i.e. a batch size of 1).
Figure 5: 2D representation of Stochastic Gradient Descent approaching the minimum (Source: https://engmrk.com/mini-batch-gd/).

Pros
- Only a single observation is processed at a time, so it is easier to fit into memory
- On a large dataset, it may reach the neighborhood of the minimum (and begin to oscillate) faster than Batch Gradient Descent
- The frequent updates create plenty of oscillation, which can be helpful for escaping local minima
Cons
- Can veer off in the wrong direction due to the frequent updates
- Loses the benefits of vectorization, since we process one observation at a time
- The frequent updates are computationally expensive
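Setting the batch size to 1 gives Stochastic Gradient Descent. A minimal sketch under the same illustrative least-squares assumptions as the earlier examples:

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=50, seed=0):
    """One parameter update per individual observation (mini-batch size = 1)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):          # visit observations in a random order
            err = X[i] @ w - y[i]             # error of a single observation
            w -= lr * 2.0 * err * X[i]        # gradient of that one squared error
    return w

# Noiseless synthetic regression problem with known weights [2, -1].
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 2))
y = X @ np.array([2.0, -1.0])
w = sgd(X, y)
print(w)  # approaches [2, -1]
```

The inner loop touches one row of X at a time, which makes each step cheap on memory but forfeits the vectorized matrix products the batch variants enjoy.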
Wrap Up
Optimization is a major part of Machine Learning and Deep Learning. A simple and very popular optimization procedure employed with many Machine Learning algorithms is called Gradient Descent, and there are three ways we can adapt it to perform in a way that suits our needs.
Let’s continue the Conversation on LinkedIn!
翻譯自: https://towardsdatascience.com/gradient-descent-811efcc9f1d5