
Ensembles: the Only (Almost) Free Lunch in Machine Learning


A notebook accompanying this post can be found here.


I am grateful to Tetyana Drobot and Igor Pozdeev for their comments and suggestions.


Summary

In this post, I cover the somewhat overlooked topic of ensemble optimization. I begin with a brief overview of some common ensemble techniques and outline their weaknesses. I then introduce a simple ensemble optimization algorithm and demonstrate how to apply it to build ensembles of neural networks with Python and PyTorch. Towards the end of the post, I discuss the effectiveness of ensemble methods in deep learning in the context of the current literature on the loss surface geometry of neural networks.


Key takeaways:


  • Strong ensembles consist of models that are both accurate and diverse

  • There are ensemble methods that admit realistic target functions which are not suitable as direct optimization objectives for ML models (think of using cross-entropy for training while being interested in some other metric like accuracy)

  • Ensembling improves the performance of neural networks not only by dampening their inherent sensitivity to noise but also by combining qualitatively different and uncorrelated solutions


The post is organized as follows: I. Introduction; II. Ensemble Optimization; III. Building Ensembles of Neural Networks; IV. Ensembles of Neural Networks: the Role of Loss Surface Geometry; V. Conclusion.


I. Introduction

An ensemble is a collection of models designed to outperform every single one of them by combining their predictions. Strong ensembles comprise models that are accurate, performing well on their own, yet diverse in the sense of making different mistakes. This resonates with me deeply, as I am a finance professional—ensembling is akin to building a robust portfolio consisting of many individual assets and sacrificing higher expected returns on some of them in favor of an overall reduction in risk by diversifying investments. “Diversification is the only free lunch in finance” is the quote attributed to Harry Markowitz, the father of the Modern Portfolio Theory. Given that ensembling and diversification are conceptually related, and in some problems, the two are mathematically equivalent, I decided to give the post its title.


Why almost, though? Because there is always a lingering problem of computational cost, given how resource-hungry the most powerful models (yes, neural networks) are. In addition to that, ensembling can hurt the interpretability of more transparent machine learning algorithms like decision trees by blurring the decision boundaries of individual models — this point does not really apply to neural networks, for which the issue of interpretability arises already at the individual model level.


There are several approaches to building ensembles:


  • Bagging bootstraps the training set, estimates many copies of a model on the resulting samples, and then averages their predictions.


  • Boosting sequentially reweights the training samples, forcing the model to attend to the training examples with higher loss values.


  • Stacking uses a separate validation set to train a meta-model that combines predictions of multiple models.


See, for example, this post by Gilbert Tanner or the one by Joseph Rocca, or this post by Juhi Ramzai for an extensive overview of these methods.


Of course, the methods above come with some common problems. First, given a set of trained models, how do we select the ones that are most likely to generalize well? In the case of stacking, this question would read 'how do we reduce the number of ensemble candidates to a manageable amount so the stacking model can handle them without a large validation set or a high risk of overfitting?' Well, just pick the best-performing models and maybe apply weights inversely proportional to their loss, right? Wrong. Though often it is a good starting point. Recall that a good ensemble consists of both accurate and diverse models: pooling several highly accurate models with strongly correlated predictions would typically result in all models stepping on the same rake.


The second problem is more subtle. Often the machine learning algorithms we train are little more than glorified feature extractors, i.e. the objective in a real-life application might differ significantly from the loss function used to train a model. For instance, the cross-entropy loss is a staple in classification tasks in deep learning because of its differentiability and stable numerical behavior during optimization; however, depending on the domain, we might be interested in accuracy, F1 score or false negative rate. As a concrete example, consider classifying extreme weather events like floods or hurricanes, where the cost of making a Type II error (false negative) could be astronomically high, rendering even accuracy, let alone cross-entropy, useless as an evaluation metric. Similarly, in a regression setting, the common loss function is the mean squared error. In finance, for example, it is common to train the same model for every asset in the sample to predict the return over the next period, while in reality there are hundreds of assets in multiple portfolios with optimization objectives similar to the ones encountered in reinforcement learning and optimal control: multiple time horizons along with state and path dependencies. In any case, you are neither judged by nor compensated for a low MSE (unless you are in academia).


In this post I thoroughly discuss the ensemble optimization algorithm of Caruana et al. (2004) which addresses the problems outlined above. The algorithm can be broadly described as model-free greedy stacking, i.e. at every optimization step the algorithm either adds a new model to the ensemble or changes the weights of the current constituents minimizing the total loss without any overarching trainable model guiding the selection process. Equipped with several features allowing it to alleviate the overfitting problem, the Caruana et al. (2004) approach also allows building ensembles optimizing custom metrics that may differ from those used to train individual models, thus addressing the second problem. I further demonstrate how to apply the algorithm: first, to a simple example with a closed form solution and next, to a realistic problem by building an optimal ensemble of neural networks for the MNIST dataset (a complete PyTorch implementation can be found here). Towards the end of the post, I explore the mechanisms underpinning the effectiveness of ensembles in deep learning and discuss the current literature on the role of the loss surface geometry in the generalization properties of neural networks.


The remainder of the post is structured as follows: Section II presents the ensemble optimization approach of Caruana et al. (2004) and illustrates it with a simple numerical example. In Section III I optimize an ensemble of neural networks for the MNIST dataset (PyTorch implementation). Section IV briefly discusses the literature on the optimization landscape in deep learning and its impact on ensembling. Section V concludes.


II. Ensemble Optimization: the Caruana et al. (2004) Algorithm


The approach of Caruana et al. (2004) is rather straightforward. Given a set of trained models and their predictions on a validation set, a variant of their ensemble construction algorithm is as follows:


  • Set init_size, the number of models in the initial ensemble, and max_iter, the maximum number of iterations


  • Initialize the ensemble with the init_size best-performing models by averaging their predictions and computing the total ensemble loss


  • Add to the ensemble the model in the set (with replacement) which minimizes the total ensemble loss

  • Repeat the previous step until max_iter is reached


This version of the algorithm includes a couple of features designed to prevent overfitting the validation set. First, initializing the ensemble with several well-performing models forms a strong initial ensemble; second, drawing models with replacement practically guarantees that the ensemble loss on the validation set does not increase as the algorithm iterations progress — if adding another model cannot further improve the ensemble loss, the algorithm adds copies of the incumbent models, essentially adjusting their weights in the final prediction. This weight adjustment property allows thinking of the algorithm as model-free stacking. Another interesting feature of this approach is that the loss functions used for ensemble construction and for training individual models are not required to be the same: as mentioned earlier, we often train models with a particular loss function because of its mathematical or computational convenience, in the (reasonable) hope that the models will generalize well under a related performance metric which is hard to optimize directly. Indeed, the value of cross-entropy on the test set in a malignant tumor classification task should not be our primary concern, in contrast to, for instance, the false negative rate.


    The following Python function implements the algorithm:

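A minimal sketch of such a selector follows; the complete implementation lives in the accompanying notebook, so treat the details here as illustrative. The signature and the return types (a pandas DataFrame of per-step weights and a Series of per-step losses) simply mirror the way the function is used later in the post.

import numpy as np
import pandas as pd

def ensemble_selector(loss_function, y_hats, y_true,
                      init_size=1, replacement=True, max_iter=10):
    """Greedy ensemble selection in the spirit of Caruana et al. (2004).

    loss_function -- callable mapping (predictions, targets) to a scalar to be minimized
    y_hats        -- dict of {model name: validation predictions}
    y_true        -- validation targets (e.g. one-hot labels)
    init_size     -- number of best single models used to initialize the ensemble
    replacement   -- whether a model may enter the ensemble more than once
    max_iter      -- number of greedy steps
    """
    def ensemble_loss(members):
        # the ensemble prediction is a simple average over its (possibly repeated) members
        return loss_function(np.mean([y_hats[m] for m in members], axis=0), y_true)

    def weights(members):
        # a model's weight is its share of slots in the ensemble
        return pd.Series({m: members.count(m) / len(members) for m in y_hats}, dtype=float)

    # seed the ensemble with the init_size best single models
    single_losses = {m: loss_function(p, y_true) for m, p in y_hats.items()}
    ensemble = sorted(single_losses, key=single_losses.get)[:init_size]

    loss_path, weight_path = [ensemble_loss(ensemble)], [weights(ensemble)]
    for _ in range(max_iter):
        candidates = list(y_hats) if replacement else [m for m in y_hats if m not in ensemble]
        # greedily add the candidate that yields the lowest ensemble loss
        trial = {m: ensemble_loss(ensemble + [m]) for m in candidates}
        best = min(trial, key=trial.get)
        ensemble.append(best)
        loss_path.append(trial[best])
        weight_path.append(weights(ensemble))

    return pd.DataFrame(weight_path), pd.Series(loss_path, name="ensemble_loss")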

Caruana et al. (2004) ensemble selection algorithm

Consider the following toy example: assume we have 10 models with zero-mean, normally distributed, uncorrelated predictions. Furthermore, assume that the variance of the predictions decreases linearly from 10 to 1, i.e. the first model has the highest variance and the last model has the lowest. Given a sample of data, the goal is to build an ensemble minimizing the mean squared error against the ground truth of 0. Note that in the context of the Caruana et al. (2004) algorithm, 'build an ensemble' means assigning a weight between 0 and 1 to each model's predictions such that the weighted prediction minimizes the MSE, subject to the constraint that all weights sum up to 1.


Finance aficionados will recognize a special case of the minimum-variance optimization problem by thinking of the models' predictions as returns on some assets, and of the optimization objective as minimizing portfolio variance. The problem has a closed form solution:

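w = Σ⁻¹1 / (1ᵀ Σ⁻¹ 1)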

where w is the vector of model weights, Σ is the variance-covariance matrix of predictions, and 1 is a vector of ones. In our case the predictions are uncorrelated and the off-diagonal elements of Σ are zero. The following code snippet solves this toy problem using both the ensemble_selector function and the analytical approach; it also constructs a simple ensemble by averaging the predictions:

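The original snippet is not reproduced here; a minimal reconstruction along the lines of the ensemble_selector sketch above could look like this (the sample size, seed and number of iterations are illustrative):

rng = np.random.default_rng(0)
n_obs, n_models = 10_000, 10
variances = np.linspace(10, 1, n_models)              # model 0 is the noisiest, model 9 the least noisy
y_hats_toy = {f"M{i}": rng.normal(0.0, np.sqrt(v), n_obs) for i, v in enumerate(variances)}
y_true_toy = np.zeros(n_obs)                          # the ground truth is simply zero

def mse(y_hat, y_true):
    return float(np.mean((y_hat - y_true) ** 2))

# greedy ensemble optimization
model_weights_toy, ensemble_loss_toy = ensemble_selector(
    loss_function=mse, y_hats=y_hats_toy, y_true=y_true_toy,
    init_size=1, replacement=True, max_iter=100)

# closed-form minimum-variance weights (uncorrelated predictions -> diagonal covariance matrix)
sigma_inv = np.diag(1.0 / variances)
ones = np.ones(n_models)
w_analytical = sigma_inv @ ones / (ones @ sigma_inv @ ones)

# naive ensemble: equal weight for every model
naive_loss = mse(np.mean(list(y_hats_toy.values()), axis=0), y_true_toy)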

    Figure 1 below compares weights implied by the ensemble optimization (in blue) and the closed form solution (in orange). The results match pretty closely, especially given that to compute the analytical solution we use the true variances and not the sample estimates. Note, that although the models with low prediction uncertainty receive higher weights, the weights of the high uncertainty models do not go to zero: the predictions are uncorrelated, and we can always reduce the variance of a weighted sum of random variables by adding an uncorrelated variable (with finite variance, of course).


Figure 1: Estimated vs Theoretical Optimal Weights

    The solid blue line on the next figure plots the ensemble loss for the first 25 iterations of the algorithm. The dashed black and red lines represent, respectively, the loss achieved by the best single model and by a simple ensemble that averages the predictions of all models. After approximately five iterations the optimized ensemble beats the naive one achieving significantly lower MSE values thereafter.


Figure 2: Ensemble Loss vs Optimization Step

    What if the number of models in the pool is very large?


If the model pool is very large, some of the models could overfit the validation set purely by chance. Caruana et al. (2004) suggest using bagging to address this issue. In this case, the algorithm is applied to bags of M models randomly drawn from the pool with replacement, and the final predictions are averaged over the individual bags. For example, with a probability of 25% for a model to be drawn into a bag and 20 bags, the chance that any particular model will not appear in any of the bags is only around 0.3% (0.75^20 ≈ 0.003).

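A sketch of this bagged variant is given below; because the per-bag ensemble is a weighted average of predictions, averaging the per-bag weights is equivalent to averaging the bags' predictions. The bag size, the number of bags and the helper's name are illustrative assumptions.

def bagged_ensemble_selector(loss_function, y_hats, y_true,
                             n_bags=20, bag_frac=0.25, seed=0, **selector_kwargs):
    """Run the greedy selector on random bags of models and average the resulting weights."""
    rng = np.random.default_rng(seed)
    names = list(y_hats)
    bag_size = max(1, int(bag_frac * len(names)))
    per_bag_weights = []
    for _ in range(n_bags):
        bag = set(rng.choice(names, size=bag_size, replace=True))
        weights, _ = ensemble_selector(loss_function,
                                       {m: y_hats[m] for m in bag},
                                       y_true, **selector_kwargs)
        per_bag_weights.append(weights.iloc[-1])      # final-step weights for this bag
    # models absent from a bag implicitly get zero weight; average across bags and renormalize
    avg = pd.concat(per_bag_weights, axis=1).fillna(0.0).mean(axis=1)
    return avg / avg.sum()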

III. Building Ensembles of Neural Networks: an MNIST Example

Equipped with the techniques from the previous section, we will now apply them to a realistic task: building and optimizing an ensemble of neural networks on the MNIST dataset. The results of this section can be completely replicated using the accompanying notebook; therefore, I restrict the code snippets in this section to a minimum, focusing primarily on ensembling rather than on model definitions and training.


We start with a simple MLP having 3 hidden layers of 100 units each with ReLU activations. Naturally, the input for the MNIST dataset is a 28x28 pixel image flattened into a 784-dimensional vector, and the output layer has 10 units corresponding to the number of digits. Therefore, the architecture specified by the MNISTMLP class implemented in PyTorch looks as follows:


MNISTMLP(
  (layers): Sequential(
    (0): Linear(in_features=784, out_features=100, bias=True)
    (1): ReLU()
    (2): Linear(in_features=100, out_features=100, bias=True)
    (3): ReLU()
    (4): Linear(in_features=100, out_features=100, bias=True)
    (5): ReLU()
    (6): Linear(in_features=100, out_features=10, bias=True)
  )
)
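One way such a module could be defined is sketched below; the actual class lives in the accompanying notebook, so the argument names here are illustrative.

import torch.nn as nn

class MNISTMLP(nn.Module):
    def __init__(self, in_features=784, hidden=100, n_classes=10):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):
        # flatten the 28x28 images into 784-dimensional vectors before the first layer
        return self.layers(x.view(x.size(0), -1))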

We then train 10 instances of the model with independent weight initializations (i.e. everything is identical except for the starting weights) for 3 epochs each, with a batch size of 32 and a learning rate of 0.001, reserving 25% of the training set of 60,000 images for validation, with the final 10,000 images comprising the test set. The objective is to minimize the cross-entropy (equivalently, the negative log-likelihood). Note that only 3 epochs of training, together with the rather small capacity of each model, will likely result in underfitting the data, which allows the benefits of ensembling to be demonstrated in a more dramatic fashion.


After training is complete, we restore the best checkpoint (by validation loss) of each of the 10 models. The left panel of the figure below shows the validation (in blue) and test (in orange) loss for each model, named M0 through M9. Similarly, the right panel plots the validation and test accuracy.


Figure 3: Validation and Test Set Performance by Model

As expected, all models perform rather poorly, with the best one, M7, achieving only 96.8% accuracy on the test set.


To build an optimal ensemble, let us first call the ensemble_selector function defined in the previous section and then go over the individual arguments in the context of the current problem:

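A plausible call looks as follows; the keyword names of the prediction and label arguments follow the sketch from Section II, while the remaining arguments are spelled out below:

model_weights, ensemble_loss = ensemble_selector(
    loss_function=cross_entropy,    # defined below
    y_hats=y_hats_val,              # dict: model name -> predicted class probabilities
    y_true=y_true_one_hot_val,      # one-hot encoded validation labels
    init_size=1,
    replacement=True,
    max_iter=10,
)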

y_hats_val is a dictionary with the model names as keys and the predicted class probabilities on the validation set as values:


>>> y_hats_val["M0"].round(3)
array([[0.   , 0.   , 0.   , ..., 0.998, 0.   , 0.001],
       [0.   , 0.003, 0.995, ..., 0.   , 0.001, 0.   ],
       [0.   , 0.   , 0.   , ..., 0.004, 0.   , 0.975],
       ...,
       [0.999, 0.   , 0.   , ..., 0.   , 0.   , 0.   ],
       [0.   , 0.   , 1.   , ..., 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , ..., 0.   , 0.007, 0.   ]])

>>> y_hats_val["M7"].round(3)
array([[0.   , 0.   , 0.   , ..., 1.   , 0.   , 0.   ],
       [0.   , 0.   , 1.   , ..., 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , ..., 0.003, 0.   , 0.981],
       ...,
       [0.997, 0.   , 0.002, ..., 0.   , 0.   , 0.   ],
       [0.   , 0.   , 1.   , ..., 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , ..., 0.   , 0.002, 0.   ]])

    y_true_one_hot_val is a numpy array of the corresponding true one-hot encoded labels:


>>> y_true_one_hot_val
array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

    The loss_function is a callable mapping arrays of predictions and labels to a scalar:


    >>> cross_entropy(y_hats_val["M7"].round(3), y_true_one_hot_val)
    0.010982255936197028
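The cross_entropy helper itself is not shown in the post; a simple numpy version consistent with the call above might be:

def cross_entropy(y_hat, y_true, eps=1e-12):
    # average negative log-likelihood of the true class, with clipping for numerical safety
    return float(-np.mean(np.sum(y_true * np.log(np.clip(y_hat, eps, 1.0)), axis=1)))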

    Finally, init_size=1 means that we start with an ensemble of a single model; replacement=True means that the models are not removed from the model pool after being added to the ensemble, allowing the algorithm to add the same model several times, thus adjusting the weights of the ensemble constituents; max_iter=10 sets the number of steps the algorithm takes.


Let us now examine the outputs. model_weights is a pandas dataframe containing the ensemble weight of each model at each optimization step. Dropping the models that have zero weight at every optimization step yields:


>>> model_weights.loc[:, (model_weights != 0).any()]
          M1        M4        M5        M7        M9
0   0.000000  0.000000  0.000000  1.000000  0.000000
1   0.000000  0.000000  0.500000  0.500000  0.000000
2   0.000000  0.000000  0.333333  0.333333  0.333333
3   0.000000  0.250000  0.250000  0.250000  0.250000
4   0.200000  0.200000  0.200000  0.200000  0.200000
5   0.166667  0.166667  0.166667  0.333333  0.166667
6   0.142857  0.142857  0.285714  0.285714  0.142857
7   0.125000  0.250000  0.250000  0.250000  0.125000
8   0.111111  0.222222  0.222222  0.222222  0.222222
9   0.100000  0.200000  0.200000  0.300000  0.200000
10  0.181818  0.181818  0.181818  0.272727  0.181818

The following figure plots the weights of the ensemble constituents as a function of the optimization step, with a darker hue corresponding to a higher average weight a model receives across all optimization steps. The ensemble initializes with the single strongest model, M7, at step 0, and then progressively adds more models, assigning an equal weight to each: at step 1 there are two models, M7 and M5, with a 50% weight each; at step 2 the ensemble includes models M7, M5 and M9, each having a weight of one third. After step 4 no new model can further improve the ensemble predictions, and the algorithm starts to adjust the weights of its constituents.


Figure 4: Ensemble Weights

The other output, ensemble_loss, contains the loss of the ensemble at each optimization step. Similar to Figure 2 from the previous section, the left panel of the figure below plots the ensemble loss on the validation set (solid blue line) as the optimization progresses. The dashed black and red lines represent, respectively, the validation loss achieved by the best single model and by a simple ensemble which assigns equal weights to all models. The ensemble loss decreases quite rapidly, surpassing the performance of its simple counterpart after a couple of iterations and stabilizing after the algorithm enters the weight adjustment mode, which is hardly surprising given that the model pool is rather small. The right panel reports the results for the test set: at each iteration I use the current ensemble weights to produce predictions and measure the loss on the test set. The ensemble generalizes well on the test sample, effectively repeating the pattern observed on the validation set.


Figure 5: Ensemble Loss, MNIST

    The Caruana et al. (2004) algorithm is very flexible and we can easily adapt ensemble_selector to, for instance, directly optimize the accuracy by changing the loss_function argument:

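One plausible adaptation is sketched below; since the selector as sketched above minimizes its objective, the metric is negated so that maximizing accuracy becomes a minimization:

model_weights_acc, ensemble_neg_acc = ensemble_selector(
    loss_function=lambda y_hat, y: -accuracy(y_hat, y),   # negated so that lower is better
    y_hats=y_hats_val,
    y_true=y_true_one_hot_val,
    init_size=1,
    replacement=True,
    max_iter=10,
)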

    where accuracy is defined as follows:

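A minimal definition consistent with the one-hot labels used above:

def accuracy(y_hat, y_true):
    # fraction of samples for which the most probable class matches the label
    return float(np.mean(y_hat.argmax(axis=1) == y_true.argmax(axis=1)))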

The following figure repeats the analysis of the previous one, but this time for the validation and test accuracy. The conclusions are similar, although the accuracy path of the ensemble is more volatile in both samples.


Figure 6: Ensemble Accuracy, MNIST

IV. More on Ensembling in Neural Networks: Importance of Loss Surface Geometry

    Why do random initializations work?


    The short answer — it is all about the loss surface. The current deep learning research emphasizes the importance of the optimization landscape. For instance, batch normalization (Ioffe and Szegedy (2015)) is traditionally thought to accelerate and regularize training by reducing internal covariate shift — the change in the distribution of network activations during training. However, Santurkar et al. (2018) provide a compelling argument that the success of the technique stems from another property: batch normalization makes the optimization landscape significantly smoother and thus stabilizes the gradients and speeds up training. In a similar vein, Keskar et al. (2016) argue that sharp minima on the loss surface have poor generalization properties in comparison with minima in flatter regions of the landscape.


During training, a neural network can be viewed as a function mapping parameters to loss values given the training data. The figure below plots a (very) simplified illustration of a network's loss landscape: the space of solutions and the loss are along the horizontal and vertical axes, respectively. Each point on the x-axis represents all weights and biases of the network yielding the corresponding loss (the blue solid line). The red dots show local minima where we are likely to end up using gradient-based optimization (the two leftmost dots are the global minima).


In the context of ensembling, this means that we would like to explore many local minima. In the previous section we already saw that combining different initializations of the same neural network architecture results in superior generalization. In fact, in their recent paper, Fort et al. (2019) demonstrate that random initializations end up in distant optima and are therefore capable of exploring completely different models with similar accuracy and relatively uncorrelated predictions, thus forming strong ensemble components. This finding complements the standard intuition of neural networks as the ultimate low-bias, high-variance algorithms, capable of fitting anything with almost surgical precision albeit plagued by their sensitivity to noise, and therefore benefiting from ensembling through variance reduction.


    But what to do if training several copies of the same model is infeasible?


Huang et al. (2017) propose building an ensemble during a single training run using a cyclical learning rate with annealing, storing a model checkpoint, or snapshot, at the end of each cycle. Intuitively, increasing the learning rate allows the model to escape one of the local minima in the figure above and land in a neighboring region with a different local minimum, to which it eventually converges as the learning rate is decreased again. The next figure illustrates the snapshot ensemble technique. The left plot shows the path over the loss landscape that a model traverses during the standard training regime with a constant learning rate, with the final stopping point marked with a blue flag. The right plot depicts the path under a cyclical learning rate schedule, with periodic snapshots marked with red flags.


Snapshot Ensembles. Source: Huang et al. (2017)
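A minimal sketch of the idea in PyTorch, not the authors' implementation: train with a cosine-annealed, warm-restarted learning rate and store a snapshot of the weights at the end of every cycle (model, optimizer, train_loader and loss_fn are assumed to exist; the cycle length and count are illustrative):

import torch

cycle_len, n_cycles = 5, 6   # epochs per cycle, number of snapshots
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=cycle_len)
snapshots = []
for epoch in range(cycle_len * n_cycles):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()                                  # one scheduler step per epoch
    if (epoch + 1) % cycle_len == 0:                  # end of a cycle: learning rate is at its minimum
        snapshots.append({k: v.detach().clone() for k, v in model.state_dict().items()})
# at prediction time, load each snapshot in turn and average the resulting predictions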

Remember, however, that in the two previous figures the whole parameter space is compressed into a single point on the x-axis and the xy-plane, respectively, meaning that a pair of neighboring points on the graph might be very far apart in the actual parameter space. The ability of a gradient descent algorithm to traverse multiple minima without getting stuck therefore depends on whether the corresponding valleys on the loss surface are separated by regions of such high loss that no reasonable increase in the learning rate would produce a transition to a new valley.


    Fortunately, Garipov et al. (2018) demonstrate that there exist low-loss paths connecting the local minima on the optimization landscape and propose the fast geometric ensembling (FGE) procedure exploiting these connections. Izmailov et al. (2018) propose a further refinement of FGE — stochastic weight averaging (SWA).


    Max Pechyonkin provides an excellent overview of snapshot ensembles, FGE, and SWA.


V. Conclusion

    Let us recap the key takeaways of this post:


    • Strong ensembles consist of models that are both accurate and diverse

    • Some model-free ensemble methods like the Caruana et al. (2004) algorithm admit realistic target functions which are not suitable as optimization objectives for ML models


    • Ensembling improves the performance of neural networks not only by dampening their inherent sensitivity to noise but also by combining qualitatively different and uncorrelated solutions


To sum up, ensemble learning techniques should arguably be among the most important tools in the arsenal of every machine learning practitioner. It is indeed quite fascinating how far the old adage of not putting all your eggs in one basket goes.


    Thank you for reading. Comments and feedback are eagerly anticipated. Also, connect with me on LinkedIn.


Further reading

Translated from: https://towardsdatascience.com/ensembles-the-almost-free-lunch-in-machine-learning-91af7ebe5090
