Using Machine Learning to Predict Fitbit Sleep Scores
In Part 1 of this article I explained how we can obtain sleep data from Fitbit, load it into Python and preprocess the data to be ready for further analysis. In this part I will explain how and why we split the data into training, validation and test set, how we can select features for our Machine Learning models and then train three different models: Multiple Linear Regression, Random Forest Regressor and Extreme Gradient Boosting Regressor. I will briefly explain how these models work and define performance measures to compare their performance. Let’s get started.
Separating the data into training, validation and test set
Before we do any further analysis using our data we need to split the entire data set into three different subsets: training set, validation set and test set. The following image displays this process well:
(Image: training, validation and testing data)

The test set is also referred to as the hold-out set, and once we split it from the remaining data we do not touch it again until we have trained and tweaked our Machine Learning models to a point where we think they will perform well on data they have never seen before.
We split the remaining data into a training and a validation set. This allows us to train our models on the training data and then evaluate their performance on the validation data. In theory, we can then go and tweak our models and evaluate them on the validation data again and thereby find ways to improve model performance. This process often leads to overfitting, meaning that we focus too much on training our model in a way that it performs well on the validation set but it performs poorly when used on a data set that it has never seen before (such as the test set).
In part 3 of this article I explain how we can reduce overfitting while making sure that the models still perform well. For now, we will follow the above approach of a simple split of the data set into training, validation and test set.
I want to split the data in a way that the training set is made up of 60% of the total data set and the validation and test set are both made up of 20%. This code achieves the correct percentage splits:
In the first test split the test_size parameter is set to 0.2, which splits the data into 80% training data and 20% test data. In order to split the 80% training data into training and validation data and ensuring that the validation data is 20% of the size of the original data set the test_size parameter needs to be 0.25 (20% is one quarter, or 0.25, of 80%).
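The original code gist is not embedded in this copy; the split it describes can be sketched with scikit-learn's train_test_split, here on synthetic stand-in data (the real preprocessed Fitbit dataframe from Part 1 is not available):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed sleep data (100 nights, 6 features).
rng = np.random.default_rng(42)
X = rng.random((100, 6))
y = rng.integers(50, 100, size=100)  # Sleep Scores

# First split: 80% training+validation, 20% test (the hold-out set).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Second split: 0.25 of the remaining 80% equals 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```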
Before moving on I want to emphasise one important thing here. It is crucial to split the data before performing any further transformations, such as scaling, because we want to prevent any information about the test set from spilling over into our training and validation sets. Data scaling is often done using statistics about the data set as a whole, such as the mean and standard deviation. Because we want to be able to measure how well our Machine Learning models perform on data they have never seen before, we have to make sure that no information from the test data impacts how the scaling or any other transformation is done.
Scaling features, defining performance metrics and a baseline
Although for the Machine Learning models in this project feature scaling is not required, it is considered best practice to scale features when comparing different models and their performance.
In this code, I use MinMaxScaler, which I fit on the training data and then use to scale the training, validation and test data:
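The original scaling code is not embedded here; a sketch of the idea with synthetic stand-in arrays, assuming scikit-learn's MinMaxScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in feature arrays (unscaled, minutes-based features).
rng = np.random.default_rng(0)
X_train = rng.random((60, 4)) * 500
X_val = rng.random((20, 4)) * 500
X_test = rng.random((20, 4)) * 500

# Fit the scaler on the training data ONLY, then apply the same scaling
# to the validation and test sets so no test-set information leaks in.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
```

Note that `X_val_scaled` and `X_test_scaled` can fall slightly outside [0, 1], because the minimum and maximum used for scaling come from the training data alone.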
Performance measures
Next, let’s define some performance measures that we can use to evaluate our models and compare them. Because Sleep Score is a continuous variable (although only integer Sleep Scores are possible), the problem at hand is a regression problem. For regression problems there are many different measures of performance, and in this analysis I will use Mean Absolute Error, Mean Squared Error and R-Squared. Additionally, I compute an accuracy measure for the models’ predictions.
Accuracy is typically used as a performance measure in classification problems rather than regression problems because it refers to the proportion of correct predictions that the model makes. The way I use accuracy for the regression models in this analysis is different. Here, accuracy measures how far off (in percentage terms) the predicted Sleep Score is from the actual Sleep Score, on average. For example, if the actual Sleep Score is 80 and the model has an accuracy of 96%, meaning that on average it is 4% off, the model is expected to predict a Sleep Score in the range of 76.8 (80 - (80 x 0.04)) to 83.2 (80 + (80 x 0.04)).
Here is the function that evaluates a model’s performance that takes as inputs the model at hand, the test features and the test labels:
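The original helper function is not embedded in this copy; a sketch consistent with the measures described above (MAE, MSE, R-squared, plus the percentage-based accuracy) might look like this — the function name and return values are my assumptions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(model, features, labels):
    """Evaluate a fitted model: MAE, MSE, R-squared and the 'accuracy'
    described above (100% minus the mean absolute percentage error)."""
    predictions = model.predict(features)
    mae = mean_absolute_error(labels, predictions)
    mse = mean_squared_error(labels, predictions)
    r2 = r2_score(labels, predictions)
    mape = np.mean(np.abs((labels - predictions) / labels)) * 100
    accuracy = 100 - mape
    print(f"MAE: {mae:.2f}  MSE: {mse:.2f}  R2: {r2:.3f}  Accuracy: {accuracy:.2f}%")
    return mae, mse, r2, accuracy
```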
But how do we know what scores are good or bad for these different measures? For example, is an accuracy of 90% good or bad? What about R-squared? In order to have a reference point we will first come up with a baseline model that we can compare all later models and their performance to.
Baseline performance
In order to evaluate the Machine Learning models we are about to build we want to have a baseline that we can compare their performance to. Generally, a baseline is a simplistic approach that generates predictions based on a simple rule. For our analysis, the baseline model always predicts the median Sleep Score of the training set. If our Machine Learning model is not able to outperform this simple baseline it would be rather useless.
Let’s see what the performance of the baseline looks like:
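The baseline computation is not embedded in this copy; a minimal sketch of a median-predicting baseline, evaluated with the same measures on synthetic stand-in scores:

```python
import numpy as np

# Synthetic stand-in Sleep Scores.
rng = np.random.default_rng(1)
y_train = rng.integers(60, 95, size=60).astype(float)
y_val = rng.integers(60, 95, size=20).astype(float)

# The baseline always predicts the median Sleep Score of the training set.
predictions = np.full_like(y_val, np.median(y_train))

mae = np.mean(np.abs(y_val - predictions))
mse = np.mean((y_val - predictions) ** 2)
r2 = 1 - np.sum((y_val - predictions) ** 2) / np.sum((y_val - y_val.mean()) ** 2)
accuracy = 100 - np.mean(np.abs((y_val - predictions) / y_val)) * 100
print(f"MAE: {mae:.2f}  MSE: {mse:.2f}  R2: {r2:.3f}  Accuracy: {accuracy:.2f}%")
```

Because a constant predictor can never beat the validation mean in squared error, its R-squared is guaranteed to be at most zero.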
While the accuracy may seem decent, looking at the other performance measures tells a very different story. The R-squared is negative, which is a strong indication of extremely poor model performance.
Now that we have split our data into different subsets, scaled the features, defined performance metrics and come up with a baseline model, we are almost ready to start training and evaluating our Machine Learning models. Before we move on to our models, let’s first select the features that we want to use in them.
Feature Selection using Lasso Regression
There are two questions that you might have after reading that heading: Why do we need to select features and what the hell is Lasso Regression?
Feature Selection
There are multiple reasons for selecting only a subset of the available features.
Firstly, feature selection enables the Machine Learning algorithm to train faster because it is using less data. Secondly, it reduces model complexity and makes it easier to interpret the model. In our case this will be important because apart from predicting Sleep Scores accurately we also want to be able to understand how the different features impact the Sleep Score. Thirdly, feature selection can reduce overfitting and thereby improve the prediction performance of the model.
In part 1 of this article we saw that many of the features in the sleep data set are highly correlated, meaning that the more features we use the more multicollinearity will be present in the model. This is generally speaking not an issue if we only care about prediction performance of the model but it is an issue if we want to be able to interpret the model. Feature selection will also help reduce some of that multicollinearity.
For more information on feature selection see this article.
Lasso Regression
Before we move on to Lasso Regression let’s briefly recap what a linear regression does. Fitting a linear regression minimises a loss function by choosing coefficients for each feature variable. One problem with that is that large coefficients can lead to overfitting, meaning that the model will perform well on the training data but poorly on data it has never seen before. This is where regularisation comes in.
Lasso Regression is a type of regularisation regression that penalises the absolute size of the regression coefficients through an additional term in the loss function. The loss function for a Lasso regression can be written like this:
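The equation image did not survive in this copy; the standard Lasso loss, consistent with the description that follows, is:

```latex
\min_{\beta_0,\,\beta} \;\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2} \;+\; \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert
```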
(Equation image: loss function for Lasso Regression)

The first part of the loss function is equivalent to the loss function of a linear regression, which minimises the sum of squared residuals. The additional part is the penalty term, which penalises the absolute values of the coefficients. Mathematically, this is equivalent to minimising the sum of squared residuals subject to the constraint that the sum of the absolute coefficient values has to be less than a prespecified parameter. This parameter determines the amount of regularisation and causes some coefficients to be shrunk to close to, or exactly, zero.
In the above equation, λ is the tuning parameter which determines the strength of the penalty, i.e. the amount of shrinkage. Setting λ=0 would result in the loss function for a linear regression and as λ increases, more and more coefficients are set to zero and the remaining coefficients are therefore “selected” by the Lasso Regression as being important.
Fitting a Lasso regression on the training data and plotting the resulting coefficients looks like this:
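The fitting code and coefficient plot are not embedded here; a sketch with scikit-learn's Lasso on synthetic stand-in data (the feature names come from the article; the data itself is made up, constructed so that Time in Bed and Minutes Light Sleep carry little signal):

```python
import numpy as np
from sklearn.linear_model import Lasso

feature_names = ["Minutes Asleep", "Minutes Awake", "Time in Bed",
                 "Minutes REM Sleep", "Minutes Light Sleep", "Minutes Deep Sleep"]

# Synthetic stand-in: scaled features with weak true effects for
# Time in Bed (index 2) and Minutes Light Sleep (index 4).
rng = np.random.default_rng(7)
X_train = rng.random((60, 6))
true_coefs = np.array([20.0, -10.0, 0.5, 15.0, 0.5, 10.0])
y_train = 50 + X_train @ true_coefs + rng.normal(0, 1, size=60)

# alpha plays the role of lambda: larger values shrink more coefficients to zero.
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

for name, coef in zip(feature_names, lasso.coef_):
    print(f"{name:>20}: {coef: .3f}")
```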
The Lasso Regression algorithm has reduced the coefficients of Time in Bed and Minutes Light Sleep to close to zero, deeming them less important than the other four features. This comes in handy as we would face major multicollinearity issues if we included all of the features in our models. Let’s drop these two features from our data sets:
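The drop itself is a one-liner if the data lives in pandas DataFrames (the column names are assumptions carried over from Part 1, and the single row is a hypothetical example):

```python
import pandas as pd

cols = ["Minutes Asleep", "Minutes Awake", "Time in Bed",
        "Minutes REM Sleep", "Minutes Light Sleep", "Minutes Deep Sleep"]
# Hypothetical single-row example standing in for the real feature sets.
X_train = pd.DataFrame([[400, 30, 450, 90, 250, 60]], columns=cols)

drop_cols = ["Time in Bed", "Minutes Light Sleep"]
X_train = X_train.drop(columns=drop_cols)
print(list(X_train.columns))
```

In the real project the same drop would be applied to the training, validation and test sets alike.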
Now that we have selected a set of four features we can move on to building some Machine Learning models that will use those four features to predict Sleep Scores.
Multiple Linear Regression
In summary, Multiple Linear Regression (MLR) is used to estimate the relationship between one dependent variable and two or more independent variables. In our case, it will be used to estimate the relationship between Sleep Score and Minutes Asleep, Minutes Awake, Minutes REM Sleep and Minutes Deep Sleep. Note that MLR assumes that the relationship between these variables is linear.
Let’s train an MLR model and evaluate its performance:
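The original training code is not shown in this copy; a sketch with scikit-learn's LinearRegression on synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic stand-in for the four selected (scaled) features.
rng = np.random.default_rng(3)
true_coefs = np.array([20.0, -10.0, 15.0, 10.0])
X_train = rng.random((60, 4))
y_train = 50 + X_train @ true_coefs + rng.normal(0, 2, size=60)
X_val = rng.random((20, 4))
y_val = 50 + X_val @ true_coefs + rng.normal(0, 2, size=20)

mlr = LinearRegression()
mlr.fit(X_train, y_train)
predictions = mlr.predict(X_val)
print(f"MAE: {mean_absolute_error(y_val, predictions):.2f}  "
      f"R2: {r2_score(y_val, predictions):.3f}")
```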
All performance measures are substantially better than those of the baseline model (thank god). Especially the accuracy seems to be really high but this can be misleading, which is why it is important to consider multiple measures. One of the most important measures for regression performance is the R-squared. Generally speaking, the R-squared measures the proportion of the variance of the dependent variable that is explained by the independent variables. Hence, in our case it is a measure of how much of the variance in Sleep Scores is explained by our features. A value of roughly 0.76 is decent already but let’s see if we can do better by using different models.
Regression statistics
Before we move on to other Machine Learning models I would like to take a look at the regression output for the Multiple Linear Regression on our training data:
A few things to note regarding the regression output:
The regression output provides a good starting point for understanding how the different sleep statistics may affect Sleep Score. More time asleep increases Sleep Score. This makes sense because more sleep (up until a certain point) will generally be beneficial. Similarly, more time spent in REM and Deep Sleep increase the Sleep Score as well. This also makes sense because both of these sleep stages provide important restorative benefits. For the computation of Sleep Score, Fitbit seems to consider REM sleep to be more important than Deep sleep (higher magnitude of the coefficient), which to me is one of the most interesting outcomes of the regression analysis. Finally, more time awake decreases Sleep Score. Again, that makes perfect sense because spending more time awake during one’s sleep window indicates restlessness and takes away from the restorative powers that time spent asleep provides.
For those people that are interested in understanding the importance of different sleep stages and of sleep in general, I highly recommend “Why We Sleep” by Matthew Walker. It is a brilliantly written book with fascinating experiments and insights!
All that being said, it is important to note that the interpretability of the above output is somewhat limited because of the correlation that is present between features. In Multiple Linear Regression, the coefficient tells you how much the dependent variable is expected to increase when that independent variable increases by one unit, holding all the other independent variables constant. In our case, because the independent variables are correlated, we could not expect one variable to change without the others changing and therefore cannot reliably interpret the coefficients in this way. Always look out for multicollinearity when interpreting your models!
Let’s see if other Machine Learning models perform better than Multiple Linear Regression.
Random Forest Regressor
Random Forests are one of the most popular Machine Learning models because of their ability to perform well on both classification and regression problems. In summary, a Random Forest is an ensemble technique that leverages multiple decision trees through Bootstrap Aggregation, also called “bagging”. What exactly does that mean?
In order to understand this better we first need to understand how Decision Tree Regression works.
Decision Tree Regression
As the name suggests, decision trees build prediction models in form of a tree structure that may look like this:
(Image: decision tree for predicting hours played)

In the above example the decision tree iteratively splits the data set based on various features in order to come up with a prediction of how many hours will be spent playing. But how does the tree know which features to split on first and which ones to split on further down the tree? After all, the predictions could be different if we changed the sequence of the features used to make the splits.
In a regression problem, the most common way to decide what feature to split the dataset on at a specific node is Mean Squared Error (MSE). The decision tree tries out different features that it can use to split the data set and computes the resulting MSEs. The feature that leads to the lowest MSE is chosen for the split at hand. This process is continued until the tree reaches a leaf (an end point) or a predetermined maximum depth. Maximum depths can be used to reduce overfitting because if a decision tree is allowed to continue until it finds a leaf, it may strongly overfit to the training data. Using maximum depths in this way is referred to as “pruning” of the tree.
There are two major limitations with decision trees: a single tree tends to overfit the training data, and its predictions can change drastically with small changes in that data, making it a weak predictor on its own.

Random Forests address both of those limitations.
Random Forests
As the “Forest” in Random Forest suggests, these models are made up of many decision trees, and their predictions are made by averaging the predictions of each decision tree in the forest. Think of it as a democracy: having only one person vote on an important issue may not be representative of how the entire community really feels, but collecting votes from many randomly selected members of the community may provide an accurate representation.
But what exactly does the “Random” in Random Forest represent?
In a Random Forest, every decision tree is created using a randomly chosen subset of the data points in the training set. This way every tree is different but all trees are still created from a portion of the same training data. Subsets are randomly selected with replacement, meaning that data points are “put back in the bag” and can be picked again for another decision tree.
In addition to choosing different random subsets for each tree, the decision trees in a Random Forest only consider a subset of randomly selected features at each split. The best feature is chosen for the split at hand and at the next node, a new set of random features is evaluated, etc.
By constructing decision trees using these “bagging” techniques, Random Forests address the limitations of individual decision trees well and manage to turn what would be a weak predictor in isolation into a strong predictor in a group, similar to the voting example.
Random Forest Regression in Python
Using the scikit-learn library in Python, most Machine Learning models are built in the same way. First, you initiate the model, then you train it on the training set and then evaluate it on the validation set. Here is the code:
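A sketch of that initiate/train/evaluate pattern with RandomForestRegressor, again on synthetic stand-in data (the hyperparameters shown are defaults, not the article's):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(8)
true_coefs = np.array([20.0, -10.0, 15.0, 10.0])
X_train = rng.random((60, 4))
y_train = 50 + X_train @ true_coefs + rng.normal(0, 2, size=60)
X_val = rng.random((20, 4))
y_val = 50 + X_val @ true_coefs + rng.normal(0, 2, size=20)

# 1) initiate the model, 2) train it, 3) evaluate on the validation set.
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
predictions = rf.predict(X_val)
print(f"R2: {r2_score(y_val, predictions):.3f}")
```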
Similar to the Multiple Linear Regression, the Random Forest performs vastly better than the baseline model. That being said, its R-squared and accuracy are lower than that of the MLR. So, what is all the hype around Random Forests about?
The answer to that question can be found here (hint: Hyperparameter Optimisation):
Extreme Gradient Boosting Regressor
Similar to Random Forests, Gradient Boosting is an ensemble learner, meaning that it creates a final model based on a collection of individual models, usually decision trees. What is different in the case of Gradient Boosting compared to Random Forests is the type of ensemble method. Random Forests use “Bagging” (described previously) and Gradient Boosting uses “Boosting”.
Gradient Boosting
The general idea behind Gradient Boosting is that the individual models are built sequentially by putting more weight on instances with wrong predictions and high errors. The model therefore “learns from its past mistakes”.
The model minimises a cost function through gradient descent. In each round of training, the weak learner (decision tree) makes a prediction, which is compared to the actual outcome. The distance between prediction and actual outcome represents the error of the model. The errors can then be used to calculate the gradient, i.e. the partial derivative of the loss function, to figure out in which direction to change the model parameters in order to reduce the error. The below graph visualises how this works:
(Image: gradient descent)

The rate at which these adjustments are made (the “Incremental Step” in the graph above) can be set through the hyperparameter “learning rate”.
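In symbols, a single gradient-descent step updates the parameters in the direction opposite the gradient of the loss, scaled by the learning rate $\eta$:

```latex
\theta_{t+1} = \theta_t - \eta \,\nabla_{\theta} L(\theta_t)
```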
Extreme Gradient Boosting
Extreme Gradient Boosting improves upon Gradient Boosting by computing the second partial derivative of the cost function, which aids in getting to the minimum of the cost function, as well as using advanced regularisation similar to that described using Lasso Regression, which improves model generalisation.
In Python, training and evaluating Extreme Gradient Boosting Regressor follows the same fitting and scoring process as the Random Forest Regressor:
The performance metrics are extremely close to that of the Random Forest, i.e. it performs decently but still not as well as our good old Multiple Linear Regression.
Where to go from here?
So far, we have not provided any hyperparameters in the Random Forest or Extreme Gradient Boosting Regressor. The respective libraries provide sensible default values for the hyperparameters of each model but there is no one-size-fits-all. By tweaking some of the hyperparameters we could potentially greatly improve the performance of these two models.
Furthermore, for our performance evaluation so far we have only relied on the models’ performances on one relatively small validation set. The performance is therefore highly dependent on how representative this validation set is of sleep data as a whole.
In the third part of this article I address both of these issues and boost the performance of the Random Forest and the Extreme Gradient Boosting Regressor. See here:
Translated from: https://towardsdatascience.com/using-machine-learning-to-predict-fitbit-sleep-scores-496a7d9ec48