Dealing with Imbalanced Data in Machine Learning
As an ML engineer or data scientist, you will sometimes find yourself in a situation where you have hundreds of records for one class label and thousands for another.
Upon training your model you obtain an accuracy above 90%. You then realize that the model is predicting everything as if it’s in the class with the majority of records. Excellent examples of this are fraud detection problems and churn prediction problems, where the majority of the records are in the negative class. What do you do in such a scenario? That will be the focus of this post.
Collect More Data
The most straightforward and obvious thing to do is to collect more data, especially data points on the minority class. This will obviously improve the performance of the model. However, this is not always possible. Apart from the cost one would have to incur, sometimes it's not feasible to collect more data. For example, in the case of churn prediction and fraud detection, you can’t just wait for more incidences to occur so that you can collect more data.
Consider Metrics Other than Accuracy
Accuracy is not a good way to measure the performance of a model where the class labels are imbalanced. In this case, it's prudent to consider other metrics such as precision, recall, Area Under the Curve (AUC) — just to mention a few.
Precision measures the ratio of true positives to all samples predicted positive (true positives plus false positives). For example, out of the number of people our model predicted would churn, how many actually churned?
Recall measures the ratio of true positives to the sum of true positives and false negatives. For example, of the people who actually churned, what percentage did our model predict would churn?
The AUC is obtained from the Receiver Operating Characteristics (ROC) curve. The curve is obtained by plotting the true positive rate against the false positive rate. The false positive rate is obtained by dividing the false positives by the sum of the false positives and the true negatives.
An AUC closer to one is better, since it indicates that the model is able to find the true positives while keeping the false positive rate low.
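As a quick sketch, these metrics can be computed with scikit-learn; the labels and scores below are made up purely for illustration:

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Hypothetical ground-truth labels, hard predictions, and
# positive-class probability scores (illustrative values only)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
y_scores = [0.1, 0.2, 0.15, 0.3, 0.25, 0.6, 0.9, 0.8, 0.4, 0.35]

precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)
auc = roc_auc_score(y_true, y_scores)         # needs scores, not hard labels

print(precision, recall, auc)
```

Note how the ROC AUC is computed from the probability scores rather than the hard 0/1 predictions.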
Emphasize the Minority Class
Another way to deal with imbalanced data is to have your model focus on the minority class. This can be done by computing the class weights. The model will focus on the class with a higher weight. Eventually, the model will be able to learn equally from both classes. The weights can be computed with the help of scikit-learn.
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

# y holds the class labels
weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
weights
array([ 0.51722354, 15.01501502])
You can then pass these weights when training the model. For example, in the case of logistic regression:
from sklearn.linear_model import LogisticRegression

class_weights = {0: 0.51722354,
                 1: 15.01501502}

lr = LogisticRegression(C=3.0, fit_intercept=True, warm_start=True, class_weight=class_weights)
Alternatively, you can pass class_weight='balanced' and the weights will be adjusted automatically.
lr = LogisticRegression(C=3.0, fit_intercept=True, warm_start=True, class_weight='balanced')

Here's the ROC curve before the weights are adjusted.
And here’s the ROC curve after the weights have been adjusted. Note the AUC moved from 0.69 to 0.87.
Try Different Algorithms
As you focus on the right metrics for imbalanced data, you can also try out different algorithms. Generally, tree-based algorithms perform better on imbalanced data. Furthermore, some algorithms such as LightGBM have hyperparameters that can be tuned to indicate that the data is not balanced.
Generate Synthetic Data
You can also generate synthetic data to increase the number of records in the minority class, a strategy usually known as oversampling. This should be done on the training set only, after the train/test split. In Python, this can be done using the imbalanced-learn (imblearn) package. One of the strategies implemented in the package is the Synthetic Minority Over-sampling Technique (SMOTE), which is based on k-nearest neighbors.
When using SMOTE:
The first parameter (sampling_strategy in recent versions of imblearn) is a float that indicates the ratio of the number of samples in the minority class to the number of samples in the majority class after resampling.
The number of neighbors to be used to generate the synthetic samples can be specified via the k_neighbors parameter.
You can then fit your resampled data to your model.
model = LogisticRegression()
model.fit(X_resampled, y_resampled)
predictions = model.predict(X_test)

Undersample the Majority Class
You can also experiment with reducing the number of samples in the majority class. One such strategy is the NearMiss method. Just as with SMOTE, you can specify the resampling ratio, as well as the number of neighbors via n_neighbors.
from imblearn.under_sampling import NearMiss

underSample = NearMiss(sampling_strategy=0.3)
# resample the training split (X_train/y_train from the earlier split)
X_resampled, y_resampled = underSample.fit_resample(X_train, y_train)
pd.Series(y_resampled).value_counts()

0    1110
1     333
dtype: int64

Final Thoughts
Other techniques include building an ensemble of weak learners to create a strong classifier. Metrics such as the precision-recall curve and the area under it (PR AUC) are also worth trying when the positive class is the most important.
As always, you should experiment with different techniques and settle on the ones that give you the best results for your specific problems. Hopefully, this piece has given some insights on how to get started.
Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to exploring the emerging intersection of mobile app development and machine learning. We’re committed to supporting and inspiring developers and engineers from all walks of life.
Editorially independent, Heartbeat is sponsored and published by Fritz AI, the machine learning platform that helps developers teach devices to see, hear, sense, and think. We pay our contributors, and we don’t sell ads.
If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletters (Deep Learning Weekly and the Fritz AI Newsletter), join us on Slack, and follow Fritz AI on Twitter for all the latest in mobile machine learning.
Translated from: https://heartbeat.fritz.ai/dealing-with-imbalanced-data-in-machine-learning-18e45fea7bb5