Dealing with Imbalanced Data in Machine Learning


As an ML engineer or data scientist, sometimes you inevitably find yourself in a situation where you have hundreds of records for one class label and thousands of records for another class label.


Upon training your model, you obtain an accuracy above 90%. You then realize that the model is predicting everything as if it belongs to the class with the majority of records. Excellent examples of this are fraud detection and churn prediction problems, where the majority of the records are in the negative class. What do you do in such a scenario? That will be the focus of this post.


Collect More Data

The most straightforward and obvious thing to do is to collect more data, especially data points on the minority class. This will obviously improve the performance of the model. However, this is not always possible. Apart from the cost one would have to incur, sometimes it's not feasible to collect more data. For example, in the case of churn prediction and fraud detection, you can’t just wait for more incidences to occur so that you can collect more data.


Consider Metrics Other than Accuracy

Accuracy is not a good way to measure the performance of a model when the class labels are imbalanced. In this case, it's prudent to consider other metrics such as precision, recall, and Area Under the Curve (AUC), to mention just a few.


Precision measures the ratio of true positives to all samples predicted as positive (the true positives plus the false positives). For example, out of all the people our model predicted would churn, how many actually churned?


Recall measures the ratio of true positives to the sum of the true positives and the false negatives. For example, the percentage of people who actually churned that our model correctly predicted would churn.


The AUC is obtained from the Receiver Operating Characteristic (ROC) curve. The curve is obtained by plotting the true positive rate against the false positive rate. The false positive rate is obtained by dividing the false positives by the sum of the false positives and the true negatives.


An AUC closer to one is better, since it indicates that the model is able to find the true positives.

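All three metrics can be computed directly with scikit-learn. The labels and scores below are made up purely for illustration (1 = churned):

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Hypothetical churn labels and model outputs, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]                        # hard class predictions
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.9, 0.8, 0.4, 0.3]  # predicted probabilities

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
auc = roc_auc_score(y_true, y_score)         # uses the scores, not the hard labels
```

Note that the AUC is computed from the predicted probabilities rather than the thresholded class labels, which is why it is insensitive to the choice of classification threshold.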


Emphasize the Minority Class

Another way to deal with imbalanced data is to have your model focus on the minority class. This can be done by computing the class weights. The model will focus on the class with a higher weight. Eventually, the model will be able to learn equally from both classes. The weights can be computed with the help of scikit-learn.


from sklearn.utils.class_weight import compute_class_weight

weights = compute_class_weight('balanced', classes=y.unique(), y=y)
# array([ 0.51722354, 15.01501502])
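The 'balanced' heuristic weights each class by n_samples / (n_classes * class_count). With counts of 9,667 negatives and 333 positives (the counts that would produce the array above; they are assumed here for illustration), the arithmetic checks out:

```python
# 'balanced' class weight = n_samples / (n_classes * class_count)
counts = {0: 9667, 1: 333}        # assumed class counts for this dataset
n_samples = sum(counts.values())  # 10000
weights = {c: n_samples / (len(counts) * n) for c, n in counts.items()}
print(weights)  # {0: 0.517..., 1: 15.015...}
```

So the minority class carries roughly 29x the weight of the majority class, which is what lets the model learn from both classes despite the imbalance.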

You can then pass these weights when training the model. For example, in the case of logistic regression:


class_weights = {
    0: 0.51722354,
    1: 15.01501502
}

lr = LogisticRegression(C=3.0, fit_intercept=True, warm_start=True, class_weight=class_weights)

Alternatively, you can pass class_weight as 'balanced' and the weights will be adjusted automatically.


lr = LogisticRegression(C=3.0, fit_intercept=True, warm_start=True, class_weight='balanced')

Here’s the ROC curve before the weights are adjusted.


And here’s the ROC curve after the weights have been adjusted. Note the AUC moved from 0.69 to 0.87.

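The exact numbers depend on the dataset (the 0.69-to-0.87 jump above is from the author's churn data), but the effect of class weights is easy to reproduce. The sketch below uses a synthetic ~97:3 dataset rather than the original data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced churn problem (~97:3 split).
X, y = make_classification(n_samples=10000, weights=[0.97], flip_y=0.05, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

results = {}
for cw in (None, 'balanced'):
    lr = LogisticRegression(class_weight=cw, max_iter=1000)
    lr.fit(X_train, y_train)
    results[cw] = {
        'recall': recall_score(y_test, lr.predict(X_test)),
        'auc': roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1]),
    }
```

Typically the balanced model recovers far more of the minority class (higher recall), at the cost of some extra false positives.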

Try Different Algorithms

As you focus on the right metrics for imbalanced data, you can also try out different algorithms. Generally, tree-based algorithms perform better on imbalanced data. Furthermore, some algorithms such as LightGBM have hyperparameters that can be tuned to indicate that the data is not balanced.

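As a sketch, scikit-learn's tree ensembles accept class weights directly (LightGBM exposes the analogous `is_unbalance` and `scale_pos_weight` parameters for the same purpose). The dataset here is synthetic, just to show the wiring:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic ~95:5 imbalanced dataset.
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)

# 'balanced_subsample' recomputes class weights within each bootstrap sample,
# so every tree sees the minority class with increased importance.
rf = RandomForestClassifier(n_estimators=100, class_weight='balanced_subsample', random_state=0)
rf.fit(X, y)
```

With LightGBM, a common convention is to set `scale_pos_weight` to the ratio of negative to positive samples, but as always it is worth tuning rather than fixing blindly.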

Generate Synthetic Data

You can also generate synthetic data to increase the number of records in the minority class, usually known as oversampling. This is usually done on the training set after the train-test split. In Python, this can be done using the imbalanced-learn (imblearn) package. One of the strategies that can be implemented from the package is known as the Synthetic Minority Over-sampling Technique (SMOTE). The technique is based on k-nearest neighbors.


When using SMOTE:


  • The first parameter is a float that indicates the ratio of the number of samples in the minority class to the number of samples in the majority class, once resampling has been done.


  • The number of neighbors to be used to generate the synthetic samples can be specified via the k_neighbors parameter.


from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy=0.8)
X_resampled, y_resampled = smote.fit_resample(X.values, y.values)
pd.Series(y_resampled).value_counts()
# 0    9667
# 1    7733
# dtype: int64

You can then fit your resampled data to your model.


model = LogisticRegression()
model.fit(X_resampled, y_resampled)
predictions = model.predict(X_test)

Undersample the Majority Class

You can also experiment on reducing the number of samples in the majority class. One such strategy that can be implemented is the NearMiss method. You can also specify the ratio just like in SMOTE, as well as the number of neighbors via n_neighbors.


from imblearn.under_sampling import NearMiss

underSample = NearMiss(sampling_strategy=0.3)
X_resampled, y_resampled = underSample.fit_resample(X.values, y.values)
pd.Series(y_resampled).value_counts()
# 0    1110
# 1     333
# dtype: int64

Final Thoughts

Other techniques that can be used include building an ensemble of weak learners to create a strong classifier. Metrics such as the precision-recall curve and the area under it (PR AUC) are also worth trying when the positive class is the most important.

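For the precision-recall variant, scikit-learn provides `precision_recall_curve` and `average_precision_score`; the labels and scores below are illustrative, not from a real model:

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

# Illustrative labels and predicted probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.9, 0.8, 0.4, 0.3]

# One (precision, recall) point per threshold; plot these to get the PR curve.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Average precision summarizes the curve as a single number.
pr_auc = average_precision_score(y_true, y_score)
```

Unlike the ROC curve, the PR curve ignores true negatives entirely, which is why it is often more informative when the positive class is rare.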

As always, you should experiment with different techniques and settle on the ones that give you the best results for your specific problems. Hopefully, this piece has given some insights on how to get started.


Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to exploring the emerging intersection of mobile app development and machine learning. We’re committed to supporting and inspiring developers and engineers from all walks of life.


Editorially independent, Heartbeat is sponsored and published by Fritz AI, the machine learning platform that helps developers teach devices to see, hear, sense, and think. We pay our contributors, and we don’t sell ads.


If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletters (Deep Learning Weekly and the Fritz AI Newsletter), join us on Slack, and follow Fritz AI on Twitter for all the latest in mobile machine learning.


Translated from: https://heartbeat.fritz.ai/dealing-with-imbalanced-data-in-machine-learning-18e45fea7bb5
