Guide to Classification on Imbalanced Datasets


Balance within the imbalance to balance what’s imbalanced — Amadou Jarou Bah


Disclaimer: This is a comprehensive tutorial on handling imbalanced datasets. Whilst these approaches remain valid for multiclass classification, the main focus of this article will be on binary classification for simplicity.


Introduction

As any seasoned data scientist or statistician will be aware of, datasets are rarely distributed evenly across attributes of interest. Let’s imagine we are tasked with discovering fraudulent credit card transactions — naturally, the vast majority of these transactions will be legitimate, and only a very small proportion will be fraudulent. Similarly, if we are testing individuals for cancer, or for the presence of a virus (COVID-19 included), the positive rate will (hopefully) be only a small fraction of those tested. More examples include:


  • An e-commerce company predicting which users will buy items on their platform
  • A manufacturing company analyzing produced materials for defects
  • Spam email filtering trying to differentiate ‘ham’ from ‘spam’
  • Intrusion detection systems examining network traffic for malware signatures or atypical port activity
  • Companies predicting churn rates amongst their customers
  • Number of clients who closed a specific account in a bank or financial organization
  • Prediction of telecommunications equipment failures
  • Detection of oil spills from satellite images
  • Insurance risk modeling
  • Hardware fault detection

One usually has far fewer data points from the adverse class. This is unfortunate, as we care a lot about avoiding misclassifying elements of this class.


In actual fact, it is pretty rare to have perfectly balanced data in classification tasks. Oftentimes the items we are interested in analyzing are inherently ‘rare’ events, which by their very rarity are difficult to predict. This presents a curious problem for aspiring data scientists, since many data science programs do not properly address how to handle imbalanced datasets despite their prevalence in industry.


When does a dataset become ‘imbalanced’?

The notion of an imbalanced dataset is a somewhat vague one. Generally, a dataset for binary classification with a 49–51 split between the two variables would not be considered imbalanced. However, if we have a dataset with a 90–10 split, it seems obvious to us that this is an imbalanced dataset. Clearly, the boundary for imbalanced data lies somewhere between these two extremes.


In some sense, the term ‘imbalanced’ is a subjective one and it is left to the discretion of the data scientist. In general, a dataset is considered to be imbalanced when standard classification algorithms — which are inherently biased towards the majority class (further details in a previous article) — return suboptimal solutions because of that bias. A data scientist may look at a 45–55 split dataset and judge that this is close enough that measures do not need to be taken to correct for the imbalance. However, the more imbalanced the dataset becomes, the greater the need to correct for this imbalance.


In a concept-learning problem, the data set is said to present a class imbalance if it contains many more examples of one class than the other.


As a result, these classifiers tend to ignore small classes while concentrating on classifying the large ones accurately.


Imagine you are working for Netflix and are tasked with determining which customers will churn (a customer ‘churning’ means they will stop using your services or using your products).


In an ideal world (at least for the data scientist), our training and testing datasets would be close to fully balanced, with around 50% of the dataset containing individuals that will churn and 50% who will not. In this case, a 90% accuracy will more or less indicate a 90% accuracy on both the positively and negatively classed groups. Our errors will be evenly split across both groups. In addition, we have roughly the same number of points in both classes, which, by the law of large numbers, reduces the overall variance within each class. This is great for us: accuracy is an informative metric in this situation and we can continue with our analysis unimpeded.


A dataset with an even 50–50 split across the binary response variable. There is no majority class in this example.

As you may have suspected, most people that already pay for Netflix don't have a 50% chance of stopping their subscription every month. In fact, the percentage of people that will churn is rather small, closer to a 90–10 split. How does the presence of this dataset imbalance complicate matters?


Assuming a 90–10 split, we now have a very different data story to tell. Giving this data to an algorithm without any further consideration will likely result in an accuracy close to 90%. This seems pretty good, right? It’s about the same as what we got previously. If you try putting this model into production your boss will probably not be so happy.


An imbalanced dataset with a 90–10 split. False positives will be much larger than false negatives. Variance in the minority set will be larger due to fewer data points. The majority class will dominate algorithmic predictions without any correction for imbalance.

Given the prevalence of the majority class (the 90% class), our algorithm will likely regress to a prediction of the majority class. The algorithm can pretty closely maximize its accuracy (our scoring metric of choice) by arbitrarily predicting that the majority class occurs every time. This is a trivial result and provides close to zero predictive power.


(Left) A balanced dataset with the same number of items in the positive and negative class; the number of false positives and false negatives in this scenario are roughly equivalent and result in little classification bias. (Right) An imbalanced dataset with around 5% of samples being in the negative class and 95% of samples being in the positive class (this could be the number of people that pay for Netflix that decide to quit during the next payment cycle).

Predictive accuracy, a popular choice for evaluating the performance of a classifier, might not be appropriate when the data is imbalanced and/or the costs of different errors vary markedly.

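To see this concretely, here is a minimal sketch with sklearn on a hypothetical synthetic 90–10 dataset: a trivial classifier that always predicts the majority class already scores close to 90% accuracy while having zero predictive power.

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 90-10 imbalanced binary dataset
X, y = make_classification(n_samples=10000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# A 'classifier' that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(accuracy_score(y_test, baseline.predict(X_test)))  # ~0.9, yet zero predictive power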

Visually, this dataset might look something like this:


Machine learning algorithms by default assume that data is balanced. In classification, this corresponds to a comparable number of instances of each class. Classifiers learn better from a balanced distribution. It is up to the data scientist to correct for imbalances, which can be done in multiple ways.


Different Types of Imbalance

We have clearly shown that imbalanced datasets pose some additional challenges compared to standard datasets. To further complicate matters, there are different types of imbalance that can occur in a dataset.


(1) Between-Class


A between-class imbalance occurs when there is an imbalance in the number of data points contained within each class. An example of this is shown below:


An illustration of between-class imbalance. We have a large number of data points for the red class but relatively few for the white class.

An example of this would be a mammography dataset, which uses images known as mammograms to predict breast cancer. Consider the number of mammograms related to positive and negative cancer diagnoses:


The vast majority of samples (>90%) are negative, whilst relatively few (<10%) are positive.

Note that given enough data samples in both classes the accuracy will improve as the sampling distribution is more representative of the data distribution, but by virtue of the law of large numbers, the majority class will have inherently better representation than the minority class.


(2) Within-Class


A within-class imbalance occurs when the dataset has balanced between-class data but one of the classes is not well represented in some regions. An example of this is shown below:


An illustration of within-class imbalance. We have a large number of data points for both classes but the number of data points in the white class in the top left corner is very sparse, which can result in similar complications as between-class imbalance for predictions in those regions.

(3) Intrinsic and Extrinsic


An intrinsic imbalance is due to the nature of the dataset, while extrinsic imbalance is related to time, storage, and other factors that limit the dataset or the data analysis. Intrinsic characteristics are relatively simple and are what we commonly see, but extrinsic imbalance can exist separately and can also work to increase the imbalance of a dataset.


For example, companies often use intrusion detection systems that analyze packets of data sent in and out of networks in order to detect malware or malicious activity. Depending on whether you analyze all data or just data sent through specific ports or specific devices, this will significantly influence the imbalance of the dataset (most network traffic is likely legitimate). Similarly, if log files or data packets related to suspected malicious behavior are commonly stored but normal logs are not (or only a select few types are stored), then this can also influence the imbalance of the dataset. Likewise, if logs were only stored during a normal working day (say, 9 AM–5 PM) instead of 24 hours, this will also affect the imbalance.


Further Complication of Imbalance

There are a couple more difficulties introduced by imbalanced datasets. Firstly, we have class overlapping. This is not always a problem, but it can often arise in imbalanced learning problems and cause headaches. Class overlapping is illustrated in the below dataset.


Example of class overlapping. Some of the positive data points (stars) are intermixed with the negative data points (circles), which would lead an algorithm to construct an imperfect decision boundary.

Class overlapping occurs in normal classification problems, so what is the additional issue here? Well, the class more represented in overlap regions tends to be better classified by methods based on global learning (on the full dataset). This is because the algorithm is able to get a more informed picture of the data distribution of the majority class.


In contrast, the class less represented in such regions tends to be better classified by local methods. If we take k-NN as an example, as the value of k increases the method becomes increasingly global, and as k decreases it becomes increasingly local. It can be shown that performance at low values of k is better on the minority class, and worse at high values of k. This shift in accuracy is not exhibited for the majority class because it is well-represented at all points.


This suggests that local methods may be better suited for studying the minority class. One method to correct for this is the CBO Method. The CBO Method uses cluster-based resampling to identify ‘rare’ cases and resample them individually, so as to avoid the creation of small disjuncts in the learned hypothesis. This is a method of oversampling — a topic that we will discuss in detail in the following section.


CBO Method. Once the training examples of each class have been clustered, oversampling starts. In the majority class, all the clusters, except for the largest one, are randomly oversampled so as to get the same number of training examples as the largest cluster.

Correcting Dataset Imbalance

There are several techniques to control for dataset imbalance. There are two main types of techniques to handle imbalanced datasets: sampling methods, and cost-sensitive methods.


The simplest and most commonly used of these are sampling methods called oversampling and undersampling, which we will go into more detail on.


Oversampling/Undersampling


Simply stated, oversampling involves generating new data points for the minority class, and undersampling involves removing data points from the majority class. This acts to somewhat reduce the extent of the imbalance in the dataset.


What does undersampling look like? We continually remove like-samples in close proximity until both classes have the same number of data points.


Undersampling. Imagine you are analysing a dataset for fraudulent transactions. Most of the transactions are not fraudulent, creating a fundamentally imbalanced dataset. In the scenario of undersampling, we will take fewer samples from the majority class to help reduce the extent of this imbalance.

Is undersampling a good idea? Undersampling is recommended by many statistical researchers but is only good if enough data points are available on the undersampled class. Also, since the majority class will end up with the same number of points as the minority class, the statistical properties of the distributions will become ‘looser’ in a sense. However, we have not artificially distorted the data distribution with this method by adding in artificial data points.


Illustration of undersampling. Like-samples in close proximity are removed in an attempt to increase the sparsity of the data distribution.
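As a minimal sketch, undersampling can be done with the imbalanced-learn package (assumed installed); note that its RandomUnderSampler drops majority samples at random rather than by proximity. Reusing the hypothetical X, y from the earlier sketch:

from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

# Randomly discard majority-class samples until the classes are balanced
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_under))  # both classes now have the same number of points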

What does oversampling look like? In short, the opposite of undersampling. We artificially add data points to our dataset to make the number of instances in each class balanced.


Oversampling. In the scenario of oversampling, we will oversample from the minority class to help reduce the extent of this imbalance.

How do we generate these samples? The most common way is to generate points that are close in dataspace proximity to existing samples or are ‘between’ two samples, as illustrated below.


Illustration of oversampling.

As you may have suspected, there are some downsides to adding artificial data points. Firstly, you risk overfitting, especially if synthetic points are generated around points that are noise — you end up exacerbating this noise by adding reinforced measurements. In addition, adding these values randomly can also contribute additional noise to our model.


SMOTE (Synthetic minority oversampling technique)


Luckily for us, we don’t have to write an algorithm for randomly generating data points for the purpose of oversampling. Instead, we can use the SMOTE algorithm.


How does SMOTE work? SMOTE generates new samples in between existing data points based on their local density and their borders with the other class. Not only does it perform oversampling, but it can subsequently use cleaning techniques (undersampling — more on this shortly) to remove redundancy at the end. Below is an illustration of how SMOTE works when studying class data.


An illustration of how SMOTE functions. The instance on the left is isolated and is thus considered noise by the algorithm. No additional data points are generated in its proximity, or, if they are, they will be in very close proximity to the singular point. The two clusters in the center and right have several data points, indicating that it is less likely that these points correspond to random noise. Thus, a larger cluster (empirical data distribution) can be drawn by the algorithm from which additional samples can be generated.

The algorithm for SMOTE is as follows. For each minority sample:


– Find its k-nearest minority neighbours
– Randomly select j of these neighbours
– Randomly generate synthetic samples along the lines joining the minority sample and its j selected neighbours (j depends on the amount of oversampling desired)
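In practice you rarely implement this by hand; a minimal sketch using the SMOTE implementation from the imbalanced-learn package (again on the hypothetical X, y from earlier) looks like this:

from collections import Counter
from imblearn.over_sampling import SMOTE

# k_neighbors is the k from the algorithm above
X_smote, y_smote = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_smote))  # minority class oversampled up to the majority count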

Informed vs. Random Oversampling


Using random oversampling (with replacement) of the minority class has the effect of making the decision region for the minority class very specific. In a decision tree, it would cause a new split and often lead to overfitting. SMOTE’s informed oversampling generalizes the decision region for the minority class. As a result, larger and less specific regions are learned, thus paying attention to minority class samples without causing overfitting.


Drawbacks of SMOTE


Overgeneralization. SMOTE’s procedure can be dangerous since it blindly generalizes the minority area without regard to the majority class. This strategy is particularly problematic in the case of highly skewed class distributions since, in such cases, the minority class is very sparse with respect to the majority class, thus resulting in a greater chance of class mixture.


Inflexibility. The number of synthetic samples generated by SMOTE is fixed in advance, thus not allowing for any flexibility in the re-balancing rate.


Another potential issue is that SMOTE might introduce artificial minority class examples too deeply into the majority class space. This drawback can be resolved by hybridization: combining SMOTE with undersampling algorithms. One of the most famous of these is Tomek Links. Tomek links are pairs of instances of opposite classes that are each other’s nearest neighbors. In other words, they are pairs of opposing instances that are very close together.


Tomek’s algorithm looks for such pairs and removes the majority instance of the pair. The idea is to clarify the border between the minority and majority classes, making the minority region(s) more distinct. Scikit-learn has no built-in modules for doing this, though there are some independent packages (e.g., TomekLink, imbalanced-learn).


Thus, Tomek’s algorithm is an undersampling technique that acts as a data cleaning method for SMOTE to regulate against redundancy. As you may have suspected, there are many additional undersampling techniques that can be combined with SMOTE to perform the same function. A comprehensive list of these functions can be found in the functions section of the imbalanced-learn documentation.

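As a sketch of this hybridization, imbalanced-learn provides a combined sampler that applies SMOTE and then removes Tomek links:

from imblearn.combine import SMOTETomek

# Oversample with SMOTE, then clean the class boundary by removing Tomek links
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)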

An additional example is Edited Nearest Neighbors (ENN). ENN removes any example whose class label differs from the classes of at least two of its neighbors. ENN removes more examples than Tomek links does and can also remove examples from both classes.


Other more nuanced versions of SMOTE include Borderline SMOTE, SVMSMOTE, and KMeansSMOTE, and more nuanced versions of the undersampling techniques applied in concert with SMOTE are Condensed Nearest Neighbor (CNN), Repeated Edited Nearest Neighbor, and Instance Hardness Threshold.


Cost-Sensitive Learning

We have discussed sampling techniques and are now ready to discuss cost-sensitive learning. In many ways, the two approaches are analogous — the main difference being that in cost-sensitive learning we perform under- and over-sampling by altering the relative weighting of individual samples.


Upweighting. Upweighting is analogous to oversampling and works by increasing the weight of one of the classes while keeping the weight of the other class at one.


Down-weighting. Down-weighting is analogous to undersampling and works by decreasing the weight of one of the classes while keeping the weight of the other class at one.


An example of how this can be performed using sklearn is via the sklearn.utils.class_weight module; the resulting weights can be applied to any sklearn classifier that accepts them (and within Keras):


import numpy as np
from sklearn.utils import class_weight
weights = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
class_weights = dict(enumerate(weights))  # Keras expects a {class index: weight} dict
model.fit(X_train, y_train, class_weight=class_weights)

In this case, we have set the weighting to be ‘balanced’, meaning that the classes receive weights inversely proportional to their relative number of points — this is what I would recommend unless you have a good reason for setting the values yourself. If you have three classes and want to weight one of them 10x higher and another 20x higher (because there are 10x and 20x fewer of these points in the dataset than the majority class), then we can rewrite this as:


class_weight = {0: 0.1, 1: 1.0, 2: 2.0}
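A dictionary like this can then be passed directly to estimators that accept a class_weight parameter — for example, a sketch with a random forest:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(class_weight=class_weight, random_state=0).fit(X_train, y_train)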

Some authors claim that cost-sensitive learning is slightly more effective than random or directed over- or under-sampling, although all approaches are helpful, and directed oversampling is close to cost-sensitive learning in efficacy. Personally, when I am working on a machine learning problem I will use cost-sensitive learning because it is much simpler to implement and communicate to others. However, there may be additional aspects of using sampling techniques that provide superior results of which I am not aware.


Assessment Metrics

In this section, I outline several metrics that can be used to analyze the performance of a classifier trained to solve a binary classification problem. These include (1) the confusion matrix, (2) binary classification metrics, (3) the receiver operating characteristic curve, and (4) the precision-recall curve.


Confusion Matrix

Despite what you may have garnered from its name, a confusion matrix is decidedly unconfusing. A confusion matrix is the most basic form of assessment of a binary classifier. Given the prediction outputs of our classifier and the true response variable, a confusion matrix tells us how many of our predictions are correct for each class, and how many are incorrect. The confusion matrix provides a simple visualization of the performance of a classifier based on these factors.


Here is an example of a confusion matrix:


Hopefully what this is showing is relatively clear. The TP cell tells us the number of true positives: the number of positive samples that I predicted were positive.

The TN cell tells us the number of true negatives: the number of negative samples that I predicted were negative.

The FP cell tells us the number of false positives: the number of negative samples that I predicted were positive.

The FN cell tells us the number of false negatives: the number of positive samples that I predicted were negative.

These numbers are very important as they form the basis of the binary classification metrics discussed next.

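Before moving on, a minimal sketch of extracting these four counts with sklearn (y_pred being hypothetical model predictions on binary 0/1 labels):

from sklearn.metrics import confusion_matrix

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()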

Binary Classification Metrics

There are a plethora of single-value metrics for binary classification. As such, only a few of the most commonly used ones and their different formulations are presented here; more details on scoring metrics, and on their relation to confusion matrices and ROC curves (discussed in the next section), can be found in the sklearn documentation.


Arguably the most important five metrics for binary classification are: (1) precision, (2) recall, (3) F1 score, (4) accuracy, and (5) specificity.

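For reference, in terms of the confusion-matrix counts from the previous section, these are defined as:

precision   = TP / (TP + FP)
recall      = TP / (TP + FN)
F1          = 2 × precision × recall / (precision + recall)
accuracy    = (TP + TN) / (TP + TN + FP + FN)
specificity = TN / (TN + FP)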

Precision. Precision provides us with the answer to the question “Of all my positive predictions, what proportion of them are correct?”. If you have an algorithm that predicts all of the positive class correctly but also has a large portion of false positives, the precision will be small. It makes sense why this is called precision since it is a measure of how ‘precise’ our predictions are.


Recall. Recall provides us with the answer to a different question: “Of all of the positive samples, what proportion did I predict correctly?”. Instead of false positives, we are now interested in false negatives. These are items that our algorithm missed, and they are often the most egregious errors (e.g. failing to diagnose someone who actually has cancer, failing to discover malware when it is present, or failing to spot a defective item). The name ‘recall’ also makes sense in this circumstance, as we are seeing how many of the samples the algorithm was able to pick up on.


It should be clear that these questions, whilst related, are substantially different to each other. It is possible to have a very high precision and simultaneously have a low recall, and vice versa. For example, if you predicted the majority class every time, you would have 100% recall on the majority class, but you would then get a lot of false positives from the minority class.


One other important point to make is that precision and recall can be determined for each individual class. That is, we can talk about the precision of class A, or the precision of class B, and they will have different values — when doing this, we assume that the class we are interested in is the positive class, regardless of its numeric value.



F1 Score. The F1 score is a single-value metric that combines precision and recall by using the harmonic mean (a fancy type of averaging). In its general form, the Fβ score, the β parameter is a strictly positive value that is used to describe the relative importance of recall to precision. A larger β value puts a higher emphasis on recall than precision, whilst a smaller value puts less emphasis. If the value is 1, precision and recall are treated with equal weighting.

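The accompanying formula image did not survive here; for reference, the standard definition is:

Fβ = (1 + β²) × precision × recall / (β² × precision + recall)

with the F1 score being the special case β = 1.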

What does a high F1 score mean? It suggests that both the precision and recall have high values — this is good and is what you would hope to see upon generating a well-functioning classification model on an imbalanced dataset. A low value indicates that either precision or recall is low, and may be a cause for concern. Good F1 scores are generally lower than good accuracies (in many situations, an F1 score of 0.5 would be considered pretty good, such as when predicting breast cancer from mammograms).


Specificity. Simply stated, specificity is the recall of the negative class. It answers the question “Of all of the negative samples, what proportion did I predict correctly?”. This may be important in situations where examining the relative proportion of false positives is necessary.


Macro, Micro, and Weighted Scores


This is where things get a little complicated. Anyone who has delved into these metrics on sklearn may have noticed that we can refer to the recall-macro or f1-weighted score.


A macro-F1 score is the average of F1 scores across each class.


This is most useful if we have many classes and we are interested in the average F1 score for each class. If you only care about the F1 score for one class, you probably won’t need a macro-F1 score.


A micro-F1 score takes all of the true positives, false positives, and false negatives from all the classes and calculates the F1 score.


The micro-F1 score is pretty similar in utility to the macro-F1 score, as it gives an aggregate performance of a classifier over multiple classes. That being said, they will give different results, and understanding the underlying difference between these results may be informative for a given application.


A weighted-F1 score is the same as the macro-F1 score, but each of the class-specific F1 scores is scaled by the relative number of samples from that class.

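The accompanying formula image is also missing; as a sketch, for classes c:

weighted-F1 = Σ_c N_c × F1_c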

In this case, N refers to the proportion of samples in the dataset belonging to a single class. For class A, where class A is the majority class, this might be equal to 0.8 (80%). The values for B and C might be 0.15 and 0.05, respectively.


For a highly imbalanced dataset, a large weighted-F1 score might be somewhat misleading because it is overly influenced by the majority class.

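In sklearn these variants are selected via the average parameter — a sketch, again with hypothetical y_test and y_pred:

from sklearn.metrics import classification_report, f1_score

print(f1_score(y_test, y_pred, average='macro'))     # unweighted mean of per-class F1 scores
print(f1_score(y_test, y_pred, average='micro'))     # F1 from pooled TP/FP/FN counts
print(f1_score(y_test, y_pred, average='weighted'))  # per-class F1 weighted by class support
print(classification_report(y_test, y_pred))         # per-class scores plus all three averages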

Other Metrics


Some other metrics that you may see around, and that can be informative for binary classification (and multiclass classification to some extent), are:


Accuracy. If you are reading this, I would imagine you are already familiar with accuracy, but perhaps not so familiar with the others. Cast in terms of the confusion matrix, the accuracy can be described as the ratio of correct predictions (positive and negative) to the total number of positive and negative samples.


G-Mean. A less common metric that is somewhat analogous to the F1 score is the G-Mean. This is often cast in two different formulations, the first being the precision-recall g-mean, and the second being the sensitivity-specificity g-mean. They can be used in a similar manner to the F1 score in terms of analyzing algorithmic performance. The precision-recall g-mean can also be referred to as the Fowlkes-Mallows Index.

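For reference, the two formulations are:

precision-recall g-mean        = √(precision × recall)
sensitivity-specificity g-mean = √(sensitivity × specificity)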

There are many other metrics that can be used, but most have specialized use cases and offer little additional utility over the metrics described here. Other metrics the reader may be interested in viewing are balanced accuracy, Matthews correlation coefficient, markedness, and informedness.


Receiver Operating Characteristic (ROC) Curve


An ROC curve is a two-dimensional graph that depicts the trade-off between benefits (true positives) and costs (false positives). It displays the relation between sensitivity and specificity for a given classifier (a binary, parameterized, or score-based classifier).


Here is an example of an ROC curve.


There is a lot to unpack here. Firstly, the dotted line through the center corresponds to a classifier that acts as a ‘coin flip’. That is, it is correct roughly 50% of the time and is the worst possible classifier (we are just guessing). This acts as our baseline, against which we can compare all other classifiers — these classifiers should be closer to the top left corner of the plot since we want high true positive rates in all cases.


It should be noted that an ROC curve does not assess a group of classifiers. Rather, it examines a single classifier over a set of classification thresholds.


What does this mean? It means that for one point, I take my classifier and set the threshold to be 0.3 (30% propensity) and then assess the true positive and false positive rates.


True Positive Rate: Percentage of true positives (to the sum of true positives and false negatives) generated by the combination of a specific classifier and classification threshold.


False Positive Rate: Percentage of false positives (to the sum of false positives and true negatives) generated by the combination of a specific classifier and classification threshold.


This gives me two numbers, which I can then plot on the curve. I then take another threshold, say 0.4, and repeat this process. After doing this for every threshold of interest (perhaps in 0.1, 0.01, or 0.001 increments), we have constructed an ROC curve for this classifier.


An example ROC curve showing how an individual point is plotted. A classifier is selected along with a classification threshold. Following this, the true positive rate and false positive rate for this combination of classifier and threshold are calculated and subsequently plotted.
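sklearn performs this threshold sweep for us; a minimal sketch, assuming y_score holds hypothetical predicted probabilities for the positive class:

from sklearn.metrics import roc_auc_score, roc_curve

fpr, tpr, thresholds = roc_curve(y_test, y_score)  # one (FPR, TPR) point per threshold
print(roc_auc_score(y_test, y_score))              # area under the resulting curve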

What is the point of doing this? Depending on your application, you may be very averse to false positives, as they may be very costly (e.g. launches of nuclear missiles), and thus would like a classifier that has a very low false-positive rate. Conversely, you may not care so much about having a high false positive rate as long as you get a high true positive rate (stopping most fraud events may be worth it even if you have to check many more occurrences that the algorithm flags). For the optimal balance between these two rates (where false positives and false negatives are equally costly), we would take the classification threshold which results in the minimum diagonal distance from the top left corner.


Why does the top left corner correspond to the ideal classifier? The ideal point on the ROC curve would be (0, 100%) — that is, all positive examples are classified correctly and no negative examples are misclassified as positive. In a perfect classifier, there would be no misclassification!


Whilst a graph may not seem particularly useful in itself, it is helpful in comparing classifiers. One particular metric, the Area Under Curve (AUC) score, allows us to compare classifiers by comparing the total area underneath the line produced on the ROC curve. For an ideal classifier, the AUC equals 1, since the curve spans the full range (1.0) of the true positive rate across the full range (1.0) of the false-positive rate. If a particular classifier has an AUC of 0.6 and another has an AUC of 0.8, the latter is clearly a better classifier. The AUC has the benefit that it is independent of the decision criteria — the classification threshold — and thus makes it easier to compare these classifiers.


A question may have come to mind now — what if some classifiers are better at lower thresholds and some are better at higher thresholds? This is where the ROC convex hull comes in. The convex hull provides us with a method of identifying potentially optimal classifiers — even though we may not have directly observed them, we can infer their existence. Consider the following diagram:


Source: Quora

Given a family of ROC curves, the ROC convex hull can include points that are more towards the top left corner (perfect classifier) of the ROC space. If a line passes through a point on the convex hull, then there is no other line with the same slope passing through another point with a larger true positive intercept. Thus, the classifier at that point is optimal under any distribution assumptions in tandem with that slope. This is perhaps easier to understand after examining the image.


How does undersampling/oversampling influence the ROC curve? A famous paper on SMOTE (discussed previously) titled “SMOTE: Synthetic Minority Over-sampling Technique” outlines that by undersampling the majority class, we force the ROC curve to move up and to the right, and thus has the potential to increase the AUC of a given classifier (this is essentially just validation that SMOTE functions correctly, as expected). Similarly, oversampling the minority class has a similar impact.


Source: Researchgate

Precision-Recall (PR) Curves


An analogous diagram to an ROC curve can be recast from ROC space and reformulated into PR space. These diagrams are in many ways analogous to the ROC curve, but instead of plotting recall against fallout (true positive rate vs. false positive rate), we are instead plotting precision against recall. This produces a somewhat mirror-image (the curve itself will look somewhat different) of the ROC curve, in the sense that the top right corner of a PR curve designates the ideal classifier. This can often be more understandable than an ROC curve but provides very similar information. The area under a PR curve is often called the average precision (or mAP, when averaged over multiple classes) and is analogous to the AUC in ROC space.


Source: Researchgate — Ten quick tips for machine learning in computational biology
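The corresponding sklearn sketch, reusing the hypothetical y_test and y_score from the ROC example:

from sklearn.metrics import average_precision_score, precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_score)
print(average_precision_score(y_test, y_score))  # area under the PR curve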

Final Comments

Imbalanced datasets are underrepresented (no pun intended) in many data science programs contrary to their prevalence and importance in many industrial machine learning applications. It is the job of the data scientist to be able to recognize when a dataset is imbalanced and follow procedures and utilize metrics that allow this imbalance to be sufficiently understood and controlled.


I hope that in the course of reading this article you have learned something about dealing with imbalanced datasets and will in the future be comfortable in the face of such imbalanced problems. If you are a serious data scientist, it is only a matter of time before one of these applications pops up!


Newsletter

For updates on new blog posts and extra content, sign up for my newsletter.


Original article: https://towardsdatascience.com/guide-to-classification-on-imbalanced-datasets-d6653aa5fa23

色在线观看网站 | 天天做天天爱天天爽综合网 | 99热这里精品 | 激情喷水 | 99精品系列 | 免费网站看v片在线a | 人人舔人人舔 | 免费三级网 | 久久成人国产精品一区二区 | 丁香五月网久久综合 | 久久国精品 | 欧美在线观看视频一区二区三区 | 天天鲁一鲁摸一摸爽一爽 | 91超级碰| 婷婷av资源 | 国产亚洲精品久久久网站好莱 | 日韩成人在线一区二区 | 国产流白浆高潮在线观看 | 97香蕉久久超级碰碰高清版 | 久久99久久99精品免观看粉嫩 | 国产麻豆精品传媒av国产下载 | 国产精品毛片一区二区三区 | 在线免费观看一区二区三区 | 色99视频 | 久久黄色a级片 | 丁香导航 | 成人四虎 | www.色综合.com | 国产精品久久久久久一区二区三区 | 在线观看av网 | 久久久久久久久久久久国产精品 | 91精品婷婷国产综合久久蝌蚪 | 免费试看一区 | 一区二区三区动漫 | 国产97视频 | 欧美日韩伦理在线 | 日韩久久电影 | 日韩欧美在线观看 | 香蕉视频国产在线 | 日韩在线观看视频中文字幕 | 欧美精品视| 久久视频免费在线观看 | 日韩免费一区二区三区 | 国产麻豆精品95视频 | 91高清免费看 | 日本视频网 | 99热精品视| 精品999在线 | 国产无套视频 | 91精品国产福利在线观看 | a视频在线播放 | 国产精品1000 | 免费观看成人av | 成人免费视频网站在线观看 | 免费观看av网站 | 国产精品自在线 | 日韩精品在线视频免费观看 | 国产特级毛片aaaaaa | 天天操夜夜曰 | 五月天丁香 | 亚洲成aⅴ人在线观看 | 黄色国产区 | 久草在线高清视频 | 中文字幕在线网 | 久久免费观看少妇a级毛片 久久久久成人免费 | 欧美精品做受xxx性少妇 | 国产伦精品一区二区三区在线 | 中日韩欧美精彩视频 | 精品美女久久久久久免费 | 国产精品久一 | 亚洲激情在线观看 | 久久久久久久99 | 97夜夜澡人人双人人人喊 | 久久艹欧美 | 中文字幕精品一区二区三区电影 | 亚洲成av人片在线观看www | 97偷拍在线视频 | 五月天婷婷免费视频 | 夜色资源站wwwcom | 免费在线激情电影 | 九九九在线 | 国产精品久久久久久久99 | 波多野结衣视频一区二区三区 | av中文电影 | 在线播放91|