Machine Learning Notes (36): Classifier Performance Metrics and the F1-score

Published: 2025/4/16

When training models with scikit-learn, Python's machine learning library, the F1-score is a common way to measure model performance. This post reviews that metric and the ones related to it.

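Overview

The purpose of model evaluation and the general evaluation procedure
The uses and limitations of classification accuracy
How a confusion matrix describes a classifier's performance
How the metrics derived from a confusion matrix are computed
Adjusting classifier performance by changing the classification threshold
What ROC curves are for
How the Area Under the Curve (AUC) differs from classification accuracy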

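1. Review

Model evaluation is used to choose a suitable model among different model types, tuning parameters, and feature sets. We therefore need an evaluation procedure that estimates how well a trained model generalizes to out-of-sample data, and an appropriate evaluation metric to quantify the model's performance.

For the evaluation procedure, K-fold cross-validation was covered earlier. For evaluation metrics, regression problems can use Mean Absolute Error, Mean Squared Error, or Root Mean Squared Error, while classification problems can use classification accuracy and the metrics introduced in this post.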

2. Classification accuracy

We use the Pima Indians Diabetes dataset, which contains health measurements and diabetes status for 768 patients.

In [1]:
    # read the data into a Pandas DataFrame
    import pandas as pd
    url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
    col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
    pima = pd.read_csv(url, header=None, names=col_names)

In [2]:
    # print the first 5 rows of data
    pima.head()

Out[2]:
       pregnant  glucose  bp  skin  insulin   bmi  pedigree  age  label
    0         6      148  72    35        0  33.6     0.627   50      1
    1         1       85  66    29        0  26.6     0.351   31      0
    2         8      183  64     0        0  23.3     0.672   32      1
    3         1       89  66    23       94  28.1     0.167   21      0
    4         0      137  40    35      168  43.1     2.288   33      1

In the label column of the table above, 1 means the patient has diabetes and 0 means the patient does not.

In [3]:
    # define X and y
    feature_cols = ['pregnant', 'insulin', 'bmi', 'age']
    X = pima[feature_cols]
    y = pima.label

In [4]:
    # split X and y into training and testing sets
    # (train_test_split now lives in sklearn.model_selection; sklearn.cross_validation was removed)
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [5]:
    # train a logistic regression model on the training set
    # (solver set explicitly to match Out[5]; newer scikit-learn versions default to 'lbfgs')
    from sklearn.linear_model import LogisticRegression
    logreg = LogisticRegression(solver='liblinear')
    logreg.fit(X_train, y_train)

Out[5]:
    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                       intercept_scaling=1, max_iter=100, multi_class='ovr',
                       penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
                       verbose=0)

In [6]:
    # make class predictions for the testing set
    y_pred_class = logreg.predict(X_test)

In [7]:
    # calculate accuracy
    from sklearn import metrics
    print(metrics.accuracy_score(y_test, y_pred_class))

    0.692708333333

Classification accuracy is the percentage of all predictions that are correct.

Null accuracy is the accuracy obtained by always predicting the most frequent class.

In [8]:
    # examine the class distribution of the testing set (using a Pandas Series method)
    y_test.value_counts()

Out[8]:
    0    130
    1     62
    dtype: int64

In [9]:
    # calculate the percentage of ones
    y_test.mean()

Out[9]:
    0.32291666666666669

In [10]:
    # calculate the percentage of zeros
    1 - y_test.mean()

Out[10]:
    0.67708333333333326

In [11]:
    # calculate null accuracy (for binary classification problems coded as 0/1)
    max(y_test.mean(), 1 - y_test.mean())

Out[11]:
    0.67708333333333326

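The null accuracy is 68% and the classification accuracy is 69%, so the model barely beats always predicting the majority class. This shows that classification accuracy is not a very good metric here. One of its weaknesses is that it says nothing about the underlying distribution of the test data.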

In [12]:
    # calculate null accuracy (for multi-class classification problems)
    y_test.value_counts().head(1) / len(y_test)

Out[12]:
    0    0.677083
    dtype: float64
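Null accuracy can also be computed without pandas. The following is a minimal sketch, not part of the original notebook: the helper name `null_accuracy` is made up for illustration, and the label list is rebuilt from the class counts shown in Out[8] (130 zeros, 62 ones) rather than loaded from the dataset.

```python
from collections import Counter


def null_accuracy(labels):
    """Accuracy of a baseline that always predicts the most frequent class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Labels rebuilt from the class counts in Out[8]: 130 zeros and 62 ones.
y_test_labels = [0] * 130 + [1] * 62
baseline = null_accuracy(y_test_labels)
print(round(baseline, 6))
```

Because it only counts labels, the same helper works for any number of classes.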

Comparing the true and predicted response values:

In [13]:
    # print the first 25 true and predicted responses
    print("True:", y_test.values[0:25])
    print("Pred:", y_pred_class[0:25])

    True: [1 0 0 1 0 0 1 1 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 0]
    Pred: [0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]

The comparison shows that when the true class is 0 the model almost always predicts 0, but when the true class is 1 it rarely predicts 1. In other words, the model gets most of the majority class right and most of the minority class wrong, and classification accuracy alone does not reveal this.

Classification accuracy is the easiest metric to understand, but it does not tell you the underlying distribution of the response values, nor what types of errors the classifier makes. The confusion matrix, introduced next, exposes both.

3. Confusion matrix

In [14]:
    # IMPORTANT: first argument is true values, second argument is predicted values
    print(metrics.confusion_matrix(y_test, y_pred_class))

    [[118  12]
     [ 47  15]]

True Positive (TP): positive examples that the classifier correctly labels as positive
True Negative (TN): negative examples that the classifier correctly labels as negative
False Positive (FP): negative examples that are incorrectly labeled as positive
False Negative (FN): positive examples that are incorrectly labeled as negative
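The four definitions above can be made concrete by counting them directly. This is an illustrative sketch only: `confusion_counts` is a hypothetical helper (scikit-learn's `confusion_matrix` is what the notebook actually uses), and the data are just the first 25 true and predicted labels printed in In [13], not the full test set.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


# First 25 true and predicted labels, copied from the output of In [13].
y_true = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```

Even on this small slice, false negatives far outnumber true positives, which is the pattern the full confusion matrix shows.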

In [15]:
    # save confusion matrix and slice into four pieces
    confusion = metrics.confusion_matrix(y_test, y_pred_class)
    TP = confusion[1, 1]
    TN = confusion[0, 0]
    FP = confusion[0, 1]
    FN = confusion[1, 0]
    print("TP:", TP)
    print("TN:", TN)
    print("FP:", FP)
    print("FN:", FN)

    TP: 15
    TN: 118
    FP: 12
    FN: 47

4. Metrics computed from the confusion matrix

Accuracy, or recognition rate (Classification Accuracy): the proportion of examples the classifier labels correctly

In [16]:
    print((TP + TN) / float(TP + TN + FN + FP))
    print(metrics.accuracy_score(y_test, y_pred_class))

    0.692708333333
    0.692708333333

Error rate, or misclassification rate (Classification Error): the proportion of examples the classifier labels incorrectly

In [17]:
    print((FP + FN) / float(TP + TN + FN + FP))
    print(1 - metrics.accuracy_score(y_test, y_pred_class))

    0.307291666667
    0.307291666667

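Consider the class imbalance problem, in which the class of interest is rare: the negative class makes up the large majority of the dataset and the positive class a small minority. Here we need other metrics that separately assess how well the classifier identifies positive examples and how well it identifies negative examples.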
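A small made-up example shows why accuracy alone is misleading under class imbalance. The 990/10 split and the always-negative classifier below are assumptions chosen for illustration; they do not come from the diabetes data.

```python
# Hypothetical imbalanced test set: 990 negatives and 10 positives.
y_true = [0] * 990 + [1] * 10
# A degenerate classifier that always predicts the majority (negative) class.
y_pred = [0] * len(y_true)

# Accuracy: share of all examples labeled correctly.
correct = sum(1 for a, p in zip(y_true, y_pred) if a == p)
accuracy = correct / len(y_true)

# Recall: share of actual positives that were found.
true_positives = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
actual_positives = sum(y_true)
recall = true_positives / actual_positives

print("accuracy:", accuracy)
print("recall:", recall)
```

The classifier scores 99% accuracy while finding none of the positive examples, which is exactly the failure that per-class metrics are meant to expose.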

Sensitivity, also called the true positive rate or Recall: the percentage of actual positive examples that are correctly identified

In [18]:
    print(TP / float(TP + FN))
    recall = metrics.recall_score(y_test, y_pred_class)
    print(recall)

    0.241935483871
    0.241935483871

Specificity, also called the true negative rate: the percentage of actual negative examples that are correctly identified

In [19]:
    print(TN / float(TN + FP))

    0.907692307692

False Positive Rate: the percentage of actual negative examples that are incorrectly predicted as positive

In [20]:
    print(FP / float(TN + FP))
    specificity = TN / float(TN + FP)
    print(1 - specificity)

    0.0923076923077
    0.0923076923077

Precision: a measure of exactness, namely the percentage of examples labeled positive that are actually positive

In [21]:
    print(TP / float(TP + FP))
    precision = metrics.precision_score(y_test, y_pred_class)
    print(precision)

    0.555555555556
    0.555555555556

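The F-measure (also called the F1 score or F-score) combines precision and recall into a single metric:

$$F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

$$F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}$$

The F-measure is the harmonic mean of precision and recall and gives them equal weight.

The $F_\beta$ measure is a weighted measure of precision and recall that gives recall $\beta$ times as much weight as precision.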

In [22]:
    print((2 * precision * recall) / (precision + recall))
    print(metrics.f1_score(y_test, y_pred_class))

    0.337078651685
    0.337078651685
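The general F-beta formula can be sketched in a few lines of plain Python. The function `f_beta` is a hypothetical helper written for illustration (scikit-learn provides `metrics.fbeta_score` for real use). The counts TP = 15, FP = 12, FN = 47 are the ones from the confusion matrix above.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives F1."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0  # convention for the degenerate case, assumed here
    return (1 + b2) * precision * recall / denominator


# Counts taken from the confusion matrix at the default threshold.
TP, FP, FN = 15, 12, 47
precision = TP / (TP + FP)
recall = TP / (TP + FN)

f1 = f_beta(precision, recall)                # equal weight
f2 = f_beta(precision, recall, beta=2.0)      # favors recall
f_half = f_beta(precision, recall, beta=0.5)  # favors precision

print("F1:", f1)
print("F2:", f2)
print("F0.5:", f_half)
```

Since recall (about 0.24) is much lower than precision (about 0.56) for this model, weighting recall more heavily pulls the score below F1, and weighting precision more heavily pushes it above.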

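Summary

The confusion matrix gives a more complete picture of a classifier's performance, and the metrics computed from it guide model selection.

Which metric to use depends on the business requirement:

Spam filter: optimize for precision or specificity, because false positives (legitimate mail sent to the spam folder) are less acceptable than false negatives (spam delivered to the inbox)
Fraud detector: optimize for sensitivity, because false negatives (fraud that goes undetected) are less acceptable than false positives (a normal transaction flagged as fraud)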

5. Adjusting the classification threshold

In [23]:
    # print the first 10 predicted responses
    logreg.predict(X_test)[0:10]

Out[23]:
    array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=int64)

In [24]:
    y_test.values[0:10]

Out[24]:
    array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0], dtype=int64)

In [25]:
    # print the first 10 predicted probabilities of class membership
    logreg.predict_proba(X_test)[0:10, :]

Out[25]:
    array([[ 0.63247571,  0.36752429],
           [ 0.71643656,  0.28356344],
           [ 0.71104114,  0.28895886],
           [ 0.5858938 ,  0.4141062 ],
           [ 0.84103973,  0.15896027],
           [ 0.82934844,  0.17065156],
           [ 0.50110974,  0.49889026],
           [ 0.48658459,  0.51341541],
           [ 0.72321388,  0.27678612],
           [ 0.32810562,  0.67189438]])

In the output above, the first column is the predicted probability of class 0 and the second column is the predicted probability of class 1.

In [26]:
    # print the first 10 predicted probabilities for class 1
    logreg.predict_proba(X_test)[0:10, 1]

Out[26]:
    array([ 0.36752429,  0.28356344,  0.28895886,  0.4141062 ,  0.15896027,
            0.17065156,  0.49889026,  0.51341541,  0.27678612,  0.67189438])

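Cut at 50%, the predicted class-1 probabilities agree poorly with the actual labels: three of the four true positives among these ten examples score below 0.5. A 50% threshold is therefore not a good fit here. To pick a better one, we plot a histogram of the predicted class-1 probabilities and then try a different threshold.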

In [27]:
    # store the predicted probabilities for class 1
    y_pred_prob = logreg.predict_proba(X_test)[:, 1]

In [28]:
    # allow plots to appear in the notebook
    %matplotlib inline
    import matplotlib.pyplot as plt

In [29]:
    # histogram of predicted probabilities
    plt.hist(y_pred_prob, bins=8)
    plt.xlim(0, 1)
    plt.title('Histogram of predicted probabilities')
    plt.xlabel('Predicted probability of diabetes')
    plt.ylabel('Frequency')

Out[29]:
    <matplotlib.text.Text at 0x76853b0>

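The histogram shows that as many as 45% of the predicted probabilities fall between 20% and 30%, so with a 50% threshold only a small fraction of the examples are classified as class 1. We can lower the threshold to change the classifier's sensitivity and specificity.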

In [30]:
    # predict diabetes if the predicted probability is greater than 0.3
    # (binarize expects a 2-D array and a keyword threshold, hence the reshape)
    from sklearn.preprocessing import binarize
    y_pred_class = binarize(y_pred_prob.reshape(1, -1), threshold=0.3)[0]

In [31]:
    # print the first 10 predicted probabilities
    y_pred_prob[0:10]

Out[31]:
    array([ 0.36752429,  0.28356344,  0.28895886,  0.4141062 ,  0.15896027,
            0.17065156,  0.49889026,  0.51341541,  0.27678612,  0.67189438])

In [32]:
    # print the first 10 predicted classes with the lower threshold
    y_pred_class[0:10]

Out[32]:
    array([ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  1.,  0.,  1.])

In [33]:
    y_test.values[0:10]

Out[33]:
    array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0], dtype=int64)

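Comparing the predicted classes with the true values, the result has improved considerably: all four positives among the first ten examples are now detected.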

In [34]:
    # previous confusion matrix (default threshold of 0.5)
    print(confusion)

    [[118  12]
     [ 47  15]]

In [35]:
    # new confusion matrix (threshold of 0.3)
    print(metrics.confusion_matrix(y_test, y_pred_class))

    [[80 50]
     [16 46]]

In [36]:
    # sensitivity has increased (used to be 0.24)
    print(46 / float(46 + 16))
    print(metrics.recall_score(y_test, y_pred_class))

    0.741935483871
    0.741935483871

In [37]:
    # specificity has decreased (used to be 0.91)
    print(80 / float(80 + 50))

    0.615384615385

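Summary:

A threshold of 0.5 is the default
Adjusting the threshold changes sensitivity and specificity
Sensitivity and specificity move in opposite directions
Threshold adjustment should be the last step in improving classification performance; focus first on choosing or building a better classifier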
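The effect of moving the threshold can be reproduced by hand on the ten examples printed earlier. This is a sketch under stated assumptions: `classify` and `sensitivity_specificity` are hypothetical helpers, and the data are only the first ten true labels (Out[24]) and class-1 probabilities (Out[26]), so the numbers differ from the full test set results.

```python
def classify(probabilities, threshold):
    """Label an example 1 when its class-1 probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary 0/1 labels."""
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# First ten true labels (Out[24]) and class-1 probabilities (Out[26]).
y_true = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
y_prob = [0.36752429, 0.28356344, 0.28895886, 0.4141062, 0.15896027,
          0.17065156, 0.49889026, 0.51341541, 0.27678612, 0.67189438]

results = {}
for threshold in (0.5, 0.3):
    y_pred = classify(y_prob, threshold)
    results[threshold] = sensitivity_specificity(y_true, y_pred)
    print(threshold, y_pred, results[threshold])
```

On this tiny sample, lowering the threshold raises sensitivity while specificity happens to stay the same. On the full test set the trade-off is visible: sensitivity rises from 0.24 to 0.74 and specificity drops from 0.91 to 0.62.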

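6. ROC curve and AUC

The receiver operating characteristic (ROC) curve is a composite indicator of sensitivity and specificity for a continuous score. It shows the relationship between the two graphically: the continuous score is cut at many different thresholds, and each cut yields one pair of sensitivity and specificity values.

The ROC curve is drawn from a series of binary decision rules (cut-off values or decision thresholds), with the true positive rate (TPR, i.e. sensitivity) on the vertical axis and the false positive rate (FPR, i.e. 1 - specificity) on the horizontal axis.

The ROC curve shows the trade-off between the proportion of positives the model correctly identifies and the proportion of negatives it wrongly labels as positive: TPR increases at the cost of an increase in FPR. The area under the ROC curve measures how well the model separates the two classes.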

In [38]:
    # IMPORTANT: first argument is true values, second argument is predicted probabilities
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
    plt.plot(fpr, tpr)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.title('ROC curve for diabetes classifier')
    plt.xlabel('False Positive Rate (1 - Specificity)')
    plt.ylabel('True Positive Rate (Sensitivity)')
    plt.grid(True)

Each point on the ROC curve corresponds to a threshold; for a given classifier, each threshold yields one TPR and one FPR. At the highest threshold, TP = FP = 0, which gives the origin (0, 0); at the lowest threshold, TN = FN = 0, which gives the top-right point (1, 1).
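The correspondence between thresholds and ROC points can be sketched in plain Python. The helper `roc_points` below is hypothetical and deliberately simple; it is not how `metrics.roc_curve` is implemented. It reuses the first ten true labels and class-1 probabilities printed earlier, not the full test set.

```python
def roc_points(y_true, y_score):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Start above every score (nothing predicted positive), then relax the
    # threshold to each distinct score in descending order.
    thresholds = [float("inf")] + sorted(set(y_score), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# First ten true labels and class-1 probabilities from the notebook output.
y_true = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
y_score = [0.36752429, 0.28356344, 0.28895886, 0.4141062, 0.15896027,
           0.17065156, 0.49889026, 0.51341541, 0.27678612, 0.67189438]

points = roc_points(y_true, y_score)
for fpr_value, tpr_value in points:
    print(round(fpr_value, 3), round(tpr_value, 3))
```

The first point is the origin and the last is (1, 1), matching the two extreme thresholds described above; every point in between comes from one intermediate threshold.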

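As noted above, TPR increases at the cost of FPR, so the ROC curve helps choose a threshold that balances sensitivity and specificity. The curve itself does not show which threshold produced each point, so we use the following function to look that up.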

In [39]:
    # define a function that accepts a threshold and prints sensitivity and specificity
    def evaluate_threshold(threshold):
        print('Sensitivity:', tpr[thresholds > threshold][-1])
        print('Specificity:', 1 - fpr[thresholds > threshold][-1])

In [40]:
    evaluate_threshold(0.5)

    Sensitivity: 0.241935483871
    Specificity: 0.907692307692

In [41]:
    evaluate_threshold(0.3)

    Sensitivity: 0.741935483871
    Specificity: 0.615384615385

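AUC (Area Under the Curve) is defined as the area under the ROC curve, or equivalently the fraction of the unit square that lies under the curve, so it cannot exceed 1. Because the ROC curve usually lies above the line y = x, AUC usually falls between 0.5 and 1.

A classifier with a larger AUC performs better. AUC is therefore a good single-number measure of classifier performance and, unlike classification accuracy, it remains useful when the classes are highly imbalanced. In fraud detection, for example, fraud cases are a tiny fraction of all transactions, so classification accuracy is a poor metric and AUC can be used instead.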

In [42]:
    # IMPORTANT: first argument is true values, second argument is predicted probabilities
    print(metrics.roc_auc_score(y_test, y_pred_prob))

    0.724565756824

In [43]:
    # calculate cross-validated AUC
    # (cross_val_score now lives in sklearn.model_selection)
    from sklearn.model_selection import cross_val_score
    cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()

Out[43]:
    0.73782336182336183
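AUC also has an equivalent ranking interpretation that is not spelled out in the original post: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below uses that pairwise formulation instead of integrating the curve. `pairwise_auc` is a hypothetical helper, and the data are again only the first ten examples, so the result is not the 0.7246 reported for the full test set.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos_scores = [s for y, s in zip(y_true, y_score) if y == 1]
    neg_scores = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))


# First ten true labels and class-1 probabilities from the notebook output.
y_true = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
y_score = [0.36752429, 0.28356344, 0.28895886, 0.4141062, 0.15896027,
           0.17065156, 0.49889026, 0.51341541, 0.27678612, 0.67189438]

auc = pairwise_auc(y_true, y_score)
print("AUC on the ten-example sample:", auc)
```

Because the score depends only on how positives rank against negatives, it does not change with the choice of threshold or with the class proportions, which is why AUC stays informative on imbalanced data.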
