Machine Learning with Scikit-learn and TensorFlow (Part 3)

Published: 2025/3/19


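Contents

Classification
- I. MNIST
- II. Getting the data
  - 1. Fetching from the network
  - 2. Reading from a local file
- III. Training a binary classifier
- IV. Performance evaluation
  - 1. Cross-validation: accuracy
    - (1) Hand-rolled `cross_val_score()`
    - (2) Built-in `cross_val_score()`
    - (3) A dumb classifier
  - 2. Confusion matrix
    - (1) Precision
    - (2) Recall
  - 3. Precision and recall
  - 4. F1 score
  - 5. The precision/recall trade-off: the PR curve
  - 6. The ROC curve
  - 7. PR curve vs. ROC curve
- V. Multiclass classification
  - 1. Binary classifier ==> multiclass classifier
- VI. Error analysis
  - 1. Inspecting the confusion matrix
  - 2. Analyzing the confusion matrix
- VII. Multilabel classification
  - 1. Training and prediction
  - 2. Evaluation
- VIII. Multioutput classification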
Classification

I. MNIST

The MNIST dataset: 70,000 small images of handwritten digits.

II. Getting the data

1. Fetching from the network

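    from sklearn.datasets import fetch_mldata

    # Note: fetch_mldata was deprecated in scikit-learn 0.20 and removed in 0.22,
    # so this call only works on older versions.
    mnist = fetch_mldata('MNIST original')
    print(mnist)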

Output

    {'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Sun Mar 30 03:19:02 2014',
     '__version__': '1.0',
     '__globals__': [],
     'mldata_descr_ordering': array([[array(['label'], dtype='<U5'), array(['data'], dtype='<U4')]], dtype=object),
     'data': array([[0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0],
                    ...,
                    [0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
     'label': array([[0., 0., 0., ..., 9., 9., 9.]])}

In general, datasets loaded this way share a similar dictionary structure. Here the keys are `__header__`, `__version__`, `__globals__`, `mldata_descr_ordering`, `data` and `label`.
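Since `fetch_mldata` is no longer available in recent scikit-learn releases, a possible replacement is `fetch_openml`. The following is only a sketch under that assumption: the helper name `load_mnist` is made up for illustration, the returned object has `data` and `target` keys rather than the keys listed above, and the sample order may differ from the `.mat` file, so an index such as 39000 need not point at the same digit.

```python
from sklearn.datasets import fetch_openml


def load_mnist():
    """Download MNIST from OpenML and return (X, y) as NumPy arrays."""
    # as_frame=False asks for NumPy arrays instead of a pandas DataFrame.
    mnist = fetch_openml("mnist_784", version=1, as_frame=False)
    X = mnist["data"]                   # one row per image, 784 pixel columns
    y = mnist["target"].astype(float)   # OpenML returns the labels as strings
    return X, y

# Usage (needs network access the first time it runs):
# X, y = load_mnist()
```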

2. Reading from a local file

The download above may fail, so this blog provides the dataset as a separate download:
Link: https://pan.baidu.com/s/1VLD1CmMqWIoDotqf-9umUA  Extraction code: exw9

    import scipy.io as sio
    import numpy as np
    import matplotlib
    import matplotlib.pyplot as plt

    mnist = sio.loadmat('./datasets/mnist/mnist-original.mat')
    print(mnist)

    # Transpose so that each row is one sample
    X, y = mnist["data"].T, mnist["label"].T
    print(X.shape)
    print(y.shape)

    ## View a sample
    some_digit = X[39000]
    some_digit_image = some_digit.reshape(28, 28)
    plt.imshow(some_digit_image, cmap=matplotlib.cm.binary, interpolation="nearest")
    plt.axis("off")
    plt.show()
    print(y[39000])

    ## Create the test set
    # The first 60000 images are the training set,
    # the last 10000 images are the test set.
    X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

    ## Shuffle the training set
    shuffle_index = np.random.permutation(60000)
    X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

Output: [Figure: the sample digit X[39000] drawn as a 28×28 image]
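The shuffling step works because the same permutation indexes both arrays. A minimal sketch on made-up data (the arrays `X_demo` and `y_demo` are illustrative only) shows that rows and labels stay paired:

```python
import numpy as np

# Made-up data: each label is the row value divided by 10.
X_demo = np.array([[10], [20], [30], [40]])
y_demo = np.array([1, 2, 3, 4])

rng = np.random.default_rng(2019)
shuffle_index = rng.permutation(len(X_demo))

# Indexing both arrays with the same permutation keeps rows and labels paired.
X_shuffled, y_shuffled = X_demo[shuffle_index], y_demo[shuffle_index]
print(X_shuffled.ravel(), y_shuffled)
```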

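III. Training a binary classifier

Binary classification: the two classes are "is 6" and "not 6".

    from sklearn.linear_model import SGDClassifier

    ## Create the binary target labels
    y_train_6 = (y_train == 6)
    y_test_6 = (y_test == 6)

    ## Stochastic gradient descent (SGD) classifier
    # SGD handles one training instance at a time ==> suitable for online learning
    sgd_clf = SGDClassifier(random_state=2019)
    sgd_clf.fit(X_train, y_train_6)

    ## Predict
    print(sgd_clf.predict([some_digit]))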

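Output

    [ True]

IV. Performance evaluation

1. Cross-validation: accuracy

K-fold cross-validation: split the training set into K folds; then, for each fold in turn, train a model on the other K-1 folds and make predictions on the held-out fold.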
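To make the fold structure concrete, here is a small sketch using scikit-learn's `StratifiedKFold` on made-up toy labels (`X_toy` and `y_toy` are assumptions for illustration, not the MNIST data). Every sample lands in exactly one held-out fold, and each fold keeps the overall class ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 12 samples, 3 of them positive (25%).
X_toy = np.arange(12).reshape(-1, 1)
y_toy = np.array([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1])

skf = StratifiedKFold(n_splits=3)
test_folds = []
for train_idx, test_idx in skf.split(X_toy, y_toy):
    test_folds.append(test_idx)
    # Train on train_idx (8 samples), evaluate on test_idx (4 samples).
    print(test_idx, "positives in held-out fold:", y_toy[test_idx].sum())
```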

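(1) Hand-rolled cross_val_score()

Procedure: the code below uses StratifiedShuffleSplit, which performs stratified sampling so that every split contains a representative proportion of each class. (StratifiedKFold is the class to use when the K test folds must be disjoint; StratifiedShuffleSplit draws independent random splits instead.) At each iteration the code creates a clone of the classifier, trains the clone on the training split, and tests it on the test split. Finally it computes the accuracy, i.e. the fraction of correct predictions.

    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.base import clone

    skfolds = StratifiedShuffleSplit(n_splits=3, random_state=2019)
    for train_index, test_index in skfolds.split(X_train, y_train_6):
        clone_clf = clone(sgd_clf)
        X_train_folds = X_train[train_index]
        y_train_folds = y_train_6[train_index]
        X_test_fold = X_train[test_index]
        y_test_fold = y_train_6[test_index]

        clone_clf.fit(X_train_folds, y_train_folds.ravel())
        y_pred = clone_clf.predict(X_test_fold)
        n_correct = sum(y_pred == y_test_fold.ravel())
        print(n_correct / len(y_pred))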

Output

    0.982
    0.9753333333333334
    0.976

(2) Built-in cross_val_score()

    from sklearn.model_selection import cross_val_score

    score = cross_val_score(sgd_clf, X_train, y_train_6, cv=3, scoring="accuracy")
    print(score)  # accuracy on each of the 3 folds

Output

    [0.97780111 0.982 0.98429921]

(3) A dumb classifier

    from sklearn.base import BaseEstimator

    class Never6Classifier(BaseEstimator):
        """Predicts "not 6" for every input."""
        def fit(self, X, y=None):
            pass

        def predict(self, X):
            return np.zeros((len(X), 1), dtype=bool)

    never_6_clf = Never6Classifier()
    score6 = cross_val_score(never_6_clf, X_train, y_train_6, cv=3, scoring="accuracy")
    print(score6)

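Output

    [0.90045 0.9028 0.90085]

Because of how the data is distributed (only about 10% of the images are 6s), even this dumb classifier reaches roughly 90% accuracy.
==> Accuracy is generally not a good performance measure, especially on skewed datasets, e.g. imbalanced data where some classes are much more frequent than others.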
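The point can be reproduced with a few lines of NumPy. This is a sketch on made-up labels (10% positives, mirroring the share of 6s), not on the MNIST data itself: a predictor that always answers "not 6" scores 90% accuracy while finding none of the positives.

```python
import numpy as np

# Made-up labels: 100 positives among 1000 samples (10%).
y_true = np.zeros(1000, dtype=bool)
y_true[:100] = True

# A classifier that always answers "not 6".
y_pred = np.zeros(1000, dtype=bool)

accuracy = np.mean(y_pred == y_true)               # fraction of correct answers
recall = np.sum(y_pred & y_true) / np.sum(y_true)  # fraction of positives found
print(accuracy, recall)
```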

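2. Confusion matrix

cross_val_predict() also performs K-fold cross-validation, but instead of scores it returns the prediction made for each instance while that instance was in the held-out fold.
confusion_matrix() computes the confusion matrix from the ground-truth labels and the predictions.

Idea: count how many times instances of class A are classified as class B. E.g., to find how often the classifier mistook a 5 for a 3, look at the row for actual class 5 and the column for predicted class 3 (entry [5, 3] when the classes are the digits 0 to 9).

    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix

    y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_6, cv=3)
    print(confusion_matrix(y_train_6, y_train_pred))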

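Output

    [[53419   663]
     [  529  5389]]

Interpretation:

Each row of the confusion matrix represents an actual class, and each column represents a predicted class.
- The first row covers the "not 6" images (negatives): 53419 were correctly classified as "not 6" (true negatives, TN), and the remaining 663 were wrongly classified as "is 6" (false positives, FP). The second row covers the "is 6" images (positives): 529 were wrongly classified as "not 6" (false negatives, FN), and the remaining 5389 were correctly classified as "is 6" (true positives, TP).
A perfect classifier would have only true negatives and true positives, so its confusion matrix would have nonzero values only on the main diagonal (top left to bottom right).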
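The row/column convention can be checked on a tiny example. The sketch below uses six made-up samples (not MNIST) and unpacks the four cells in the order that follows from "rows are actual, columns are predicted".

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Six made-up samples: True means "is 6".
y_true = np.array([False, False, False, True, True, True])
y_pred = np.array([False, True, False, True, False, True])

cm = confusion_matrix(y_true, y_pred)
# Rows are actual classes and columns are predicted classes,
# so flattening the 2x2 matrix yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)
```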

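(1) Precision

$precision=\frac{TP}{TP+FP}$
where TP is the number of true positives and FP is the number of false positives.

(2) Recall

Recall, also called sensitivity or the true positive rate (TPR), is the proportion of positive instances that the classifier correctly detects.
$recall=\frac{TP}{TP+FN}$
where FN is the number of false negatives.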
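As a worked example, the two formulas can be applied directly to the counts in the confusion matrix printed earlier (53419, 663, 529, 5389). Note that the scores this produces need not match values printed elsewhere in the post, which may come from a different run of the classifier.

```python
# Counts read off the confusion matrix printed earlier:
# [[53419   663]
#  [  529  5389]]
tn, fp, fn, tp = 53419, 663, 529, 5389

precision = tp / (tp + fp)   # share of predicted 6s that really are 6s
recall = tp / (tp + fn)      # share of actual 6s that were detected
print(round(precision, 4), round(recall, 4))
```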

3. Precision and recall

    from sklearn.metrics import precision_score, recall_score

    precision = precision_score(y_train_6, y_train_pred)
    recall = recall_score(y_train_6, y_train_pred)
    print("The precision is ", precision)
    print("The recall is ", recall)

Output

    The precision is 0.8694196428571429
    The recall is 0.921426157485637

4. F1 score

The F1 score is the harmonic mean of precision and recall. The harmonic mean gives more weight to low values, so a high F1 score requires both high recall and high precision.
$F1=\frac{2}{\frac{1}{precision}+\frac{1}{recall}}=2*\frac{precision*recall}{precision+recall}=\frac{TP}{TP+\frac{FN+FP}{2}}$

Call f1_score() to obtain the F1 score.

    from sklearn.metrics import f1_score

    f1 = f1_score(y_train_6, y_train_pred)
    print("The F1 score is ", f1)

Output

    The F1 score is 0.8615836283567914
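To see how the harmonic mean penalizes a low value, consider made-up counts that give a precision of 0.9 but a recall of only 0.1. The sketch below (helper names are illustrative) evaluates two of the equivalent forms of the F1 formula and compares them with the plain arithmetic mean.

```python
def f1_from_counts(tp, fp, fn):
    """F1 written in terms of the confusion-matrix counts."""
    return tp / (tp + (fn + fp) / 2)


def f1_from_scores(precision, recall):
    """F1 written as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Made-up counts: precision = 90/100 = 0.9, recall = 90/900 = 0.1.
tp, fp, fn = 90, 10, 810
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1_counts = f1_from_counts(tp, fp, fn)
f1_scores = f1_from_scores(precision, recall)
arithmetic_mean = (precision + recall) / 2
print(f1_counts, f1_scores, arithmetic_mean)
```

Both forms agree, and the result sits far closer to the weak recall than the arithmetic mean does.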

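5. The precision/recall trade-off: the PR curve

Depending on the application, you may care more about recall or more about precision. Increasing precision reduces recall, and vice versa.
==> the precision/recall trade-off

Prediction: each instance's score is compared with a threshold to decide between the positive and the negative class. Lowering the threshold raises recall and lowers precision.

In sklearn this is done through decision scores: call decision_function(), which returns a score for each instance, then make predictions from those scores with any threshold you choose.

    y_scores = sgd_clf.decision_function([some_digit])
    print(y_scores)

    ## Threshold 1
    threshold = 0
    y_some_digit_pred = (y_scores > threshold)
    print(y_some_digit_pred)

    ## Threshold 2
    threshold = 200000
    y_some_digit_pred = (y_scores > threshold)
    print(y_some_digit_pred)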

Output

    [97250.73888009]
    [ True]
    [False]

==> Raising the threshold lowers recall.
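The effect of the threshold on both metrics can be seen on a handful of made-up scores (the arrays below are illustrative, not taken from the classifier above):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up decision scores and true labels for eight samples.
scores = np.array([-3.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0, 4.0])
y_true = np.array([False, False, True, False, True, True, True, True])

results = {}
for threshold in (0.0, 1.5):
    y_pred = scores > threshold
    results[threshold] = (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
    )
    print(threshold, results[threshold])
```

Moving the threshold from 0.0 to 1.5 removes the one false positive (precision rises) but also drops a true positive (recall falls).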
==> Choosing a threshold

    from sklearn.metrics import precision_recall_curve

    ## Get decision scores rather than predictions
    y_scores = cross_val_predict(sgd_clf, X_train, y_train_6, cv=3, method="decision_function")
    precisions, recalls, thresholds = precision_recall_curve(y_train_6, y_scores)

    def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
        plt.plot(thresholds, precisions[:-1], 'b--', label="Precision")
        plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
        plt.xlabel("Threshold")
        plt.legend(loc="upper left")
        plt.ylim([0, 1])

    plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
    plt.show()

    # Aim for 90% precision
    y_train_pred_90 = (y_scores > 70000)
    precision = precision_score(y_train_6, y_train_pred_90)
    recall = recall_score(y_train_6, y_train_pred_90)
    print("The precision is ", precision)
    print("The recall is ", recall)

Output

    The precision is 0.9231185706551164
    The recall is 0.8643122676579925
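Instead of reading a value such as 70000 off the plot, the threshold can also be looked up from the arrays returned by `precision_recall_curve`. The sketch below does this on synthetic scores; the data, the seed and the 90% target are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score

rng = np.random.default_rng(0)
# Synthetic scores: 100 positives centred at 3, 900 negatives centred at 0.
y_true = np.r_[np.ones(100, dtype=bool), np.zeros(900, dtype=bool)]
scores = np.r_[rng.normal(3.0, 1.0, 100), rng.normal(0.0, 1.0, 900)]

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)

# precisions has one more entry than thresholds, so drop the last one.
# argmax on a boolean array returns the first index where it is True.
idx = np.argmax(precisions[:-1] >= 0.90)
threshold_90 = thresholds[idx]

# precisions[i] describes predictions with score >= thresholds[i].
y_pred_90 = scores >= threshold_90
precision_90 = precision_score(y_true, y_pred_90)
print(threshold_90, precision_90, recalls[idx])
```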

6. The ROC curve

The receiver operating characteristic (ROC) curve plots the true positive rate (TPR, another name for recall) against the false positive rate (FPR).

To plot it, compute the TPR and FPR at various thresholds with the roc_curve() function.

    ## ROC curve
    from sklearn.metrics import roc_curve

    fpr, tpr, thresholds = roc_curve(y_train_6, y_scores)

    def plot_roc_curve(fpr, tpr, label=None):
        plt.plot(fpr, tpr, linewidth=2, label=label)
        plt.plot([0, 1], [0, 1], 'k--')  # diagonal of a purely random classifier
        plt.axis([0, 1, 0, 1])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.legend(loc="lower right")

    plot_roc_curve(fpr, tpr)
    plt.show()
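To make the two axes concrete, TPR and FPR at a single threshold can be computed by hand. This sketch uses made-up scores, and the helper `tpr_fpr` is a name introduced here for illustration; each threshold contributes one point of the ROC curve.

```python
import numpy as np

# Made-up decision scores and true labels for eight samples.
scores = np.array([-3.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0, 4.0])
y_true = np.array([False, False, True, False, True, True, True, True])


def tpr_fpr(y_true, scores, threshold):
    """Return (TPR, FPR) for predictions made with the given threshold."""
    y_pred = scores > threshold
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    tpr = tp / np.sum(y_true)    # positives that were detected (recall)
    fpr = fp / np.sum(~y_true)   # negatives wrongly flagged as positive
    return tpr, fpr


for t in (-2.0, 0.0, 1.5):
    print(t, tpr_fpr(y_true, scores, t))
```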

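One way to compare classifiers is to measure the area under the ROC curve (AUC): a perfect classifier has a ROC AUC of 1, while a purely random classifier has a ROC AUC of 0.5.

    from sklearn.metrics import roc_auc_score

    auc = roc_auc_score(y_train_6, y_scores)
    print(auc)

Output

    0.9859523237334556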
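The two reference values, 1 for a perfect classifier and 0.5 for a random one, can be checked on synthetic labels. In this sketch the "perfect" scores are simply the labels themselves and the "random" scores are drawn independently of the labels; both are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=10_000)

auc_perfect = roc_auc_score(y_true, y_true)              # scores equal the labels
auc_random = roc_auc_score(y_true, rng.random(10_000))   # scores ignore the labels
print(auc_perfect, auc_random)
```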

7. PR curve vs. ROC curve

Prefer the PR curve when the positive class is rare or when you care more about false positives than about false negatives; otherwise use the ROC curve.

    from sklearn.ensemble import RandomForestClassifier

    ## RandomForestClassifier has no decision_function() method; it offers predict_proba() instead
    forest_clf = RandomForestClassifier(random_state=2019)
    y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_6, cv=3, method="predict_proba")

    # Use the probability of the positive class as the score
    y_scores_forest = y_probas_forest[:, 1]
    fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_6, y_scores_forest)

    plt.plot(fpr, tpr, "b:", label="SGD")
    plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
    plt.legend(loc="lower right")  # "bottom right" is not a valid matplotlib location
    plt.show()

    auc_forest = roc_auc_score(y_train_6, y_scores_forest)
    print("The AUC is ", auc_forest)

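Output

    The AUC is 0.9956826118210167

Analysis: the RandomForestClassifier's ROC curve looks better than the SGDClassifier's: it comes closer to the top-left corner, and its AUC is higher (0.9957 vs. 0.9860).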
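The advice to prefer the PR curve for rare positives can be illustrated on synthetic data. In the sketch below (the 1% positive rate and the score distributions are assumptions), the ROC AUC looks comfortable while `average_precision_score`, a summary of the PR curve, is much lower.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
# Synthetic, heavily imbalanced data: 1% positives.
y_true = np.r_[np.ones(100, dtype=bool), np.zeros(9900, dtype=bool)]
scores = np.r_[rng.normal(1.5, 1.0, 100), rng.normal(0.0, 1.0, 9900)]

roc_auc = roc_auc_score(y_true, scores)
avg_precision = average_precision_score(y_true, scores)  # summarizes the PR curve
print(roc_auc, avg_precision)
```

With so few positives, even a small false positive rate produces many false positives relative to the true positives, which the PR summary exposes and the ROC AUC largely hides.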

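V. Multiclass classification

Algorithms that can handle multiple classes directly: random forests, naive Bayes.
Strictly binary classifiers: SVMs, linear classifiers.

1. Binary classifier ==> multiclass classifier

(E.g., classifying into 10 classes)

One-versus-all (OvA) strategy: train 10 binary classifiers, one per class (class 1 vs. the rest, class 2 vs. the rest, ...).
One-versus-one (OvO) strategy: train a binary classifier for every pair of classes. With N classes, N*(N-1)/2 classifiers are needed.
- Advantage: each classifier only needs to be trained on the part of the training set that covers the two classes it must distinguish.

Some algorithms (e.g. SVMs) scale poorly with the size of the training set ==> OvO (it is faster to train many classifiers on small training sets).
Large datasets ==> OvA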
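The classifier counts for the two strategies follow directly from counting classes and pairs of classes. A small sketch for the 10-digit case:

```python
from itertools import combinations

classes = list(range(10))                    # digits 0..9
ovo_pairs = list(combinations(classes, 2))   # OvO: one binary classifier per pair

n = len(classes)
n_ovo = n * (n - 1) // 2                     # closed-form count of the pairs
n_ova = n                                    # OvA: one binary classifier per class
print(len(ovo_pairs), n_ovo, n_ova)
```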

In sklearn, when a binary classification algorithm is used for a multiclass task, OvA is applied automatically (OvO for SVM classifiers).

    sgd_clf.fit(X_train, y_train)
    print(sgd_clf.predict([some_digit]))

    some_digit_scores = sgd_clf.decision_function([some_digit])
    print(some_digit_scores)
    ## Index of the highest score
    print(np.argmax(some_digit_scores))
    ## The target classes
    print(sgd_clf.classes_)
    print(sgd_clf.classes_[6])

Output

    [6.]
    [[-493795.59394766 -316594.71495827  -59032.18005876 -300444.77319706
      -434956.4672297  -292368.411729    276453.49558919 -750703.98392662
      -296673.25971762 -565079.84324395]]
    6
    [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
    6.0
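In the output above, the index of the highest score happens to equal the predicted digit because the classes are 0 to 9 in order. In general the index has to be mapped back through `classes_`. The sketch below uses a hypothetical `classes` array and made-up scores to show the difference:

```python
import numpy as np

# Hypothetical classes_ array and decision scores for one sample.
classes = np.array([2.0, 5.0, 6.0])
scores = np.array([[-1200.0, -300.0, 450.0]])

best_index = int(np.argmax(scores))   # position of the highest score
predicted = classes[best_index]       # map the position back to a class label
print(best_index, predicted)
```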

To force OvO or OvA, use the OneVsOneClassifier or OneVsRestClassifier classes.

    ## Build an OvO classifier based on SGDClassifier
    from sklearn.multiclass import OneVsOneClassifier

    ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=2019))
    ovo_clf.fit(X_train, y_train)
    ovo = ovo_clf.predict([some_digit])
    print(ovo)
    print(len(ovo_clf.estimators_))  # number of binary classifiers trained

    ## Train a RandomForestClassifier
    forest_clf.fit(X_train, y_train)
    forest = forest_clf.predict([some_digit])
    print(forest)
    # Probabilities assigned to each class for this sample
    forest_proba = forest_clf.predict_proba([some_digit])
    print(forest_proba)
    # Evaluate the classifier with cross-validation
    forest_score = cross_val_score(forest_clf, X_train, y_train, cv=3, scoring="accuracy")
    print(forest_score)

    ## Add preprocessing: standardize the inputs
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
    score_std = cross_val_score(forest_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
    print(score_std)

Output

    [6.]
    45
    [6.]
    [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]
    [0.940012 0.93944697 0.94039106]
    [0.940012 0.93949697 0.94034105]
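The code above only exercises OneVsOneClassifier. For comparison, here is a sketch of the OvA counterpart, OneVsRestClassifier. As an assumption made to keep it self-contained, it uses the small `load_digits` dataset bundled with scikit-learn (8×8 images) as a stand-in for MNIST.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

# load_digits ships with scikit-learn, so nothing has to be downloaded.
X_small, y_small = load_digits(return_X_y=True)

ovr_clf = OneVsRestClassifier(SGDClassifier(random_state=2019))
ovr_clf.fit(X_small, y_small)

n_estimators = len(ovr_clf.estimators_)   # one binary classifier per class
print(n_estimators)
print(ovr_clf.predict(X_small[:1]), y_small[:1])
```

With 10 classes, OvA trains 10 binary classifiers, against the 45 reported for OvO above.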

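VI. Error analysis

Once you have a promising model and want to improve it, analyze the types of errors it makes.

1. Inspecting the confusion matrix

cross_val_predict() makes the predictions => confusion_matrix() computes the confusion matrix => matshow() displays it

    y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
    conf_mx = confusion_matrix(y_train, y_train_pred)
    print(conf_mx)

    ## Show the confusion matrix as an image
    plt.matshow(conf_mx, cmap=plt.cm.gray)
    plt.show()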

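Output: [Figure: the confusion matrix rendered as a grayscale image]

2. Analyzing the confusion matrix

Analysis: the cell for digit 5 is noticeably darker than those of the other digits.
Possible causes: 1. there are fewer images of 5s in the dataset; 2. the classifier does not perform as well on 5s as on the other digits.

Compare error rates rather than absolute numbers of errors. Method: divide each value in the confusion matrix by the number of images in the corresponding actual class (the row sum).

    row_sums = conf_mx.sum(axis=1, keepdims=True)
    norm_conf_mx = conf_mx / row_sums

    ## Fill the diagonal with zeros so that only the errors remain
    np.fill_diagonal(norm_conf_mx, 0)
    plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
    plt.show()

Note: rows represent actual classes and columns represent predicted classes. The matrix is not strictly symmetric.
The columns for 8 and 9 are bright, which means many images are misclassified as 8 or 9. A row that is especially dark means most images of that class are classified correctly. More 5s are misclassified as 8s than 8s are misclassified as 5s.
==> Work on improving the classifier on 8s and 9s, and on fixing the 3/5 confusion.
==> Gather more data, engineer new features, or preprocess the inputs (e.g. preprocess the images so that they are well centered and not too rotated).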
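The row normalization described above can be checked on a small made-up matrix. In this sketch (the 3-class counts are invented), classes 0 and 1 both have 10 errors in absolute terms, but class 1 has far fewer images, so its error rates are twice as high.

```python
import numpy as np

# Made-up 3-class confusion matrix (rows: actual, columns: predicted).
conf_mx = np.array([[50,  2,  8],
                    [ 5, 20,  5],
                    [ 1,  1, 98]])

row_sums = conf_mx.sum(axis=1, keepdims=True)   # number of images per actual class
norm_conf_mx = conf_mx / row_sums               # error rates instead of counts
np.fill_diagonal(norm_conf_mx, 0)               # keep only the errors
print(norm_conf_mx)
```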

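VII. Multilabel classification

A classification system that outputs multiple labels for each instance is called a multilabel classification system.

1. Training and prediction

    from sklearn.neighbors import KNeighborsClassifier

    # Build the y_multilabel array, which holds two target labels per image.
    y_train_large = (y_train >= 7)
    y_train_odd = (y_train % 2 == 1)
    y_multilabel = np.c_[y_train_large, y_train_odd]

    knn_clf = KNeighborsClassifier()
    knn_clf.fit(X_train, y_multilabel)
    pred_knn = knn_clf.predict([some_digit])
    print(pred_knn)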

Output
The digit 6 is not large (it is below 7) and is not odd:

    [[False False]]
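The construction of the two labels can be inspected on the ten digits themselves. This is a small sketch that mirrors the `np.c_` call used for `y_multilabel`, applied to the values 0 to 9 instead of the training labels:

```python
import numpy as np

digits = np.arange(10)
is_large = digits >= 7
is_odd = digits % 2 == 1

# np.c_ stacks the two boolean vectors as columns: shape (10, 2).
y_multilabel_demo = np.c_[is_large, is_odd]

for d, labels in zip(digits, y_multilabel_demo):
    print(d, labels)
```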

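2. Evaluation

Evaluating the classifier and choosing the right metric

Measure the F1 score for each individual label, then compute the average.

    # Cross-validate against the multilabel targets the classifier was trained on
    y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
    f1_score_knn = f1_score(y_multilabel, y_train_knn_pred, average="macro")
    print(f1_score_knn)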

Output

    0.9684568539645069

To weight the labels, e.g. to give each label a weight equal to its support (the number of instances with that label), set average="weighted".
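The difference between the two averaging modes can be verified on a few made-up multilabel rows. The sketch below computes the per-label F1 scores and checks that "macro" is their plain mean while "weighted" is their mean weighted by support.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up multilabel targets and predictions: one column per label.
y_true = np.array([[1, 0], [1, 0], [1, 1], [1, 0], [0, 1], [1, 0]])
y_pred = np.array([[1, 0], [1, 1], [1, 0], [1, 0], [0, 1], [0, 0]])

per_label = f1_score(y_true, y_pred, average=None)   # one F1 score per label
support = y_true.sum(axis=0)                          # positives per label

macro = f1_score(y_true, y_pred, average="macro")
weighted = f1_score(y_true, y_pred, average="weighted")
print(per_label, support, macro, weighted)
```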

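VIII. Multioutput classification

Multioutput-multiclass classification, or multioutput classification for short.

Example: image denoising. The output is multilabel (one label per pixel) and each label can take multiple values (pixel intensities range from 0 to 255), so this is a multioutput classification system.

    import numpy as np
    import matplotlib
    import matplotlib.pyplot as plt

    ## Add noise (np.random.randint takes the output shape as a tuple)
    noise_train = np.random.randint(0, 100, (len(X_train), 784))
    noise_test = np.random.randint(0, 100, (len(X_test), 784))
    X_train_mod = X_train + noise_train
    X_test_mod = X_test + noise_test
    y_train_mod = X_train
    y_test_mod = X_test

    def plot_digit(data):
        # Helper not defined in the original post; a minimal assumed version.
        plt.imshow(data.reshape(28, 28), cmap=matplotlib.cm.binary, interpolation="nearest")
        plt.axis("off")
        plt.show()

    ## Train and predict
    some_index = 0  # assumption: the post does not say which test image was used
    knn_clf.fit(X_train_mod, y_train_mod)
    clean_digit = knn_clf.predict([X_test_mod[some_index]])
    plot_digit(clean_digit)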

Output: [Figure: the denoised digit image]
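A reduced version of the denoising setup can be run without the MNIST file. As an assumption made for illustration, this sketch swaps in scikit-learn's bundled `load_digits` images (8×8 pixels, intensities 0 to 16) and a made-up noise level; the idea is the same: noisy pixels in, clean pixels out, one label per pixel.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

# Small stand-in for MNIST: 8x8 digits with integer pixel values.
X_clean = load_digits().data.astype(int)

rng = np.random.default_rng(0)
noise = rng.integers(0, 5, size=X_clean.shape)   # made-up noise level
X_noisy = X_clean + noise

# Inputs are the noisy images; targets are the clean pixels (64 labels per image).
knn = KNeighborsClassifier()
knn.fit(X_noisy[:-1], X_clean[:-1])

restored = knn.predict(X_noisy[-1:])             # the last image was held out
print(restored.shape)
```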
