Heart Disease Prediction and Visualization with a Random Forest Model (pdpbox, eli5, shap, graphviz), with Installation Notes for the Libraries

Published: 2023/12/9 · Programming Q&A · 39 · 豆豆

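Contents

Preface
I. Project workflow
II. The PDPBOX, ELI5, SHAP and SEABORN libraries
III. Project walkthrough:
- 1. Import libraries
- 2. Data preprocessing and type conversion
  - 1) Load the data
  - 2) Missing values
  - 3) Rename the columns
  - 4) Convert the fields
- 3. Building and interpreting the random forest model
  - 1) Split the data
  - 2) Build the model
- 4. Decision tree visualization
- 5. Classification metrics based on the confusion matrix
  - 1) Confusion matrix
  - 2) Compute sensitivity and specificity
  - 3) Plot the ROC curve
- 6. Plotting and interpreting partial dependence plots (PDP)
  - 1) Permutation importance
  - 2) One-dimensional PDP
  - 3) 2D PDP plots
- 7. AutoML machine learning: using and interpreting the SHAP library
Summary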

Preface

Of all the applications of machine learning, diagnosing any serious disease using a black box is always going to be a hard sell. If the output from a model is a particular course of treatment (potentially with side effects), or surgery, or the absence of treatment, people are going to want to know why. This dataset gives a number of variables along with a target condition of having or not having heart disease. Below, the data is first used in a simple random forest model, and then the model is investigated using ML explainability tools and techniques.

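I. Project workflow

Data preprocessing and type conversion

Building and interpreting the random forest model

Decision tree visualization

Classification metrics based on the confusion matrix

Plotting and interpreting partial dependence plots (PDP)

AutoML machine learning: using and interpreting the SHAP library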

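II. The PDPBOX, ELI5, SHAP and SEABORN libraries

Prerequisite:
Machine learning projects pull in many third-party libraries, so I recommend creating a dedicated conda environment for them to avoid conflicts between libraries. Create a conda environment named project1 as follows:

conda create -n project1 python=3.7

When conda prints the hint "$ conda activate project1", the project1 environment has been created.

conda activate project1

This switches into the project1 environment.

The main libraries for this project are pdpbox, eli5, shap and seaborn. Each is introduced below: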

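PDPBOX:
A PDP (Partial Dependence Plot) shows the marginal effect of a feature on the predictions of a machine learning model. It is used to assess whether the relationship between a feature and the target is linear, monotonic, or more complex.
Install:

pip install pdpbox

ELI5:
ELI5 is a Python package that helps with machine learning interpretability.
Install:

pip install eli5

SHAP:
SHAP is a game-theoretic approach to explaining the output of any machine learning model.
Install:

pip install shap

SEABORN:
Install:

pip install seaborn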
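
After the installs finish, it can help to confirm that every library is importable from the active environment. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the `REQUIRED` list are illustrative assumptions, not part of any of these packages. Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names used in this article; graphviz is only needed for the tree plot.
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn",
            "eli5", "shap", "pdpbox", "graphviz"]


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means everything needed by the article can be imported.
print("Missing packages:", missing_packages(REQUIRED))
```

Running this inside the project1 environment lists whatever still needs a `pip install` or `conda install`.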

III. Project walkthrough:

1. Import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import eli5
from eli5.sklearn import PermutationImportance
import shap
from pdpbox import pdp, info_plots

np.random.seed(123)  # make the NumPy-based randomness reproducible
pd.options.mode.chained_assignment = None  # silence pandas chained-assignment warnings

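2. Data preprocessing and type conversion

1) Load the data

Download the heart disease dataset (extraction code: ykyh)

dt = pd.read_csv("heart.csv")
# Show the first and last five rows; pd.concat also works on pandas 2.x, where DataFrame.append was removed
pd.concat([dt.head(), dt.tail()])

This reads the data and displays the first and last five rows.

2) Missing values

dt.isnull().sum()

The output shows that no column has missing values.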

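3) Rename the columns

dt.columns = ['age', 'sex', 'chest_pain_type', 'resting_blood_pressure', 'cholesterol',
              'fasting_blood_sugar', 'rest_ecg', 'max_heart_rate_achieved',
              'exercise_induced_angina', 'st_depression', 'st_slope',
              'num_major_vessels', 'thalassemia', 'target']

Meaning of each field (listed under its original column name):
age: age
sex: sex (1 = male; 0 = female)
cp: chest pain type, four possible values
  1: typical angina
  2: atypical angina
  3: non-anginal pain
  4: asymptomatic
trestbps: resting blood pressure
chol: serum cholesterol
fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg: resting electrocardiogram result (values 0, 1, 2)
thalach: maximum heart rate achieved
exang: exercise-induced angina (1 = yes; 0 = no)
oldpeak: ST depression induced by exercise relative to rest (the ST segment is a position on the ECG trace)
slope: slope of the peak exercise ST segment
  1: upsloping
  2: flat
  3: downsloping
ca: number of major vessels (0-3)
thal: a blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversable defect)
target: whether heart disease is present (0 = no; 1 = yes)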
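
The field list above uses the original short names, while the code assigns descriptive names by position. A name-based mapping makes that pairing explicit. This is a sketch that assumes the short names appear in the file in the same order as the two lists above; `COLUMN_RENAMES` and `rename_columns` are hypothetical names introduced for illustration.

```python
# Short name from the field list -> descriptive name assigned to dt.columns.
COLUMN_RENAMES = {
    "age": "age",
    "sex": "sex",
    "cp": "chest_pain_type",
    "trestbps": "resting_blood_pressure",
    "chol": "cholesterol",
    "fbs": "fasting_blood_sugar",
    "restecg": "rest_ecg",
    "thalach": "max_heart_rate_achieved",
    "exang": "exercise_induced_angina",
    "oldpeak": "st_depression",
    "slope": "st_slope",
    "ca": "num_major_vessels",
    "thal": "thalassemia",
    "target": "target",
}


def rename_columns(columns):
    """Translate short column names, failing loudly if an unexpected name shows up."""
    unknown = [c for c in columns if c not in COLUMN_RENAMES]
    if unknown:
        raise KeyError(f"unexpected columns: {unknown}")
    return [COLUMN_RENAMES[c] for c in columns]


print(rename_columns(["cp", "thal", "ca"]))
# ['chest_pain_type', 'thalassemia', 'num_major_vessels']
```

With pandas the same dictionary can be passed directly as `dt = dt.rename(columns=COLUMN_RENAMES)`, which does not depend on the column order in the file.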

4) Convert the fields

# Replace the integer codes with readable labels; values not listed in a mapping are left unchanged.
# The thalassemia codes mapped here (1, 2, 3) differ from the 3/6/7 given in the field list above,
# so check dt['thalassemia'].unique() against your copy of heart.csv.
dt['sex'] = dt['sex'].replace({0: 'female', 1: 'male'})
dt['chest_pain_type'] = dt['chest_pain_type'].replace({
    1: 'typical angina', 2: 'atypical angina', 3: 'non-anginal pain', 4: 'asymptomatic'})
dt['fasting_blood_sugar'] = dt['fasting_blood_sugar'].replace({
    0: 'lower than 120mg/dl', 1: 'greater than 120mg/dl'})
dt['rest_ecg'] = dt['rest_ecg'].replace({
    0: 'normal', 1: 'ST-T wave abnormality', 2: 'left ventricular hypertrophy'})
dt['exercise_induced_angina'] = dt['exercise_induced_angina'].replace({0: 'no', 1: 'yes'})
dt['st_slope'] = dt['st_slope'].replace({1: 'upsloping', 2: 'flat', 3: 'downsloping'})
dt['thalassemia'] = dt['thalassemia'].replace({
    1: 'normal', 2: 'fixed defect', 3: 'reversable defect'})
dt.dtypes

Convert the field types:

categorical_columns = ['sex', 'chest_pain_type', 'fasting_blood_sugar', 'rest_ecg',
                       'exercise_induced_angina', 'st_slope', 'thalassemia']
for col in categorical_columns:
    dt[col] = dt[col].astype('object')

Generate the dummy variables:

dt = pd.get_dummies(dt, drop_first=True)
dt.head()
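
The `drop_first=True` argument explains why later steps refer to columns such as `st_slope_flat` and `st_slope_upsloping` but never to a column for the first category. The pure-Python sketch below imitates that behaviour on a hypothetical list of labels; it is only an illustration of the idea, not how pandas implements `get_dummies`.

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode labels, omitting the first category in sorted order."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped first category becomes the implicit baseline
    return [{f"{prefix}_{cat}": int(v == cat) for cat in kept} for v in values]


rows = one_hot_drop_first(["male", "female", "male"], prefix="sex")
print(rows)
# [{'sex_male': 1}, {'sex_male': 0}, {'sex_male': 1}]
```

Dropping one category per field avoids perfectly redundant columns: a row with `sex_male = 0` can only be female, so a separate `sex_female` column would carry no extra information.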

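3. Building and interpreting the random forest model

1) Split the data

# Hold out 20% of the rows for testing; the axis is named explicitly because
# the positional form dt.drop('target', 1) no longer works on pandas 2.x
X_train, X_test, y_train, y_test = train_test_split(
    dt.drop(columns='target'), dt['target'], test_size=0.2, random_state=10)

2) Build the model

model = RandomForestClassifier(max_depth=5)
model.fit(X_train, y_train)

estimator = model.estimators_[1]  # one individual tree from the forest
feature_names = list(X_train.columns)

# Readable class labels for the training targets
y_train_str = y_train.astype('str')
y_train_str[y_train_str == '0'] = 'no disease'
y_train_str[y_train_str == '1'] = 'disease'
y_train_str = y_train_str.values
y_train_str[:5]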

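4. Decision tree visualization

Download the data (extraction code: h0cz)

import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

df_t = pd.read_excel(r'heart.xlsx')
arr_t = df_t.values.astype(np.float32)
# Every column except the last is a feature; the last column is the label
Xtrain, Xtest, Ytrain, Ytest = train_test_split(arr_t[:, :-1], arr_t[:, -1], test_size=0.3)

Instantiate the decision tree, train the model, and check the accuracy:

dtc = tree.DecisionTreeClassifier(criterion="entropy", max_depth=4, min_samples_split=10).fit(Xtrain, Ytrain)
score = dtc.score(Xtest, Ytest)
score

The resulting accuracy was 0.8021978021978022. Because train_test_split is called without random_state here, the value changes from run to run.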
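
The tree above is built with `criterion="entropy"`, which scores candidate splits by how much they reduce the Shannon entropy of the class labels. The sketch below shows that calculation on hypothetical labels; `entropy` and `information_gain` are illustrative helpers and not scikit-learn functions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    weight_left = len(left) / len(parent)
    weight_right = len(right) / len(parent)
    return entropy(parent) - (weight_left * entropy(left) + weight_right * entropy(right))


# Hypothetical labels: 1 = disease, 0 = no disease
parent = [1, 1, 1, 0, 0, 0]
print(entropy(parent))                                 # 1.0
print(information_gain(parent, [1, 1, 1], [0, 0, 0]))  # 1.0
```

A split that separates the two classes perfectly recovers the full 1 bit of entropy, while a split that leaves both children as mixed as the parent has a gain of zero.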

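Graphviz must be installed before the tree can be visualized:
a. Download Graphviz from the official site and run the matching .exe installer
b. pip uninstall graphviz
c. conda install python-graphviz
d. Configure the environment variables
User Path: C:\Program Files \Graphviz2.38\bin
System Path: C:\Program Files \Graphviz2.38\bin\dot.exe

Now visualize the tree:

import graphviz

# class_names follow ascending label order; per the field list, 0 = no disease and 1 = disease
graph_tree = graphviz.Source(tree.export_graphviz(
    dtc,
    feature_names=list(df_t.columns[:-1]),
    class_names=['no disease', 'disease'],
    filled=True,
    rounded=True))
graph_tree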

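5. Classification metrics based on the confusion matrix

1) Confusion matrix

y_pred_bin = model.predict(X_test)  # predicted class labels from the random forest

# Use a name that does not shadow sklearn's confusion_matrix function
cm = confusion_matrix(y_test, y_pred_bin)
cm

2) Compute sensitivity and specificity

# scikit-learn layout: rows are true classes, columns are predicted classes,
# so cm[0, 0] = TN, cm[0, 1] = FP, cm[1, 0] = FN, cm[1, 1] = TP
sensitivity = cm[1, 1] / (cm[1, 1] + cm[1, 0])  # TP / (TP + FN)
print('Sensitivity : ', sensitivity)

specificity = cm[0, 0] / (cm[0, 0] + cm[0, 1])  # TN / (TN + FP)
print('Specificity : ', specificity)

[Figure: printed sensitivity and specificity values]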
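
Sensitivity is the share of patients with heart disease that the model flags, and specificity is the share of healthy patients that it clears. The sketch below computes both directly from label lists, which is a handy cross-check on any confusion-matrix indexing. The labels are hypothetical and `sensitivity_specificity` is an illustrative helper.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # diseased, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # diseased, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, flagged
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels: 4 patients with disease, 5 without
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)  # 0.75 0.8
```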

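3) Plot the ROC curve

y_pred_quant = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1 (disease)

fpr, tpr, thresholds = roc_curve(y_test, y_pred_quant)

fig, ax = plt.subplots()
ax.plot(fpr, tpr)
ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c=".3")  # chance diagonal
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.rcParams['font.size'] = 12
plt.title('ROC curve for heart disease classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

Final ROC curve value: [figure]

By the usual rule of thumb for ROC curves, a score above 0.90 is considered excellent, so this project's result is respectable.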
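
The score that this rule of thumb refers to is the area under the ROC curve (AUC). The sketch below builds the curve by sweeping a threshold over hypothetical scores and integrates it with the trapezoid rule; `roc_points` and `auc_trapezoid` are illustrative helpers, not the scikit-learn implementation.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists from sweeping the threshold over every distinct score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    fpr, tpr = [0.0], [0.0]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the curve using the trapezoid rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))


# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr_pts, tpr_pts = roc_points(y_true, scores)
print(auc_trapezoid(fpr_pts, tpr_pts))  # 0.75
```

In the article's own setup the same number is available from the already imported helper as `auc(fpr, tpr)`.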

6. Plotting and interpreting partial dependence plots (PDP)

1) Permutation importance

perm = PermutationImportance(model, random_state=1).fit(X_test, y_test)
eli5.show_weights(perm, feature_names=X_test.columns.tolist())
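
Permutation importance measures how much a model's score drops when the values of one feature are shuffled across rows, which breaks the link between that feature and the target. The sketch below shows the idea with a hypothetical toy model and data; it is not eli5's implementation, and the function names are illustrative.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction equals the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, n_features, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the labels
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical toy model: predicts disease when feature 0 exceeds 0.5 and ignores feature 1
def toy_predict(row):
    return int(row[0] > 0.5)


rows = [[0.1, 5.0], [0.9, 3.0], [0.2, 8.0], [0.8, 1.0], [0.3, 2.0], [0.7, 9.0]]
labels = [toy_predict(r) for r in rows]
importances = permutation_importance(toy_predict, rows, labels, n_features=2)
print(importances)
```

The ignored feature gets an importance of exactly zero, while shuffling the feature the model relies on lowers the accuracy.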

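2) One-dimensional PDP

Partial dependence describes the relationship between one feature and the target y, and is usually shown as a Partial Dependence Plot (PDP). In other words, the PDP value at X1 is the average prediction of the fitted model after the first variable in every row of the training set has been replaced with X1.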
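
The definition above translates almost directly into code: force one feature to a fixed value in every row, predict, and average. The sketch below does this for a hypothetical toy model; `partial_dependence` and `toy_proba` are illustrative names, and pdpbox itself adds grid selection and plotting on top of this idea.

```python
def partial_dependence(predict_proba, rows, feature_index, grid):
    """Average predicted probability when `feature_index` is forced to each grid value."""
    curve = []
    for value in grid:
        modified = [r[:feature_index] + [value] + r[feature_index + 1:] for r in rows]
        curve.append(sum(predict_proba(r) for r in modified) / len(modified))
    return curve


# Hypothetical toy model: risk falls as feature 0 grows and rises with feature 1
def toy_proba(row):
    return min(1.0, max(0.0, 0.5 - 0.1 * row[0] + 0.2 * row[1]))


rows = [[0, 1.0], [1, 0.0], [2, 1.0], [3, 0.0]]
curve = partial_dependence(toy_proba, rows, feature_index=0, grid=[0, 1, 2, 3])
print(curve)  # roughly 0.6, 0.5, 0.4, 0.3
```

Each point on the curve averages over the other features as they actually occur in the data, which is why a PDP shows the marginal rather than the conditional effect of a feature.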
Inspect the relationship between a single feature and the target.
Field num_major_vessels:

base_features = dt.columns.values.tolist()
base_features.remove('target')

feat_name = 'num_major_vessels'
# pdp_isolate and pdp_plot belong to the older function-style pdpbox API (releases before 0.3.0)
pdp_dist = pdp.pdp_isolate(model=model, dataset=X_test, model_features=base_features, feature=feat_name)
pdp.pdp_plot(pdp_dist, feat_name)
plt.show()

Field age:

feat_name = 'age'
pdp_dist = pdp.pdp_isolate(model=model, dataset=X_test, model_features=base_features, feature=feat_name)
pdp.pdp_plot(pdp_dist, feat_name)
plt.show()

Field st_depression:

feat_name = 'st_depression'
pdp_dist = pdp.pdp_isolate(model=model, dataset=X_test, model_features=base_features, feature=feat_name)
pdp.pdp_plot(pdp_dist, feat_name)
plt.show()

3) 2D PDP plots

# Interaction between an upsloping ST segment and ST depression
inter1 = pdp.pdp_interact(model=model, dataset=X_test, model_features=base_features,
                          features=['st_slope_upsloping', 'st_depression'])
pdp.pdp_interact_plot(pdp_interact_out=inter1,
                      feature_names=['st_slope_upsloping', 'st_depression'],
                      plot_type='contour')
plt.show()

# Interaction between a flat ST segment and ST depression
inter1 = pdp.pdp_interact(model=model, dataset=X_test, model_features=base_features,
                          features=['st_slope_flat', 'st_depression'])
pdp.pdp_interact_plot(pdp_interact_out=inter1,
                      feature_names=['st_slope_flat', 'st_depression'],
                      plot_type='contour')
plt.show()

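7. AutoML machine learning: using and interpreting the SHAP library

Before a model can be explained with SHAP, an explainer has to be created; this project uses the tree explainer.
Pass the random forest model to the explainer, then pass the feature data to it to compute the SHAP values.

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# shap_values[1] selects the values for class 1 (disease); this indexing assumes
# a shap release that returns one array per class
shap.summary_plot(shap_values[1], X_test, plot_type="bar")

shap.summary_plot(shap_values[1], X_test)

a. Each row represents one feature, and the horizontal axis is the SHAP value
b. Each point represents one sample, and its color shows the feature value (red is high, blue is low)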
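
A SHAP value is a Shapley value from cooperative game theory: a feature's average marginal contribution to the prediction over all orders in which features could be revealed. The sketch below computes exact Shapley values by brute force for a hypothetical three-feature model with an all-zero baseline. `TreeExplainer` uses a far more efficient tree-specific algorithm, so this only illustrates the quantity being estimated.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function over feature indices 0..n_features-1."""
    result = []
    for i in range(n_features):
        others = [f for f in range(n_features) if f != i]
        total = 0.0
        for size in range(len(others) + 1):
            # Weight of every coalition of this size that excludes feature i
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                known = set(subset)
                total += weight * (value_fn(known | {i}) - value_fn(known))
        result.append(total)
    return result


# Hypothetical model and patient; the baseline stands in for "feature unknown"
x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]


def toy_model(v):
    return 2 * v[0] + 3 * v[1] + v[0] * v[2]


def value_fn(known):
    """Model output when features in `known` take their real value and the rest the baseline."""
    mixed = [x[i] if i in known else baseline[i] for i in range(len(x))]
    return toy_model(mixed)


phi = shapley_values(value_fn, n_features=3)
print(phi)  # approximately [3.5, 6.0, 1.5]
print(sum(phi), toy_model(x) - toy_model(baseline))  # the contributions add up to the gap
```

The interaction term `x0 * x2` is split evenly between the two features involved, and the contributions sum to the difference between the prediction and the baseline output. The same additivity is what a SHAP force plot visualizes.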

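Individual differences
Examine how the feature values of a single patient affect that patient's prediction.

def heart_disease_risk_factors(model, patient):
    # Explain one patient's prediction with a SHAP force plot for class 1 (disease)
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(patient)
    shap.initjs()
    return shap.force_plot(explainer.expected_value[1], shap_values[1], patient)

data_for_prediction = X_test.iloc[1, :].astype(float)
heart_disease_risk_factors(model, data_for_prediction)

data_for_prediction = X_test.iloc[3, :].astype(float)
heart_disease_risk_factors(model, data_for_prediction)

* P1: the predicted probability of heart disease is 29% (the baseline is 57%); most of the influence comes from the blue factors that push the prediction down, such as ca, thal_fixed_defect and oldpeak.
* P3: the predicted probability is 82%; the main influences include sex_male = 0, thalach = 143, and so on.
Comparing patients in this way shows how both the predicted probability and the main contributing factors differ from one patient to the next.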

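Compare the SHAP values of a single feature with that feature's values across all samples in the dataset:

shap.dependence_plot('num_major_vessels', shap_values[1], X_test, interaction_index="st_depression")

Multi-sample visual exploration
Visualize how the different features affect the first 50 patients. The examples here order the samples by "sample order by similarity" and by age.

# Compute the SHAP values and pass the feature values for the same 50 rows
first_50 = X_test.iloc[:50]
shap_values_50 = explainer.shap_values(first_50)
shap.force_plot(explainer.expected_value[1], shap_values_50[1], first_50)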

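Summary

Runtime environment: Jupyter Notebook
This heart disease prediction and visualization project builds its model with a random forest. The visualization relies mainly on graphviz and shap, both of which are solid visualization libraries for machine learning work.
I ran into quite a few problems while installing graphviz and pdpbox, and resolved all of them by consulting the official documentation. Before starting this project, create a new conda environment: it makes adding libraries easier and avoids conflicts between them.

****** Readers are welcome to leave comments ******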
