Diabetes Classification Using Decision Trees in R
Article Outline
- What is a decision tree?
- Why use them?
- Data Background
- Descriptive Statistics
- Decision Tree Training and Evaluation
- Decision Tree Pruning
- Hyperparameter Tuning
What is a decision tree?
A decision tree is a flowchart-like representation. The classification and regression tree (a.k.a. decision tree) algorithm is usually attributed to Breiman et al. (1984), but that was certainly not the earliest work. Wei-Yin Loh of the University of Wisconsin has written about the history of decision trees. You can read it in "Fifty Years of Classification and Regression Trees".
In a decision tree, the top node is called the "root node" and the bottom nodes "terminal nodes". The other nodes are called "internal nodes"; each internal node holds a binary split condition, while each leaf node carries an associated class label.
Photo by Saed Sayad on saedsayad.com
A classification tree uses a split condition to predict a class label based on the provided input variables. The splitting process starts from the top node (the root node), and at each node it checks whether the supplied input value continues recursively down the left or the right branch according to the splitting criterion (Gini impurity or information gain). The process terminates when a leaf (terminal) node is reached.
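To make the splitting criteria concrete, here is a minimal sketch of the two impurity measures (illustrative only; rpart computes these internally), where p is the vector of class proportions in a node:
# Gini impurity and entropy for a node with class proportions p
gini    <- function(p) 1 - sum(p^2)
entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))

gini(c(1, 0));     entropy(c(1, 0))      # pure node: 0 and 0
gini(c(0.5, 0.5)); entropy(c(0.5, 0.5))  # maximally mixed node: 0.5 and 1
A split is preferred when it lowers the weighted impurity of the resulting child nodes.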
Why use them?
A single decision tree-based model is easy to build, plot and interpret, which is what makes this algorithm so popular. You can use it for classification as well as regression tasks.
Data Background
In this example, we are going to use the Pima Indian Diabetes 2 data set obtained from the UCI Repository of machine learning databases (Newman et al. 1998).
This data set is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the data set is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the data set. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
The Pima Indian Diabetes 2 data set is the refined version (all missing values were assigned as NA) of the Pima Indian diabetes data. The data set contains the following independent and dependent variables.
Independent variables (symbol: I)
I1: pregnant: Number of times pregnant
I2: glucose: Plasma glucose concentration (glucose tolerance test)
I3: pressure: Diastolic blood pressure (mm Hg)
I4: triceps: Triceps skinfold thickness (mm)
I5: insulin: 2-hour serum insulin (mu U/ml)
I6: mass: Body mass index (weight in kg / (height in m)^2)
I7: pedigree: Diabetes pedigree function
I8: age: Age (years)
Dependent variable (symbol: D)
D1: diabetes: diabetes case (pos/neg)
Aim of the Modelling
- fitting a decision tree classification model that accurately predicts whether or not the patients in the data set have diabetes
- pruning the decision tree to reduce overfitting
- tuning the decision tree hyperparameters
Loading relevant libraries
The first step of data analysis starts with loading relevant libraries.
library(mlbench)    # Diabetes dataset
library(rpart)      # Decision tree
library(rpart.plot) # Plotting decision tree
library(caret)      # Accuracy estimation
library(Metrics)    # For different model evaluation metrics
Loading dataset
The very next step is to load the data into the R environment. As the data set ships with the mlbench package, it can be loaded by calling data().
# load the diabetes dataset
data(PimaIndiansDiabetes2)
Data Preprocessing
The next step is to perform exploratory analysis. First, we remove the missing values using the na.omit() function, then print the data types using the glimpse() method from the dplyr library. You can see that all the variables except the dependent variable (diabetes: categorical/factor) are of double type.
Diabetes <- na.omit(PimaIndiansDiabetes2) # Data for modeling
dplyr::glimpse(Diabetes)
Data types
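The outline also promises descriptive statistics; a quick way to get them with base R alone (a sketch; packages such as skimr or psych offer richer summaries) is:
# Descriptive statistics for the complete cases
summary(Diabetes)

# Class balance of the dependent variable
table(Diabetes$diabetes)
prop.table(table(Diabetes$diabetes))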
Train and Test Split
The next step is to split the dataset into 80% train and 20% test. Here, we use the sample() method to randomly assign each observation (sampling the group labels with replacement) to the train or the test split. Next, based on this index, we split out the train and test data.
set.seed(123)
index <- sample(2, nrow(Diabetes), prob = c(0.8, 0.2), replace = TRUE)
Diabetes_train <- Diabetes[index == 1, ] # Train data
Diabetes_test <- Diabetes[index == 2, ]  # Test data
The train data includes 318 observations and the test data 74 observations. Both contain 9 variables.
print(dim(Diabetes_train))
print(dim(Diabetes_test))
Train and test dimensions
Model Training
The next step is model training and the evaluation of model performance.
Training a Decision Tree
For decision tree training, we will use the rpart() function from the rpart library. The arguments include the model formula, the data and the method.
formula = diabetes ~ ., i.e., diabetes is predicted from all independent variables.
Here, the method argument should be set to "class" for a classification task.
# Train a decision tree model
Diabetes_model <- rpart(formula = diabetes ~ .,
                        data = Diabetes_train,
                        method = "class")
Model Plotting
The main advantage of a tree-based model is that you can plot the tree structure and figure out the decision mechanism.
# type = 0: draw a split label at each split and a node label at each leaf
# yesno = 2: print yes/no at each split
# extra = 0: no extra information
rpart.plot(x = Diabetes_model, yesno = 2, type = 0, extra = 0)
Diabetes_model tree structure
Model Performance Evaluation
The next step is to see how our trained model performs on the test/unseen dataset. To predict the test data classes, we need to supply the model object, the test dataset and type = "class" to the predict() function.
# Class prediction
class_predicted <- predict(object = Diabetes_model,
                           newdata = Diabetes_test,
                           type = "class")
(a) Confusion matrix
To evaluate the test performance, we are going to use confusionMatrix() from the caret library. We can observe that out of 74 observations it misclassifies 17. The model has achieved about 77.03% accuracy using a single decision tree.
# Generate a confusion matrix for the test data
confusionMatrix(data = class_predicted,
                reference = Diabetes_test$diabetes)
Diabetes_model test evaluation statistics
(b) Test accuracy
We can also supply the predicted class labels and the original test dataset labels to the accuracy() function to estimate the model accuracy.
accuracy(actual = Diabetes_test$diabetes,
         predicted = class_predicted)
Diabetes_model test accuracy
Splitting Criteria Based Model Comparison
While building the model, the decision tree algorithm uses a splitting criterion. Two popular splitting criteria are used in decision trees: one is called "gini" and the other "information gain". Here, we compare the model performance on the test set after training with the different split criteria. The splitting criterion is supplied as a list via the parms argument.
# Model training based on the gini splitting criterion
Diabetes_model1 <- rpart(formula = diabetes ~ .,
                         data = Diabetes_train,
                         method = "class",
                         parms = list(split = "gini"))

# Model training based on the information gain splitting criterion
Diabetes_model2 <- rpart(formula = diabetes ~ .,
                         data = Diabetes_train,
                         method = "class",
                         parms = list(split = "information"))
Model Evaluation on Test Data
After model training, the next step is to predict the class labels of the test dataset.
# Generate class predictions on the test data using the gini-based model
pred1 <- predict(object = Diabetes_model1,
                 newdata = Diabetes_test,
                 type = "class")

# Generate class predictions on the test data using the information gain-based model
pred2 <- predict(object = Diabetes_model2,
                 newdata = Diabetes_test,
                 type = "class")
Prediction Accuracy Comparison
Next, we compare the accuracy of the models. Here, we can observe that the "gini"-based splitting criterion provides a more accurate model than "information"-based splitting.
# Compare classification accuracy on test data
accuracy(actual = Diabetes_test$diabetes,
         predicted = pred1)
accuracy(actual = Diabetes_test$diabetes,
         predicted = pred2)
Diabetes_model1 test accuracy
Diabetes_model2 test accuracy
The initial model (Diabetes_model) and the "gini"-based model (Diabetes_model1) provide the same accuracy, as rpart uses "gini" as its default splitting criterion.
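A quick sanity check of this (a sketch) is to confirm that the two models make identical test predictions:
# With default settings rpart already splits on gini, so Diabetes_model and
# Diabetes_model1 should behave identically
identical(predict(Diabetes_model,  newdata = Diabetes_test, type = "class"),
          predict(Diabetes_model1, newdata = Diabetes_test, type = "class"))
# Expected: TRUE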
Decision Tree Pruning
The initial model (Diabetes_model) plot shows that the tree structure is deep and fragile, which reduces its interpretability in the decision-making process. Here we explore ways to make the tree more interpretable without losing performance. One way of doing this is to prune away the fragile part of the tree (the part that contributes to model overfitting).
(a) Plotting error vs. the complexity parameter
The decision tree has a parameter called the complexity parameter (cp), which controls the size of the decision tree. If the cost of adding another split from the current node exceeds the value of cp, then tree building does not continue. We can generate the cp vs. error plot using the plotcp() function.
# Plotting the Cost Parameter (CP) table
plotcp(Diabetes_model1)
Error vs. CP plot
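As an aside, cp can also be fixed at training time through rpart.control(); a short sketch (the value 0.05 is arbitrary, chosen only to illustrate the effect):
# A higher cp demands a larger improvement per split, so the tree stays smaller
shallow_tree <- rpart(formula = diabetes ~ .,
                      data = Diabetes_train,
                      method = "class",
                      control = rpart.control(cp = 0.05))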
(b) Generating the complexity parameter table
We can also generate the cp table by calling model$cptable. Here, you can observe that xerror reaches its minimum at a CP value of 0.025.
# Printing the Cost Parameter (CP) table
print(Diabetes_model1$cptable)
(c) Obtaining an optimal pruned model
We can filter out the optimal CP value by identifying the index of the minimum xerror and using it to index the CP table.
# Retrieve the optimal cp value based on cross-validated error
index <- which.min(Diabetes_model1$cptable[, "xerror"])
cp_optimal <- Diabetes_model1$cptable[index, "CP"]
The next step is to prune the tree using the prune() function, supplying the optimal CP value. If we plot the pruned tree, we can observe that it is now very simple and easy to interpret:
A person with a glucose level above 128 and an age greater than 25 is designated diabetes-positive; otherwise negative.
# Pruning the tree based on the optimal CP value
Diabetes_model1_opt <- prune(tree = Diabetes_model1, cp = cp_optimal)
rpart.plot(x = Diabetes_model1_opt, yesno = 2, type = 0, extra = 0)
(d) Pruned tree performance
The next step is to check whether the pruned tree performs comparably or whether performance has been compromised. After the performance check, we can see that the pruned tree is as capable as the earlier fragile tree, but now it is simple and easy to interpret.
pred3 <- predict(object = Diabetes_model1_opt,
                 newdata = Diabetes_test,
                 type = "class")
accuracy(actual = Diabetes_test$diabetes,
         predicted = pred3)
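To sanity-check the rule read off the pruned tree, we can score two hypothetical patients (a sketch: the glucose and age values are invented, and the remaining columns are filled with training medians purely to satisfy predict(), since the pruned tree ignores them):
# Build two hypothetical patients; only glucose and age matter here
predictors   <- setdiff(names(Diabetes_train), "diabetes")
new_patients <- as.data.frame(lapply(Diabetes_train[predictors], median))
new_patients <- new_patients[c(1, 1), ]   # duplicate the all-median row
new_patients$glucose <- c(150, 100)       # above vs. below the 128 split
new_patients$age     <- c(40, 22)         # above vs. below the 25 split
predict(Diabetes_model1_opt, newdata = new_patients, type = "class")
# Expected: "pos" for the first patient, "neg" for the second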
Decision Tree Hyperparameter Tuning
Next, we try to increase the performance of the decision tree model by tuning its hyperparameters. rpart() offers several hyperparameters, but here we will tune two important ones: minsplit and maxdepth.
minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted.
maxdepth: The maximum depth of any node of the final tree.
(a) Generating hyperparameter grid
First, we generate a sequence from 1 to 20 for both minsplit and maxdepth. Then we build a grid of parameter combinations using the expand.grid() function.
##############################
# Hyperparameter grid search
##############################
# minsplit: the minimum number of observations that must exist in a node
#           for a split to be attempted
# maxdepth: the maximum depth of any node of the final tree
minsplit <- seq(1, 20, 1)
maxdepth <- seq(1, 20, 1)

# Generate a search grid
hyperparam_grid <- expand.grid(minsplit = minsplit, maxdepth = maxdepth)
(b) Training grid-based models
The next step is to train a model for each hyperparameter combination in the grid. This can be done through the following steps:
- using a for loop to iterate through each hyperparameter combination in the grid and supplying it to the rpart() function for model training
- storing each model in an empty list (diabetes_models)
num_models <- nrow(hyperparam_grid)

# Create an empty list
diabetes_models <- list()

# Loop over the rows of hyperparam_grid to train the grid of models
for (i in 1:num_models) {
  minsplit <- hyperparam_grid$minsplit[i]
  maxdepth <- hyperparam_grid$maxdepth[i]
  # Train a model and store it in the list
  diabetes_models[[i]] <- rpart(formula = diabetes ~ .,
                                data = Diabetes_train,
                                method = "class",
                                minsplit = minsplit,
                                maxdepth = maxdepth)
}
(c) Computing test accuracy
The next step is to check each model's performance on the test data and retrieve the best model. This can be done through the following steps:
- using a for loop to iterate through each model in the list, predicting on the test data and computing accuracy
- storing each model's accuracy in an empty vector (accuracy_values)
num_models <- length(diabetes_models)

# Create an empty vector to store accuracy values
accuracy_values <- c()

# Use a for loop to estimate each model's accuracy
for (i in 1:num_models) {
  # Retrieve model i from the list
  model <- diabetes_models[[i]]
  # Generate predictions on the test data
  pred <- predict(object = model,
                  newdata = Diabetes_test,
                  type = "class")
  # Compute test accuracy and add it to accuracy_values
  accuracy_values[i] <- accuracy(actual = Diabetes_test$diabetes,
                                 predicted = pred)
}
(d) Identifying the best model
The next step is to retrieve the best performing model (maximum accuracy) and print its hyperparameters using model$control. We can observe that with a minsplit of 17 and a maxdepth of 6, the model provides the most accurate results when evaluated on the unseen/test dataset.
# Identify the model with maximum accuracy
best_model <- diabetes_models[[which.max(accuracy_values)]]

# Print the hyperparameters of the best model
best_model$control
(e) Best model evaluation on test data
After identifying the best performing model, the next step is to see how accurate it is. With the best hyperparameters, the model achieves an accuracy of 81.08%, a clear improvement over the initial 77.03%.
# Best model accuracy on test data
pred <- predict(object = best_model,
                newdata = Diabetes_test,
                type = "class")
accuracy(actual = Diabetes_test$diabetes,
         predicted = pred)
(f) Best model plot
Now it is time to plot the best model.
rpart.plot(x = best_model, yesno = 2, type = 0, extra = 0)
Best model's layout
Even though the above plot shows the best performing model, it still looks a little fragile. So your next task would be to prune it and see whether you get a more interpretable decision tree; one possible starting point is sketched below.
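A sketch of that exercise, reusing the pruning recipe from the earlier section:
# Prune the tuned model at its cross-validation-optimal cp value
idx <- which.min(best_model$cptable[, "xerror"])
best_model_pruned <- prune(tree = best_model,
                           cp = best_model$cptable[idx, "CP"])
rpart.plot(x = best_model_pruned, yesno = 2, type = 0, extra = 0)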
I hope you learned something new. See you next time!
Note
This article was first published on onezero.blog, a data science, machine learning and research blogging platform that I maintain.
Read more by visiting my personal blog website: https://onezero.blog/
If you learned something new and liked this article, say 👋 / follow me on onezero.blog (my personal blogging website), Twitter, LinkedIn, YouTube and Github.
[1] Breiman, L., Friedman, J., Stone, C.J. and Olshen, R.A. (1984). Classification and Regression Trees. CRC Press.
[2] Loh, W. (2014). Fifty Years of Classification and Regression Trees.
[3] Newman, C.B.D. & Merz, C. (1998). UCI Repository of machine learning databases, Technical report, University of California, Irvine, Dept. of Information and Computer Sciences.
Originally published at https://onezero.blog on August 2, 2020.
Translated from: https://towardsdatascience.com/diabetes-classification-using-decision-trees-c4fd6dd7241a