Decision Tree Classification in Python: Everything You Need to Know
What Is a Decision Tree?
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.
Decision Trees (DTs) are a non-parametric supervised learning method used for both classification and regression. A decision tree learns a set of if-then-else decision rules from the data and uses them to approximate the target. The deeper the tree, the more complex the decision rules and the more closely the model fits the training data, which also raises the risk of overfitting. The decision tree builds classification or regression models in the form of a tree structure, which is where the name CART (Classification and Regression Trees) comes from. It breaks a dataset down into smaller and smaller subsets while incrementally building the associated decision tree. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches. A leaf node represents a classification or decision. The topmost decision node in the tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.
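To make the if-then-else picture concrete, here is a minimal sketch of a hand-written two-level tree over two of the banknote features used later in this article. The function name classify_note and both thresholds are hypothetical, chosen only for illustration; a trained tree learns its own thresholds from the data.

```python
def classify_note(variance: float, skewness: float) -> int:
    """Toy decision tree written as nested if-then-else rules.

    The thresholds (0.3 and 5.0) are made up for illustration;
    they are not learned from the BankNote dataset.
    """
    if variance <= 0.3:        # root node: first question asked
        if skewness <= 5.0:    # decision node on the left branch
            return 1           # leaf node: predicted class label
        return 0               # leaf node
    return 0                   # leaf node on the right branch


print(classify_note(-1.2, 2.0))
print(classify_note(2.5, 8.0))
```

Each path from the root to a leaf is one rule, for example "variance <= 0.3 and skewness <= 5.0 gives class 1".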
When Is a Decision Tree Used?
How to Make a Decision Tree?
Step 1
Calculate the entropy of the target.
Step 2
Split the dataset on each attribute in turn. Calculate the entropy of each branch, then add the branch entropies in proportion to branch size to get the total entropy for the split. Subtract this total from the entropy before the split. The result is the Information Gain, or decrease in entropy.
Step 3
Choose the attribute with the largest information gain as the decision node, divide the dataset by its branches, and repeat the same process on every branch.
Entropy and Information Gain Calculations
Entropy
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
Where,
- S is the total sample space,
- P(yes) is the probability of yes, and P(no) = 1 - P(yes) is the probability of no.
If the number of yes equals the number of no, i.e. P(yes) = P(no) = 0.5:
Entropy(S) = -0.5 log2 0.5 - 0.5 log2 0.5 = 1
If S contains all yes or all no, i.e. P(yes) = 1 or 0:
Entropy(S) = 0
For example, when P(yes) = 1, i.e. YES = Total Sample(S), and taking 0 log2 0 = 0:
E(S) = -1 log2 1
E(S) = 0
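The two boundary cases above can be checked numerically. The sketch below is a small standard-library helper (the name entropy is ours, not from any library) that computes base-2 entropy from a list of labels. It also works for more than two classes, which is what a multi-class tree needs.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # p * log2(1 / p) equals -p * log2(p); classes with zero count
    # never appear in `counts`, so the 0 * log2(0) case is skipped.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(entropy(["yes", "no", "yes", "no"]))  # 1.0 (evenly mixed)
print(entropy(["yes", "yes", "yes"]))       # 0.0 (pure)
```

With four equally likely classes the same function returns 2.0, so the maximum entropy grows with the number of classes.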
Information Gain
- Measures the reduction in entropy.
- Decides which attribute should be selected as a decision node.
If S is our total collection,
Information Gain = Entropy(S) - [(Weighted Avg) x Entropy(each branch of the split)]
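As a rough illustration of the formula, the sketch below computes the information gain of each attribute on a tiny made-up table and picks the attribute with the largest gain. The helper names (entropy, information_gain) and the toy rows are assumptions for this example; none of them come from the article's dataset or from scikit-learn.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the size-weighted entropy of each branch."""
    branches = defaultdict(list)
    for row in rows:
        branches[row[attribute]].append(row[target])
    total = len(rows)
    weighted = sum(len(b) / total * entropy(b) for b in branches.values())
    return entropy([row[target] for row in rows]) - weighted


# Hypothetical toy data, used only to exercise the formula.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]
gains = {a: information_gain(rows, a, "play") for a in ("outlook", "windy")}
best = max(gains, key=gains.get)  # attribute chosen as the decision node
print(gains, best)
```

Here "outlook" separates the classes perfectly (gain 1.0) while "windy" tells us nothing (gain 0.0), so "outlook" would become the decision node.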
Python Implementation of Decision Tree
We will use the following libraries and the BankNoteAuthentication dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix
%matplotlib inline
# Load the dataset into a DataFrame and display it
bankdata = pd.read_csv("../input/bank-note-authentication-uci-data/BankNote_Authentication.csv")
bankdata
Feature Selection
Here, you need to divide the given columns into two types of variables: the dependent variable (or target variable) and the independent variables (or feature variables).
# Split dataset into features and target variable
feature_cols = ['variance', 'skewness', 'curtosis', 'entropy']
X = bankdata[feature_cols]  # Features
y = bankdata['class']  # Target variable
Splitting Data
To understand model performance, dividing the dataset into a training set and a test set is a good strategy.
Let’s split the dataset by using the function train_test_split(). You need to pass three arguments: the features, the target, and the test set size. Setting random_state makes the split reproducible.
# Split dataset into training set and test set: 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
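To show what a 70/30 hold-out split does conceptually, here is a standard-library sketch. It is not how scikit-learn implements train_test_split(), and it will not reproduce the same rows for the same seed; the helper name simple_train_test_split and the toy data are hypothetical.

```python
import random


def simple_train_test_split(X, y, test_size=0.3, seed=1):
    """Shuffle row indices with a fixed seed, then hold out `test_size` of them."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Index X and y with the same indices so features stay aligned with labels.
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Hypothetical toy data: 10 rows, label is the parity of the first feature.
X = [[i, i * 2] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = simple_train_test_split(X, y)
print(len(X_train), len(X_test))  # 7 3
```

The important properties are that every row lands in exactly one of the two sets and that the model never sees the test rows during training.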
Building Decision Tree Model
Let’s create a Decision Tree Model using Scikit-learn.
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()
# Train Decision Tree classifier
clf = clf.fit(X_train, y_train)
# Predict the response for the test dataset
y_pred = clf.predict(X_test)
Evaluating Model
Let’s estimate how accurately the classifier can predict the class of a banknote.
Accuracy can be computed by comparing the actual test set values with the predicted values.
# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
OUTPUT: Accuracy: 0.9878640776699029
Confusion matrix
A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized with count values and broken down by class. This is the key to the confusion matrix. The confusion matrix shows the ways in which your classification model is confused when it makes predictions. It gives us insight not only into the errors being made by a classifier but, more importantly, into the types of errors being made.
cm = confusion_matrix(y_test, y_pred)
cm
OUTPUT:
array([[231,   4],
       [  1, 176]])
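The accuracy reported earlier can be recovered from this matrix. The sketch below hard-codes the four counts from the output above and relies on scikit-learn's convention that rows are actual classes and columns are predicted classes. The per-class precision and recall are extra quantities computed here for illustration, not figures reported by the article.

```python
# Counts copied from the confusion matrix output above.
# Rows are actual classes, columns are predicted classes.
cm = [[231, 4],
      [1, 176]]

correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal = correct predictions
total = sum(sum(row) for row in cm)
accuracy = correct / total

# Recall per class: correct predictions / actual members of that class.
recall = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
# Precision per class: correct predictions / all predictions of that class.
precision = [cm[i][i] / sum(row[i] for row in cm) for i in range(len(cm))]

print(correct, total)  # 407 412
print(accuracy)
print(recall, precision)
```

The off-diagonal entries show the two kinds of error: 4 test samples of class 0 were predicted as class 1, and 1 sample of class 1 was predicted as class 0.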
Originally published at https://www.numpyninja.com on August 12, 2020.
Translated from: https://medium.com/analytics-vidhya/decision-tree-classification-in-python-everything-you-need-to-know-212160ec03f6