【Python Machine Learning】Decision Tree ID3 Algorithm with Visualized Results and Source Code: Classifying the UCI Caesarian Section Dataset
Decision Tree
- Libraries Used
- Implementation
- Empirical Entropy
- Empirical Entropy Formula
- Conditional Entropy
- Information Gain
- ID3
- Selecting the Attribute with the Largest Information Gain
- Procedure
- Fitting
- Prediction
- Evaluation
- Decision Tree Visualization
- Saving the Decision Tree
- Loading the Decision Tree
- Example Output
- Full Code
- How to Obtain Each Intermediate Result
- Experimental Results (Decision Tree)
- Debug Mode
A decision tree is a decision-analysis method that, given the probabilities of various outcomes, builds a tree to compute the probability that the expected net present value is greater than or equal to zero, evaluates project risk, and judges feasibility; it is a graphical way of applying probability analysis directly. Because the decision branches drawn as a diagram resemble the limbs of a tree, it is called a decision tree. Source: Decision tree, Baidu Baike
The dataset used is the UCI Caesarian Section Classification Dataset.
[Detailed information about the dataset and download link]
- This code implements the ID3 decision-tree algorithm and uses it for prediction.
- The algorithm is written as a class, so the code can be reused and is simpler to work with.
- Setting the logging level to DEBUG prints the detailed result of every step of building the tree.
- The decision tree is visualized using mermaid's text-based diagram format.
Libraries Used
- Python 3
- Pandas
- sklearn (only used for splitting the dataset)
- numpy
Implementation
Empirical Entropy
When the probabilities in the entropy are estimated from data (in particular, by maximum likelihood estimation), the resulting entropy is called the empirical entropy.
Empirical Entropy Formula
$$H = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$$
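As a sanity check, the formula above can be evaluated directly on a label column. The sketch below is a minimal standalone illustration using pandas/numpy, separate from the class method shown next:

```python
import numpy as np
import pandas as pd

def entropy_of(labels):
    """Empirical entropy H = -sum(p_i * log2(p_i)) of a label column."""
    p = pd.Series(labels).value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

# A 50/50 split carries exactly 1 bit of uncertainty.
print(entropy_of([1, 1, 0, 0]))  # → 1.0
```

With 9 positives and 5 negatives out of 14 samples, this gives 0.9402859586706311, matching the value seen in the debug log later in this article.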
```python
def empirical_entropy(self, dataset=None):
    """Compute the empirical entropy
    $$H = -\sum^n_{i=1}p(x_i)log_2(p(x_i))$$
    :return: Float, the empirical entropy
    """
    if dataset is None:
        dataset = self.DataSet
    columns_count = dataset.iloc[:, -1].value_counts()
    entropy = 0
    total_count = columns_count.sum()
    for count in columns_count:
        p = count / total_count
        entropy -= p * np.log2(p)
    return entropy
```

Conditional Entropy
The conditional entropy $H(Y|X)$ measures the remaining uncertainty of a random variable Y given that the random variable X is known.
It is defined as the expectation, over X, of the entropy of the conditional distribution of Y given X:
$$H(Y|X) = \sum_{i=1}^{n} p_i\, H(Y|X=x_i), \qquad p_i = P(X=x_i)$$
Information Gain
Information gain measures how much knowing feature X reduces the uncertainty about the class Y.
In other words: how much the feature helps the classification.
When the classification problem is hard, i.e. when the empirical entropy of the training set is large, information gain tends to be large, and vice versa.
The information gain ratio corrects for this bias and is another criterion for feature selection.
The information gain g(D, A) of feature A on training set D is defined as the difference between the empirical entropy H(D) of D and the empirical conditional entropy H(D|A) of D given A:
$$g(D, A) = H(D) - H(D|A)$$
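Both definitions can be sketched in a few lines. This is a standalone illustration, not the class's own code; `feature` and `target` are hypothetical column names:

```python
import numpy as np
import pandas as pd

def entropy(series):
    """Empirical entropy of a pandas Series."""
    p = series.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(df, feature, target):
    """H(D|A): weighted entropy of the target within each value of the feature."""
    total = len(df)
    return sum(len(sub) / total * entropy(sub[target])
               for _, sub in df.groupby(feature))

def information_gain(df, feature, target):
    """g(D, A) = H(D) - H(D|A)."""
    return entropy(df[target]) - conditional_entropy(df, feature, target)

df = pd.DataFrame({"wind": ["weak", "strong", "weak", "strong"],
                   "play": [1, 0, 1, 0]})
print(information_gain(df, "wind", "play"))  # → 1.0 (feature determines the class)
```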
ID3
In short, ID3 repeatedly picks the attribute that contributes the most to the classification, then, for each value of that attribute, picks the next best attribute.
Selecting the Attribute with the Largest Information Gain
A smaller conditional empirical entropy means the split results are more uniform, i.e. the information gain is larger, so the attribute matters more for the classification.
Here extract_dataset returns the subset of the dataset matching the given condition; it is used to compute the conditional empirical entropy and, from it, the information gain.
```python
def extract_dataset(self, dataset: pd.DataFrame, column, label):
    """Filter the dataset by column and label
    :return: pd.DataFrame, the filtered dataset
    """
    if type(column) == int:
        split_dataset = dataset[dataset.iloc[:, column] == label].drop(dataset.columns[column], axis=1)
    else:
        split_dataset = dataset[dataset.loc[:, column] == label].drop(column, axis=1)
    return split_dataset

def best_empirical_entropy(self, dataset: pd.DataFrame = None):
    """Select the best column of the dataset (largest information gain)
    :param dataset: the dataset to select from
    :return: the chosen column
    """
    if dataset is None:
        dataset = self.DataSet
    columns = dataset.columns[:-1]
    total_count = dataset.shape[0]
    empirical_entropy = self.empirical_entropy(dataset)
    logging.debug(f"now dataset shape is {dataset.shape}, column is {dataset.columns.tolist()}")
    logging.debug(f"empirical_entropy is {empirical_entropy}")
    informationGain_max = -1
    best_column = None
    for column in columns:
        entropy_tmp = 0
        data_counts = dataset.loc[:, column].value_counts()
        data_labels = data_counts.index
        logging.debug(f"now is {column}")
        for label in data_labels:
            split_dataset = self.extract_dataset(dataset, column, label)
            count = split_dataset.shape[0]
            p = count / total_count
            entropy_tmp += p * self.empirical_entropy(split_dataset)
            logging.debug(f"now label is {label}, chooseData shape is {split_dataset.shape}, "
                          f"Ans count: {split_dataset.iloc[:, -1].value_counts().tolist()}, "
                          f"entropy: {self.empirical_entropy(split_dataset)}")
        informationGain = empirical_entropy - entropy_tmp
        logging.debug(f"entropy: {entropy_tmp}, {column} informationGain:{informationGain}")
        if informationGain > informationGain_max:
            best_column = column
            informationGain_max = informationGain
    logging.debug(f"Choose {best_column}:{informationGain_max}")
    return best_column
```

Procedure
Why there may be no attribute left to select: rows with identical attribute values can still have different outcomes.
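The overall recursion can be summarized as follows. This is a simplified standalone sketch of the ID3 procedure described above, not the class's actual `id3` method; the gain-threshold pre-pruning mirrors the `threshold` parameter of `fit`:

```python
import numpy as np
import pandas as pd

def entropy(s):
    p = s.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def id3(df, target, threshold=0.1):
    """Build a nested-dict decision tree {feature: {value: subtree}} from df."""
    labels = df[target]
    # Stop: node is pure, or no features remain -> majority vote.
    if labels.nunique() == 1 or df.shape[1] == 1:
        return labels.mode()[0]
    features = [c for c in df.columns if c != target]
    # Information gain of each remaining feature.
    gains = {f: entropy(labels) - sum(len(g) / len(df) * entropy(g[target])
                                      for _, g in df.groupby(f))
             for f in features}
    best = max(gains, key=gains.get)
    if gains[best] < threshold:  # pre-pruning by the gain threshold
        return labels.mode()[0]
    return {best: {v: id3(g.drop(columns=best), target, threshold)
                   for v, g in df.groupby(best)}}
```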
Fitting
```python
def fit(self, x: pd.DataFrame, y=None, algorithm: str = "id3", threshold=0.1):
    """Fit the decision tree. If y is not given, the last column of x
    must contain the class labels.
    :param x: pd.DataFrame, the attributes (when y is None, the whole dataset including labels)
    :param y: list-like, shape=(-1,), the class labels
    :param algorithm: algorithm to use (currently only ID3)
    :param threshold: information-gain threshold
    :return: root node of the decision tree
    """
    self.check_dataset(x, dimension=2)
    self.check_dataset(y, dimension=1)
    self._threshold = threshold
    dataset = x
    if y is not None:
        dataset.insert(dataset.shape[1], 'DECISION_tempADD', y)
    self.decision_tree = eval("self." + algorithm)(dataset)
    logging.info(f"decision_tree leaf:{self._leafCount}")
    return self.decision_tree
```

Prediction
```python
def predict(self, x: pd.DataFrame):
    """Predict on the given data
    :param x: pd.DataFrame, the input dataset
    :return: classification results
    """
    self.y_predict = x.apply(self._predict_line, axis=1)
    return self.y_predict

def _predict_line(self, line):
    """Private helper used by predict to classify a single row
    :param line: one row of the input dataset
    :return: the predicted class for that row
    """
    tree = self.decision_tree
    while True:
        try:
            if len(tree["next"]) == 1:
                return tree["next"]["其他"]
            else:
                value = line[tree["column"]]
                tree = tree["next"][value]
        except:
            return tree["next"]["其他"]
```

Evaluation
Evaluate the accuracy, precision, and recall of the results.
- The score function only works for binary classification; it does not apply to multi-class problems (although the decision tree itself can still predict them).
- score also expects positive examples to be labeled 1 and negative examples 0.
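For reference, with positives encoded as 1 and negatives as 0 as score expects, the three metrics can be computed as below (a minimal sketch, not the class's own score method):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0  # guard against no predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0     # guard against no actual positives
    return accuracy, precision, recall

print(binary_metrics([1, 1, 0, 0], [1, 0, 1, 0]))  # → (0.5, 0.5, 0.5)
```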
Decision Tree Visualization
The tree is drawn with mermaid's text-based diagram format. Predicted values are merged: when different values of the same attribute lead to the same class, they all point to a single output node in the visualization.
- The visualization function offers two output formats:
- markdown format
- html format (recommended; the tree can be viewed directly in a browser)
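To illustrate the idea, a nested-dict tree can be rendered as mermaid flowchart text roughly like this. This is a standalone sketch using a simplified tree format `{column: {value: subtree}}`, not the class's actual visualization code, and the node ids are invented:

```python
def tree_to_mermaid(tree, counter=None, lines=None):
    """Emit mermaid 'graph TD' text for a nested dict {column: {value: subtree}}.
    Returns (node_id, full_text_so_far)."""
    if counter is None:
        counter, lines = [0], ["graph TD"]
    node = f"n{counter[0]}"
    counter[0] += 1
    if not isinstance(tree, dict):          # leaf: emit a result node
        lines.append(f'{node}["{tree}"]')
        return node, "\n".join(lines)
    column = next(iter(tree))               # internal node: attribute name
    lines.append(f'{node}["{column}"]')
    for value, subtree in tree[column].items():
        child, _ = tree_to_mermaid(subtree, counter, lines)
        lines.append(f"{node} -->|{value}| {child}")  # edge labeled with the value
    return node, "\n".join(lines)

_, text = tree_to_mermaid({"wind": {"weak": "yes", "strong": "no"}})
print(text)
```

Pasting the printed text into any mermaid renderer (or an html page loading mermaid.js) draws the tree.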
Saving the Decision Tree
```python
def save(self, savePath: str):
    # note: the original wrote decisionTree.decision_tree, referencing a
    # global instance; self.decision_tree is what is meant here
    open(savePath, "w").write(str(self.decision_tree))
    logging.info(f"決策樹已保存,位置:{savePath}")
```

Loading the Decision Tree
```python
def load(self, savePath: str):
    tree = eval(open(savePath, "r").read())
    if type(tree) == dict:
        self.decision_tree = tree
    else:
        raise Exception("Load Failed!")
```

Note that `eval` executes whatever the file contains, so only load tree files from a trusted source.

Example Output
Sample diagram; not the classification result for this dataset.
Full Code
How to Obtain Each Intermediate Result
If you do not want that much step-by-step detail, set the level in logging.basicConfig at the top of the file to INFO.
That is, change:
```python
logging.basicConfig(level=logging.DEBUG, format="%(asctime)s-[%(name)s]\t[%(levelname)s]\t[%(funcName)s]: %(message)s")
```
to:
```python
logging.basicConfig(level=logging.INFO, format="%(asctime)s-[%(name)s]\t[%(levelname)s]\t[%(funcName)s]: %(message)s")
```
To write the log to a file:
The filename parameter sets the log file location.
The filemode parameter sets the write mode for the log file.
```python
logging.basicConfig(level=logging.DEBUG, filename='DecisionTree.log', filemode='w', format="%(asctime)s-[%(name)s]\t[%(levelname)s]\t[%(funcName)s]: %(message)s")
```
Possible Issues When Running the Code
- Wrong dataset format: the Caesarian Section Classification Dataset downloads as an arff file, while this code expects csv. Extract the data section from the arff file; for example, open it in a text editor and save the data portion as csv.
- The code also ships with a demo, which runs without any external dataset.
- The score function only works for binary classification, not multi-class (the decision tree can still predict), and it expects positives labeled 1 and negatives 0.
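The arff-to-csv step can also be scripted instead of done by hand. The sketch below assumes a simple arff file with unquoted, space-free attribute names; the file paths are hypothetical:

```python
import csv

def arff_to_csv(arff_path: str, csv_path: str):
    """Extract attribute names and the @data section of an .arff file into a .csv."""
    header, rows, in_data = [], [], False
    with open(arff_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("%"):      # skip blanks and comments
                continue
            low = line.lower()
            if low.startswith("@attribute"):
                header.append(line.split()[1].strip("'\""))
            elif low.startswith("@data"):
                in_data = True                        # everything after @data is data
            elif in_data:
                rows.append(line.split(","))
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

# arff_to_csv("caesarian.arff", "caesarian.csv")  # hypothetical file names
```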
Experimental Results (Decision Tree)
Debug Mode
Run with the demo dataset:
```text
2020-10-14 00:47:19,827-[root] [DEBUG] [best_empirical_entropy]: now dataset shape is (14, 5), column is ['年齡', '有工作', '是學生', '信貸情況', 'DECISION_tempADD']
2020-10-14 00:47:19,827-[root] [DEBUG] [best_empirical_entropy]: empirical_entropy is 0.9402859586706311
2020-10-14 00:47:19,831-[root] [DEBUG] [best_empirical_entropy]: now is 年齡
2020-10-14 00:47:19,849-[root] [DEBUG] [best_empirical_entropy]: now label is 2, chooseData shape is (5, 4), Ans count: [3, 2], entropy: 0.9709505944546686
2020-10-14 00:47:19,859-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (5, 4), Ans count: [3, 2], entropy: 0.9709505944546686
2020-10-14 00:47:19,865-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (4, 4), Ans count: [4], entropy: 0.0
2020-10-14 00:47:19,865-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.6935361388961918, 年齡 informationGain:0.24674981977443933
2020-10-14 00:47:19,868-[root] [DEBUG] [best_empirical_entropy]: now is 有工作
2020-10-14 00:47:19,880-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (6, 4), Ans count: [4, 2], entropy: 0.9182958340544896
2020-10-14 00:47:19,889-[root] [DEBUG] [best_empirical_entropy]: now label is 2, chooseData shape is (4, 4), Ans count: [2, 2], entropy: 1.0
2020-10-14 00:47:19,896-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (4, 4), Ans count: [3, 1], entropy: 0.8112781244591328
2020-10-14 00:47:19,897-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.9110633930116763, 有工作 informationGain:0.02922256565895487
2020-10-14 00:47:19,898-[root] [DEBUG] [best_empirical_entropy]: now is 是學生
2020-10-14 00:47:19,909-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (7, 4), Ans count: [6, 1], entropy: 0.5916727785823275
2020-10-14 00:47:19,917-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (7, 4), Ans count: [4, 3], entropy: 0.9852281360342515
2020-10-14 00:47:19,918-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.7884504573082896, 是學生 informationGain:0.15183550136234159
2020-10-14 00:47:19,920-[root] [DEBUG] [best_empirical_entropy]: now is 信貸情況
2020-10-14 00:47:19,927-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (8, 4), Ans count: [6, 2], entropy: 0.8112781244591328
2020-10-14 00:47:19,937-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (6, 4), Ans count: [3, 3], entropy: 1.0
2020-10-14 00:47:19,937-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.8921589282623617, 信貸情況 informationGain:0.04812703040826949
2020-10-14 00:47:19,937-[root] [DEBUG] [best_empirical_entropy]: Choose 年齡:0.24674981977443933
2020-10-14 00:47:19,940-[root] [DEBUG] [id3]: now choose_column:年齡, label: 2
2020-10-14 00:47:19,950-[root] [DEBUG] [best_empirical_entropy]: now dataset shape is (5, 4), column is ['有工作', '是學生', '信貸情況', 'DECISION_tempADD']
2020-10-14 00:47:19,950-[root] [DEBUG] [best_empirical_entropy]: empirical_entropy is 0.9709505944546686
2020-10-14 00:47:19,953-[root] [DEBUG] [best_empirical_entropy]: now is 有工作
2020-10-14 00:47:19,964-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (3, 3), Ans count: [2, 1], entropy: 0.9182958340544896
2020-10-14 00:47:19,974-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (2, 3), Ans count: [1, 1], entropy: 1.0
2020-10-14 00:47:19,974-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.9509775004326937, 有工作 informationGain:0.01997309402197489
2020-10-14 00:47:19,976-[root] [DEBUG] [best_empirical_entropy]: now is 是學生
2020-10-14 00:47:19,983-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (3, 3), Ans count: [2, 1], entropy: 0.9182958340544896
2020-10-14 00:47:19,992-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (2, 3), Ans count: [1, 1], entropy: 1.0
2020-10-14 00:47:19,992-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.9509775004326937, 是學生 informationGain:0.01997309402197489
2020-10-14 00:47:19,995-[root] [DEBUG] [best_empirical_entropy]: now is 信貸情況
2020-10-14 00:47:20,004-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (3, 3), Ans count: [3], entropy: 0.0
2020-10-14 00:47:20,013-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (2, 3), Ans count: [2], entropy: 0.0
2020-10-14 00:47:20,013-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.0, 信貸情況 informationGain:0.9709505944546686
2020-10-14 00:47:20,013-[root] [DEBUG] [best_empirical_entropy]: Choose 信貸情況:0.9709505944546686
2020-10-14 00:47:20,015-[root] [DEBUG] [id3]: now choose_column:信貸情況, label: 0
2020-10-14 00:47:20,021-[root] [DEBUG] [id3]: select decision 1, result_type:[3], dataset column:(3, 3), lower than threshold:False
2020-10-14 00:47:20,021-[root] [DEBUG] [id3]: now choose_column:信貸情況, label: 1
2020-10-14 00:47:20,027-[root] [DEBUG] [id3]: select decision 0, result_type:[2], dataset column:(2, 3), lower than threshold:False
2020-10-14 00:47:20,028-[root] [DEBUG] [id3]: now choose_column:年齡, label: 0
2020-10-14 00:47:20,037-[root] [DEBUG] [best_empirical_entropy]: now dataset shape is (5, 4), column is ['有工作', '是學生', '信貸情況', 'DECISION_tempADD']
2020-10-14 00:47:20,037-[root] [DEBUG] [best_empirical_entropy]: empirical_entropy is 0.9709505944546686
2020-10-14 00:47:20,038-[root] [DEBUG] [best_empirical_entropy]: now is 有工作
2020-10-14 00:47:20,046-[root] [DEBUG] [best_empirical_entropy]: now label is 2, chooseData shape is (2, 3), Ans count: [2], entropy: 0.0
2020-10-14 00:47:20,052-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (2, 3), Ans count: [1, 1], entropy: 1.0
2020-10-14 00:47:20,060-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (1, 3), Ans count: [1], entropy: 0.0
2020-10-14 00:47:20,060-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.4, 有工作 informationGain:0.5709505944546686
2020-10-14 00:47:20,061-[root] [DEBUG] [best_empirical_entropy]: now is 是學生
2020-10-14 00:47:20,068-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (3, 3), Ans count: [3], entropy: 0.0
2020-10-14 00:47:20,076-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (2, 3), Ans count: [2], entropy: 0.0
2020-10-14 00:47:20,076-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.0, 是學生 informationGain:0.9709505944546686
2020-10-14 00:47:20,077-[root] [DEBUG] [best_empirical_entropy]: now is 信貸情況
2020-10-14 00:47:20,085-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (3, 3), Ans count: [2, 1], entropy: 0.9182958340544896
2020-10-14 00:47:20,092-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (2, 3), Ans count: [1, 1], entropy: 1.0
2020-10-14 00:47:20,092-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.9509775004326937, 信貸情況 informationGain:0.01997309402197489
2020-10-14 00:47:20,092-[root] [DEBUG] [best_empirical_entropy]: Choose 是學生:0.9709505944546686
2020-10-14 00:47:20,094-[root] [DEBUG] [id3]: now choose_column:是學生, label: 0
2020-10-14 00:47:20,100-[root] [DEBUG] [id3]: select decision 0, result_type:[3], dataset column:(3, 3), lower than threshold:False
2020-10-14 00:47:20,100-[root] [DEBUG] [id3]: now choose_column:是學生, label: 1
2020-10-14 00:47:20,106-[root] [DEBUG] [id3]: select decision 1, result_type:[2], dataset column:(2, 3), lower than threshold:False
2020-10-14 00:47:20,106-[root] [DEBUG] [id3]: now choose_column:年齡, label: 1
2020-10-14 00:47:20,112-[root] [DEBUG] [id3]: select decision 1, result_type:[4], dataset column:(4, 4), lower than threshold:False
2020-10-14 00:47:20,112-[root] [INFO] [fit]: decision_tree leaf:5
2020-10-14 00:47:20,113-[root] [INFO] [save]: 決策樹已保存,位置:decisionTree.txt
2020-10-14 00:47:20,123-[root] [DEBUG] [score]: y_acutalTrue:9, y_acutalFalse:5, y_predictTrue:9, y_true:9, y_total:14
```