【Python Machine Learning】Decision Tree ID3 Algorithm with Visualized Results and Source Code: Classifying the UCI Caesarian Section Dataset
Decision Tree
- Libraries Used
- Implementation
- Empirical Entropy
- Empirical Entropy Formula
- Conditional Entropy
- Information Gain
- ID3
- Selecting the Attribute with the Largest Information Gain
- Procedure
- Fitting
- Prediction
- Evaluation
- Decision Tree Visualization
- Saving the Decision Tree
- Loading the Decision Tree
- Example Output
- Full Code
- How to Obtain Each Intermediate Result
- Experimental Results (Decision Tree)
- Debug Mode
A decision tree is a decision-analysis method that, given the probabilities of various outcomes, builds a tree to compute the probability that the expected net present value is greater than or equal to zero, evaluates project risk, and judges feasibility; it is a graphical way of applying probability analysis directly. Because the decision branches drawn as a diagram resemble the limbs of a tree, it is called a decision tree. Source: Decision tree, Baidu Baike
The dataset used is the UCI Caesarian Section Classification Dataset.
[Detailed information about the dataset and download link]
- This code implements the ID3 decision-tree algorithm and uses it for prediction.
- The algorithm is written as a class, so the code can be reused and is simpler to work with.
- Setting the logging level to DEBUG prints the detailed result of every step of building the tree.
- The decision tree is visualized using mermaid's text-based diagram format.
Libraries Used
- Python 3
- Pandas
- sklearn (only used for splitting the dataset)
- numpy
Implementation
Empirical Entropy
When the probabilities in the entropy are estimated from data (in particular, by maximum likelihood estimation), the resulting entropy is called the empirical entropy.
Empirical Entropy Formula
$$H = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$$
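As a sanity check, the formula above can be evaluated directly on a label column. The sketch below is a minimal standalone illustration using pandas/numpy, separate from the class method shown next:

```python
import numpy as np
import pandas as pd

def entropy_of(labels):
    """Empirical entropy H = -sum(p_i * log2(p_i)) of a label column."""
    p = pd.Series(labels).value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

# A 50/50 split carries exactly 1 bit of uncertainty.
print(entropy_of([1, 1, 0, 0]))  # → 1.0
```

With 9 positives and 5 negatives out of 14 samples, this gives 0.9402859586706311, matching the value seen in the debug log later in this article.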
```python
def empirical_entropy(self, dataset=None):
    """Compute the empirical entropy
    $$H = -\sum^n_{i=1}p(x_i)log_2(p(x_i))$$
    :return: Float, the empirical entropy
    """
    if dataset is None:
        dataset = self.DataSet
    columns_count = dataset.iloc[:, -1].value_counts()
    entropy = 0
    total_count = columns_count.sum()
    for count in columns_count:
        p = count / total_count
        entropy -= p * np.log2(p)
    return entropy
```

Conditional Entropy
The conditional entropy $H(Y|X)$ measures the remaining uncertainty of a random variable Y given that the random variable X is known.
It is defined as the expectation, over X, of the entropy of the conditional distribution of Y given X:
$$H(Y|X) = \sum_{i=1}^{n} p_i\, H(Y|X=x_i), \qquad p_i = P(X=x_i)$$
Information Gain
Information gain measures how much knowing feature X reduces the uncertainty about the class Y.
In other words: how much the feature helps the classification.
When the classification problem is hard, i.e. when the empirical entropy of the training set is large, information gain tends to be large, and vice versa.
The information gain ratio corrects for this bias and is another criterion for feature selection.
The information gain g(D, A) of feature A on training set D is defined as the difference between the empirical entropy H(D) of D and the empirical conditional entropy H(D|A) of D given A:
$$g(D, A) = H(D) - H(D|A)$$
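Both definitions can be sketched in a few lines. This is a standalone illustration, not the class's own code; `feature` and `target` are hypothetical column names:

```python
import numpy as np
import pandas as pd

def entropy(series):
    """Empirical entropy of a pandas Series."""
    p = series.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(df, feature, target):
    """H(D|A): weighted entropy of the target within each value of the feature."""
    total = len(df)
    return sum(len(sub) / total * entropy(sub[target])
               for _, sub in df.groupby(feature))

def information_gain(df, feature, target):
    """g(D, A) = H(D) - H(D|A)."""
    return entropy(df[target]) - conditional_entropy(df, feature, target)

df = pd.DataFrame({"wind": ["weak", "strong", "weak", "strong"],
                   "play": [1, 0, 1, 0]})
print(information_gain(df, "wind", "play"))  # → 1.0 (feature determines the class)
```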
ID3
In short, ID3 repeatedly picks the attribute that contributes the most to the classification, then, for each value of that attribute, picks the next best attribute.
Selecting the Attribute with the Largest Information Gain
A smaller conditional empirical entropy means the split results are more uniform, i.e. the information gain is larger, so the attribute matters more for the classification.
Here extract_dataset returns the subset of the dataset matching the given condition; it is used to compute the conditional empirical entropy and, from it, the information gain.
```python
def extract_dataset(self, dataset: pd.DataFrame, column, label):
    """Filter the dataset by column and label
    :return: pd.DataFrame, the filtered dataset
    """
    if type(column) == int:
        split_dataset = dataset[dataset.iloc[:, column] == label].drop(dataset.columns[column], axis=1)
    else:
        split_dataset = dataset[dataset.loc[:, column] == label].drop(column, axis=1)
    return split_dataset

def best_empirical_entropy(self, dataset: pd.DataFrame = None):
    """Select the best column of the dataset (largest information gain)
    :param dataset: the dataset to select from
    :return: the chosen column
    """
    if dataset is None:
        dataset = self.DataSet
    columns = dataset.columns[:-1]
    total_count = dataset.shape[0]
    empirical_entropy = self.empirical_entropy(dataset)
    logging.debug(f"now dataset shape is {dataset.shape}, column is {dataset.columns.tolist()}")
    logging.debug(f"empirical_entropy is {empirical_entropy}")
    informationGain_max = -1
    best_column = None
    for column in columns:
        entropy_tmp = 0
        data_counts = dataset.loc[:, column].value_counts()
        data_labels = data_counts.index
        logging.debug(f"now is {column}")
        for label in data_labels:
            split_dataset = self.extract_dataset(dataset, column, label)
            count = split_dataset.shape[0]
            p = count / total_count
            entropy_tmp += p * self.empirical_entropy(split_dataset)
            logging.debug(f"now label is {label}, chooseData shape is {split_dataset.shape}, "
                          f"Ans count: {split_dataset.iloc[:, -1].value_counts().tolist()}, "
                          f"entropy: {self.empirical_entropy(split_dataset)}")
        informationGain = empirical_entropy - entropy_tmp
        logging.debug(f"entropy: {entropy_tmp}, {column} informationGain:{informationGain}")
        if informationGain > informationGain_max:
            best_column = column
            informationGain_max = informationGain
    logging.debug(f"Choose {best_column}:{informationGain_max}")
    return best_column
```

Procedure
Why there may be no attribute left to select: rows with identical attribute values can still have different outcomes.
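The overall recursion can be summarized as follows. This is a simplified standalone sketch of the ID3 procedure described above, not the class's actual `id3` method; the gain-threshold pre-pruning mirrors the `threshold` parameter of `fit`:

```python
import numpy as np
import pandas as pd

def entropy(s):
    p = s.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def id3(df, target, threshold=0.1):
    """Build a nested-dict decision tree {feature: {value: subtree}} from df."""
    labels = df[target]
    # Stop: node is pure, or no features remain -> majority vote.
    if labels.nunique() == 1 or df.shape[1] == 1:
        return labels.mode()[0]
    features = [c for c in df.columns if c != target]
    # Information gain of each remaining feature.
    gains = {f: entropy(labels) - sum(len(g) / len(df) * entropy(g[target])
                                      for _, g in df.groupby(f))
             for f in features}
    best = max(gains, key=gains.get)
    if gains[best] < threshold:  # pre-pruning by the gain threshold
        return labels.mode()[0]
    return {best: {v: id3(g.drop(columns=best), target, threshold)
                   for v, g in df.groupby(best)}}
```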
Fitting
```python
def fit(self, x: pd.DataFrame, y=None, algorithm: str = "id3", threshold=0.1):
    """Fit the decision tree. If y is not given, the last column of x
    must contain the class labels.
    :param x: pd.DataFrame, the attributes (when y is None, the whole dataset including labels)
    :param y: list-like, shape=(-1,), the class labels
    :param algorithm: algorithm to use (currently only ID3)
    :param threshold: information-gain threshold
    :return: root node of the decision tree
    """
    self.check_dataset(x, dimension=2)
    self.check_dataset(y, dimension=1)
    self._threshold = threshold
    dataset = x
    if y is not None:
        dataset.insert(dataset.shape[1], 'DECISION_tempADD', y)
    self.decision_tree = eval("self." + algorithm)(dataset)
    logging.info(f"decision_tree leaf:{self._leafCount}")
    return self.decision_tree
```

Prediction
```python
def predict(self, x: pd.DataFrame):
    """Predict on the given data
    :param x: pd.DataFrame, the input dataset
    :return: classification results
    """
    self.y_predict = x.apply(self._predict_line, axis=1)
    return self.y_predict

def _predict_line(self, line):
    """Private helper used by predict to classify a single row
    :param line: one row of the input dataset
    :return: the predicted class for that row
    """
    tree = self.decision_tree
    while True:
        try:
            if len(tree["next"]) == 1:
                return tree["next"]["其他"]
            else:
                value = line[tree["column"]]
                tree = tree["next"][value]
        except:
            return tree["next"]["其他"]
```

Evaluation
Evaluate the accuracy, precision, and recall of the results.
- The score function only works for binary classification; it does not apply to multi-class problems (although the decision tree itself can still predict them).
- score also expects positive examples to be labeled 1 and negative examples 0.
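For reference, with positives encoded as 1 and negatives as 0 as score expects, the three metrics can be computed as below (a minimal sketch, not the class's own score method):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0  # guard against no predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0     # guard against no actual positives
    return accuracy, precision, recall

print(binary_metrics([1, 1, 0, 0], [1, 0, 1, 0]))  # → (0.5, 0.5, 0.5)
```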
Decision Tree Visualization
The tree is drawn with mermaid's text-based diagram format. Predicted values are merged: when different values of the same attribute lead to the same class, they all point to a single output node in the visualization.
- The visualization function offers two output formats:
- markdown format
- html format (recommended; the tree can be viewed directly in a browser)
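To illustrate the idea, a nested-dict tree can be rendered as mermaid flowchart text roughly like this. This is a standalone sketch using a simplified tree format `{column: {value: subtree}}`, not the class's actual visualization code, and the node ids are invented:

```python
def tree_to_mermaid(tree, counter=None, lines=None):
    """Emit mermaid 'graph TD' text for a nested dict {column: {value: subtree}}.
    Returns (node_id, full_text_so_far)."""
    if counter is None:
        counter, lines = [0], ["graph TD"]
    node = f"n{counter[0]}"
    counter[0] += 1
    if not isinstance(tree, dict):          # leaf: emit a result node
        lines.append(f'{node}["{tree}"]')
        return node, "\n".join(lines)
    column = next(iter(tree))               # internal node: attribute name
    lines.append(f'{node}["{column}"]')
    for value, subtree in tree[column].items():
        child, _ = tree_to_mermaid(subtree, counter, lines)
        lines.append(f"{node} -->|{value}| {child}")  # edge labeled with the value
    return node, "\n".join(lines)

_, text = tree_to_mermaid({"wind": {"weak": "yes", "strong": "no"}})
print(text)
```

Pasting the printed text into any mermaid renderer (or an html page loading mermaid.js) draws the tree.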
Saving the Decision Tree
```python
def save(self, savePath: str):
    # note: the original wrote decisionTree.decision_tree, referencing a
    # global instance; self.decision_tree is what is meant here
    open(savePath, "w").write(str(self.decision_tree))
    logging.info(f"決策樹已保存,位置:{savePath}")
```

Loading the Decision Tree
```python
def load(self, savePath: str):
    tree = eval(open(savePath, "r").read())
    if type(tree) == dict:
        self.decision_tree = tree
    else:
        raise Exception("Load Failed!")
```

Note that `eval` executes whatever the file contains, so only load tree files from a trusted source.

Example Output
Sample diagram; not the classification result for this dataset.
Full Code
How to Obtain Each Intermediate Result
If you do not want that much step-by-step detail, set the level in logging.basicConfig at the top of the file to INFO.
That is, change:
```python
logging.basicConfig(level=logging.DEBUG, format="%(asctime)s-[%(name)s]\t[%(levelname)s]\t[%(funcName)s]: %(message)s")
```
to:
```python
logging.basicConfig(level=logging.INFO, format="%(asctime)s-[%(name)s]\t[%(levelname)s]\t[%(funcName)s]: %(message)s")
```
To write the log to a file:
The filename parameter sets the log file location.
The filemode parameter sets the write mode for the log file.
```python
logging.basicConfig(level=logging.DEBUG, filename='DecisionTree.log', filemode='w', format="%(asctime)s-[%(name)s]\t[%(levelname)s]\t[%(funcName)s]: %(message)s")
```
Possible Issues When Running the Code
- Wrong dataset format: the Caesarian Section Classification Dataset downloads as an arff file, while this code expects csv. Extract the data section from the arff file; for example, open it in a text editor and save the data portion as csv.
- The code also ships with a demo, which runs without any external dataset.
- The score function only works for binary classification, not multi-class (the decision tree can still predict), and it expects positives labeled 1 and negatives 0.
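The arff-to-csv step can also be scripted instead of done by hand. The sketch below assumes a simple arff file with unquoted, space-free attribute names; the file paths are hypothetical:

```python
import csv

def arff_to_csv(arff_path: str, csv_path: str):
    """Extract attribute names and the @data section of an .arff file into a .csv."""
    header, rows, in_data = [], [], False
    with open(arff_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("%"):      # skip blanks and comments
                continue
            low = line.lower()
            if low.startswith("@attribute"):
                header.append(line.split()[1].strip("'\""))
            elif low.startswith("@data"):
                in_data = True                        # everything after @data is data
            elif in_data:
                rows.append(line.split(","))
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

# arff_to_csv("caesarian.arff", "caesarian.csv")  # hypothetical file names
```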
Experimental Results (Decision Tree)
Debug Mode
Run with the demo dataset:
```text
2020-10-14 00:47:19,827-[root] [DEBUG] [best_empirical_entropy]: now dataset shape is (14, 5), column is ['年齡', '有工作', '是學生', '信貸情況', 'DECISION_tempADD']
2020-10-14 00:47:19,827-[root] [DEBUG] [best_empirical_entropy]: empirical_entropy is 0.9402859586706311
2020-10-14 00:47:19,831-[root] [DEBUG] [best_empirical_entropy]: now is 年齡
2020-10-14 00:47:19,849-[root] [DEBUG] [best_empirical_entropy]: now label is 2, chooseData shape is (5, 4), Ans count: [3, 2], entropy: 0.9709505944546686
2020-10-14 00:47:19,859-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (5, 4), Ans count: [3, 2], entropy: 0.9709505944546686
2020-10-14 00:47:19,865-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (4, 4), Ans count: [4], entropy: 0.0
2020-10-14 00:47:19,865-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.6935361388961918, 年齡 informationGain:0.24674981977443933
2020-10-14 00:47:19,868-[root] [DEBUG] [best_empirical_entropy]: now is 有工作
2020-10-14 00:47:19,880-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (6, 4), Ans count: [4, 2], entropy: 0.9182958340544896
2020-10-14 00:47:19,889-[root] [DEBUG] [best_empirical_entropy]: now label is 2, chooseData shape is (4, 4), Ans count: [2, 2], entropy: 1.0
2020-10-14 00:47:19,896-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (4, 4), Ans count: [3, 1], entropy: 0.8112781244591328
2020-10-14 00:47:19,897-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.9110633930116763, 有工作 informationGain:0.02922256565895487
2020-10-14 00:47:19,898-[root] [DEBUG] [best_empirical_entropy]: now is 是學生
2020-10-14 00:47:19,909-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (7, 4), Ans count: [6, 1], entropy: 0.5916727785823275
2020-10-14 00:47:19,917-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (7, 4), Ans count: [4, 3], entropy: 0.9852281360342515
2020-10-14 00:47:19,918-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.7884504573082896, 是學生 informationGain:0.15183550136234159
2020-10-14 00:47:19,920-[root] [DEBUG] [best_empirical_entropy]: now is 信貸情況
2020-10-14 00:47:19,927-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (8, 4), Ans count: [6, 2], entropy: 0.8112781244591328
2020-10-14 00:47:19,937-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (6, 4), Ans count: [3, 3], entropy: 1.0
2020-10-14 00:47:19,937-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.8921589282623617, 信貸情況 informationGain:0.04812703040826949
2020-10-14 00:47:19,937-[root] [DEBUG] [best_empirical_entropy]: Choose 年齡:0.24674981977443933
2020-10-14 00:47:19,940-[root] [DEBUG] [id3]: now choose_column:年齡, label: 2
2020-10-14 00:47:19,950-[root] [DEBUG] [best_empirical_entropy]: now dataset shape is (5, 4), column is ['有工作', '是學生', '信貸情況', 'DECISION_tempADD']
2020-10-14 00:47:19,950-[root] [DEBUG] [best_empirical_entropy]: empirical_entropy is 0.9709505944546686
2020-10-14 00:47:19,953-[root] [DEBUG] [best_empirical_entropy]: now is 有工作
2020-10-14 00:47:19,964-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (3, 3), Ans count: [2, 1], entropy: 0.9182958340544896
2020-10-14 00:47:19,974-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (2, 3), Ans count: [1, 1], entropy: 1.0
2020-10-14 00:47:19,974-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.9509775004326937, 有工作 informationGain:0.01997309402197489
2020-10-14 00:47:19,976-[root] [DEBUG] [best_empirical_entropy]: now is 是學生
2020-10-14 00:47:19,983-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (3, 3), Ans count: [2, 1], entropy: 0.9182958340544896
2020-10-14 00:47:19,992-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (2, 3), Ans count: [1, 1], entropy: 1.0
2020-10-14 00:47:19,992-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.9509775004326937, 是學生 informationGain:0.01997309402197489
2020-10-14 00:47:19,995-[root] [DEBUG] [best_empirical_entropy]: now is 信貸情況
2020-10-14 00:47:20,004-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (3, 3), Ans count: [3], entropy: 0.0
2020-10-14 00:47:20,013-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (2, 3), Ans count: [2], entropy: 0.0
2020-10-14 00:47:20,013-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.0, 信貸情況 informationGain:0.9709505944546686
2020-10-14 00:47:20,013-[root] [DEBUG] [best_empirical_entropy]: Choose 信貸情況:0.9709505944546686
2020-10-14 00:47:20,015-[root] [DEBUG] [id3]: now choose_column:信貸情況, label: 0
2020-10-14 00:47:20,021-[root] [DEBUG] [id3]: select decision 1, result_type:[3], dataset column:(3, 3), lower than threshold:False
2020-10-14 00:47:20,021-[root] [DEBUG] [id3]: now choose_column:信貸情況, label: 1
2020-10-14 00:47:20,027-[root] [DEBUG] [id3]: select decision 0, result_type:[2], dataset column:(2, 3), lower than threshold:False
2020-10-14 00:47:20,028-[root] [DEBUG] [id3]: now choose_column:年齡, label: 0
2020-10-14 00:47:20,037-[root] [DEBUG] [best_empirical_entropy]: now dataset shape is (5, 4), column is ['有工作', '是學生', '信貸情況', 'DECISION_tempADD']
2020-10-14 00:47:20,037-[root] [DEBUG] [best_empirical_entropy]: empirical_entropy is 0.9709505944546686
2020-10-14 00:47:20,038-[root] [DEBUG] [best_empirical_entropy]: now is 有工作
2020-10-14 00:47:20,046-[root] [DEBUG] [best_empirical_entropy]: now label is 2, chooseData shape is (2, 3), Ans count: [2], entropy: 0.0
2020-10-14 00:47:20,052-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (2, 3), Ans count: [1, 1], entropy: 1.0
2020-10-14 00:47:20,060-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (1, 3), Ans count: [1], entropy: 0.0
2020-10-14 00:47:20,060-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.4, 有工作 informationGain:0.5709505944546686
2020-10-14 00:47:20,061-[root] [DEBUG] [best_empirical_entropy]: now is 是學生
2020-10-14 00:47:20,068-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (3, 3), Ans count: [3], entropy: 0.0
2020-10-14 00:47:20,076-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (2, 3), Ans count: [2], entropy: 0.0
2020-10-14 00:47:20,076-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.0, 是學生 informationGain:0.9709505944546686
2020-10-14 00:47:20,077-[root] [DEBUG] [best_empirical_entropy]: now is 信貸情況
2020-10-14 00:47:20,085-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (3, 3), Ans count: [2, 1], entropy: 0.9182958340544896
2020-10-14 00:47:20,092-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (2, 3), Ans count: [1, 1], entropy: 1.0
2020-10-14 00:47:20,092-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.9509775004326937, 信貸情況 informationGain:0.01997309402197489
2020-10-14 00:47:20,092-[root] [DEBUG] [best_empirical_entropy]: Choose 是學生:0.9709505944546686
2020-10-14 00:47:20,094-[root] [DEBUG] [id3]: now choose_column:是學生, label: 0
2020-10-14 00:47:20,100-[root] [DEBUG] [id3]: select decision 0, result_type:[3], dataset column:(3, 3), lower than threshold:False
2020-10-14 00:47:20,100-[root] [DEBUG] [id3]: now choose_column:是學生, label: 1
2020-10-14 00:47:20,106-[root] [DEBUG] [id3]: select decision 1, result_type:[2], dataset column:(2, 3), lower than threshold:False
2020-10-14 00:47:20,106-[root] [DEBUG] [id3]: now choose_column:年齡, label: 1
2020-10-14 00:47:20,112-[root] [DEBUG] [id3]: select decision 1, result_type:[4], dataset column:(4, 4), lower than threshold:False
2020-10-14 00:47:20,112-[root] [INFO] [fit]: decision_tree leaf:5
2020-10-14 00:47:20,113-[root] [INFO] [save]: 決策樹已保存,位置:decisionTree.txt
2020-10-14 00:47:20,123-[root] [DEBUG] [score]: y_acutalTrue:9, y_acutalFalse:5, y_predictTrue:9, y_true:9, y_total:14
```