lightgbm decision tree visualization with graphviz
Visualizing decision tree models: XGBoost, LightGBM, and CatBoost
Installing graphviz
Reference: http://graphviz.readthedocs.io/en/stable/manual.html#installation
Graphviz installer download: https://www.graphviz.org/download/
Add the Graphviz installation directory to the system PATH environment variable.
Install the graphviz Python package with pip install graphviz.
Install the pydotplus Python package with pip install pydotplus.
Visualizing a decision tree model
Using the iris data as an example, train a classification decision tree, export it in DOT format with the export_graphviz function, and draw the image with the pydotplus package.
# Add the installed Graphviz path to the PATH environment variable
import os
os.environ["PATH"] += os.pathsep + 'E:/Program Files (x86)/Graphviz2.38/bin'

from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

import pydotplus
from IPython.display import Image

dot_data = tree.export_graphviz(clf, out_file=None, feature_names=iris.feature_names, class_names=iris.target_names, filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
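As a side note, recent scikit-learn versions (0.21+) also ship a matplotlib-only alternative, sklearn.tree.plot_tree, which needs no Graphviz installation at all. A minimal sketch:

```python
# Minimal sketch: draw the same iris tree with sklearn.tree.plot_tree,
# which relies only on matplotlib (scikit-learn 0.21+), no Graphviz needed.
import matplotlib
matplotlib.use('Agg')  # headless backend so no display is required
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)

fig, ax = plt.subplots(figsize=(8, 6))
tree.plot_tree(clf, feature_names=iris.feature_names,
               class_names=list(iris.target_names), filled=True, ax=ax)
fig.savefig('iris_tree.png')
```

The max_depth=2 cap here is only to keep the example image small.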
Visualizing an XGBoost model
Reference: https://xgboost.readthedocs.io/en/latest/python/python_api.html
In xgboost, the corresponding visualization function is xgboost.to_graphviz. Using the iris data as an example, train an xgb classification model and visualize it.
# Add the installed Graphviz path to the PATH environment variable
import os
os.environ["PATH"] += os.pathsep + 'E:/Program Files (x86)/Graphviz2.38/bin'

import xgboost as xgb
from sklearn.datasets import load_iris

iris = load_iris()
xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(iris.data, iris.target)
xgb.to_graphviz(xgb_clf, num_trees=1)
The returned Digraph object can also be used to save the image to a file and view it:
digraph = xgb.to_graphviz(xgb_clf, num_trees=1)
digraph.format = 'png'
digraph.view('./iris_xgb')
xgboost also provides another API, plot_tree, which visualizes the tree with matplotlib. The output is not as clear as the graphviz one.
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(10, 10))
ax = fig.subplots()
xgb.plot_tree(xgb_clf, num_trees=1, ax=ax)
plt.show()
Visualizing a LightGBM model
Reference: https://lightgbm.readthedocs.io/en/latest/Python-API.html#plotting
In lightgbm, the corresponding visualization function is lightgbm.create_tree_digraph. Using the iris data as an example, train a lgb classification model and visualize it.
# Add the installed Graphviz path to the PATH environment variable
import os
os.environ["PATH"] += os.pathsep + 'E:/Program Files (x86)/Graphviz2.38/bin'

from sklearn.datasets import load_iris
import lightgbm as lgb

iris = load_iris()
lgb_clf = lgb.LGBMClassifier()
lgb_clf.fit(iris.data, iris.target)
lgb.create_tree_digraph(lgb_clf, tree_index=1)
lightgbm also provides another API, plot_tree, which visualizes the tree with matplotlib. The output is not as clear as the graphviz one.
import matplotlib.pyplot as plt
fig2 = plt.figure(figsize=(20, 20))
ax = fig2.subplots()
lgb.plot_tree(lgb_clf, tree_index=1, ax=ax)
plt.show()
Visualizing a CatBoost model
Reference: https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboostclassifier-docpage/
CatBoost does not provide a model visualization API. The only API that can export the model structure is save_model(fname, format="cbm", export_parameters=None).
Using the iris data as an example, train a CatBoost model.
from sklearn.datasets import load_iris
from catboost import CatBoostClassifier

iris = load_iris()
cat_clf = CatBoostClassifier(iterations=100)
cat_clf.fit(iris.data, iris.target)
Save the model as Python code:
cat_clf.save_model('catboost_model_file.py', format="python", export_parameters=None)
The model can also be saved as C++ code:
cat_clf.save_model('catboost_model_file.cpp', format="cpp", export_parameters=None)
Inspecting the saved Python code, part of it looks like the following. We have to parse the tree structure out of this file ourselves and then draw the image with graphviz.
The exported Python file
First, the first for loop:
binary_feature_index = 0
binary_features = [0] * model.binary_feature_count
for i in range(model.float_feature_count):
    for j in range(model.border_counts[i]):
        binary_features[binary_feature_index] = 1 if (float_features[i] > model.borders[binary_feature_index]) else 0
        binary_feature_index += 1
The input parameter float_features holds the sample's numeric feature values. model.binary_feature_count is the total number of split conditions across all trees in the booster. model.border_counts stores how many split conditions each feature has, and model.borders stores every split's threshold. Clearly, CatBoost does not use the usual left-to-right, top-to-bottom binary-tree storage layout. This loop evaluates every split condition: if the feature value is greater than the threshold the result is 1, otherwise 0, and the results are stored in binary_features.
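To make the layout concrete, here is a toy walk-through of that loop with invented numbers (two features, three borders; none of these values come from the exported iris model):

```python
# Toy illustration of the first loop: binarize features against borders.
# border_counts / borders / float_features are invented for this sketch.
border_counts = [2, 1]        # feature 0 has 2 borders, feature 1 has 1
borders = [0.5, 1.5, 3.0]     # thresholds laid out feature by feature
float_features = [1.0, 4.0]   # one input sample

binary_feature_count = sum(border_counts)
binary_features = [0] * binary_feature_count
binary_feature_index = 0
for i in range(len(border_counts)):
    for j in range(border_counts[i]):
        binary_features[binary_feature_index] = (
            1 if float_features[i] > borders[binary_feature_index] else 0)
        binary_feature_index += 1

# feature 0: 1.0 > 0.5 -> 1, 1.0 > 1.5 -> 0; feature 1: 4.0 > 3.0 -> 1
print(binary_features)  # [1, 0, 1]
```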
Second, the second for loop:
# Extract and sum values from trees
result = 0.0
tree_splits_index = 0
current_tree_leaf_values_index = 0
for tree_id in range(model.tree_count):
    current_tree_depth = model.tree_depth[tree_id]
    index = 0
    for depth in range(current_tree_depth):
        index |= (binary_features[model.tree_splits[tree_splits_index + depth]] << depth)
    result += model.leaf_values[current_tree_leaf_values_index + index]
    tree_splits_index += current_tree_depth
    current_tree_leaf_values_index += (1 << current_tree_depth)
return result
This code produces the model prediction, result. model.tree_count is the number of decision trees, and the loop walks over each of them. model.tree_depth stores the depth of each tree; the current tree's depth is read into current_tree_depth. model.tree_splits stores, for each level of each tree, the index of that level's split in binary_features, so each tree occupies current_tree_depth entries. Evidently CatBoost stores complete binary trees in which all nodes at the same depth share the same feature and threshold (so-called oblivious trees). For a tree of depth 6, for example, we can read a list of six 0/1 values out of binary_features. model.leaf_values stores all leaf values; each tree has (1 << current_tree_depth) leaves. Reading the 0/1 list in reverse as a binary number gives index; adding current_tree_leaf_values_index yields the position of the tree's output in model.leaf_values. Summing over all trees gives the CatBoost prediction.
Reconstructing the CatBoost model trees
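Before reconstructing, a toy walk-through of the leaf-index computation above helps fix the bit layout in mind (all arrays below are invented: one complete tree of depth 2 and a hand-written binary_features vector):

```python
# Toy illustration of the second loop: the per-level split results form
# the bits of the leaf index, least significant bit first.
tree_depth = [2]                         # one tree of depth 2
tree_splits = [0, 2]                     # per-level indices into binary_features
leaf_values = [10.0, 20.0, 30.0, 40.0]   # 1 << 2 leaves
binary_features = [1, 0, 1]              # hand-written split results

result = 0.0
tree_splits_index = 0
current_tree_leaf_values_index = 0
for tree_id in range(len(tree_depth)):
    current_tree_depth = tree_depth[tree_id]
    index = 0
    for depth in range(current_tree_depth):
        index |= binary_features[tree_splits[tree_splits_index + depth]] << depth
    result += leaf_values[current_tree_leaf_values_index + index]
    tree_splits_index += current_tree_depth
    current_tree_leaf_values_index += 1 << current_tree_depth

# bit 0 = binary_features[0] = 1, bit 1 = binary_features[2] = 1,
# so index = 1 | (1 << 1) = 3 and result = leaf_values[3] = 40.0
print(result)  # 40.0
```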
Start from the second for loop: print each tree's id, its depth, the offset of its split indices within tree_splits, and the value stored in tree_splits for each level. Each of these values is an index into the binary_features array generated by the first for loop.
tree_splits_index = 0
current_tree_leaf_values_index = 0
for tree_id in range(tree_count):
    current_tree_depth = tree_depth[tree_id]
    tree_splits_list = []
    for depth in range(current_tree_depth):
        tree_splits_list.append(tree_splits[tree_splits_index + depth])
    print(tree_id, current_tree_depth, tree_splits_index, tree_splits_list)
    tree_splits_index += current_tree_depth
    current_tree_leaf_values_index += (1 << current_tree_depth)
0 6 0 [96, 61, 104, 2, 52, 81]
1 6 6 [95, 99, 106, 44, 91, 14]
2 6 12 [96, 31, 81, 102, 16, 34]
3 6 18 [95, 105, 15, 106, 57, 111]
4 6 24 [95, 51, 30, 8, 75, 57]
5 6 30 [94, 96, 103, 104, 25, 33]
6 6 36 [60, 8, 25, 39, 15, 99]
7 6 42 [96, 27, 48, 50, 69, 111]
8 6 48 [61, 80, 71, 3, 45, 2]
9 4 54 [61, 21, 90, 37]
From the first for loop we can see that each feature's split conditions are stored contiguously, and the count per feature is kept in model.border_counts. From this we can build each feature's index interval within binary_features.
split_list = [0]
for i in range(len(border_counts)):
    split_list.append(split_list[-1] + border_counts[i])
print(border_counts)
print(list(zip(split_list[:-1], split_list[1:])))
[32, 21, 39, 20]
[(0, 32), (32, 53), (53, 92), (92, 112)]
在拿到一個binary_features的索引后即可知道該索引對應的節點使用的特征序號(float_features的索引)。
def find_feature(tree_splits_index):for i in range(len(split_list) - 1):if split_list[i] <= tree_splits_index < split_list[i+1]:return i
A node's index into binary_features also indexes its threshold in borders, and we now know how to map that index to a feature id. All the tree index information has been recovered, so we can draw the trees.
Drawing a single decision tree
First, refactor the code a little so that a single tree's nodes are easy to fetch.
class CatBoostTree(object):
    def __init__(self, CatboostModel):
        self.model = CatboostModel
        self.split_list = [0]
        for i in range(self.model.float_feature_count):
            self.split_list.append(self.split_list[-1] + self.model.border_counts[i])

    def find_feature(self, splits_index):
        # could be optimized with binary search
        for i in range(self.model.float_feature_count):
            if self.split_list[i] <= splits_index < self.split_list[i+1]:
                return i

    def get_split_index(self, tree_id):
        tree_splits_index = 0
        current_tree_leaf_values_index = 0
        for index in range(tree_id):
            current_tree_depth = self.model.tree_depth[index]
            tree_splits_index += current_tree_depth
            current_tree_leaf_values_index += (1 << current_tree_depth)
        return tree_splits_index, current_tree_leaf_values_index

    def get_tree_info(self, tree_id):
        tree_splits_index, current_tree_leaf_values_index = self.get_split_index(tree_id)
        current_tree_depth = self.model.tree_depth[tree_id]
        tree_splits_list = []
        for depth in range(current_tree_depth):
            tree_splits_list.append(self.model.tree_splits[tree_splits_index + depth])
        node_feature_list = [self.find_feature(index) for index in tree_splits_list]
        node_feature_borders = [self.model.borders[index] for index in tree_splits_list]
        end_tree_leaf_values_index = current_tree_leaf_values_index + (1 << current_tree_depth)
        tree_leaf_values = self.model.leaf_values[current_tree_leaf_values_index: end_tree_leaf_values_index]
        return current_tree_depth, node_feature_list, node_feature_borders, tree_leaf_values
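The comment in find_feature notes that it could use binary search; since split_list is sorted and cumulative, the standard library's bisect does exactly that. A sketch, using the interval boundaries printed earlier:

```python
# Sketch: binary-search version of find_feature using bisect.
# split_list holds the cumulative border counts computed earlier,
# i.e. the boundaries of the [(0, 32), (32, 53), (53, 92), (92, 112)]
# intervals from the printed output.
import bisect

split_list = [0, 32, 53, 92, 112]

def find_feature(splits_index):
    # bisect_right finds the first boundary strictly greater than
    # splits_index; the interval before it is the feature id
    return bisect.bisect_right(split_list, splits_index) - 1

print(find_feature(0), find_feature(31), find_feature(32), find_feature(111))
# 0 0 1 3
```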
Below is the function that draws one decision tree; the Python code file exported by CatBoost is passed in via the model_file parameter.
import imp
import os
os.environ["PATH"] += os.pathsep + 'E:/Program Files (x86)/Graphviz2.38/bin'

from graphviz import Digraph

def draw_tree(model_file, tree_id):
    fp, pathname, description = imp.find_module(model_file)
    CatboostModel = imp.load_module('CatboostModel', fp, pathname, description)
    catboost_tree = CatBoostTree(CatboostModel.CatboostModel)
    current_tree_depth, node_feature_list, node_feature_borders, tree_leaf_values = catboost_tree.get_tree_info(tree_id)
    dot = Digraph(name='tree_' + str(tree_id))
    for depth in range(current_tree_depth):
        node_name = str(node_feature_list[current_tree_depth - 1 - depth])
        node_border = str(node_feature_borders[current_tree_depth - 1 - depth])
        label = 'column_' + node_name + '>' + node_border
        if depth == 0:
            dot.node(str(depth) + '_0', label)
        else:
            for j in range(1 << depth):
                dot.node(str(depth) + '_' + str(j), label)
                dot.edge(str(depth-1) + '_' + str(j//2), str(depth) + '_' + str(j), label='No' if j % 2 == 0 else 'Yes')
    depth = current_tree_depth
    for j in range(1 << depth):
        dot.node(str(depth) + '_' + str(j), str(tree_leaf_values[j]))
        dot.edge(str(depth-1) + '_' + str(j//2), str(depth) + '_' + str(j), label='No' if j % 2 == 0 else 'Yes')
    # dot.format = 'png'
    path = dot.render('./' + str(tree_id), cleanup=True)
    print(path)
For example, draw tree number 11 (indices start at 0): draw_tree('catboost_model_file', 11).
To verify the drawing is correct, generate each tree's decision path and value for a test sample, spot-check one or two samples and one or two of their trees, and see whether the results agree.
def test_tree(model_file, float_features):
    fp, pathname, description = imp.find_module(model_file)
    CatboostModel = imp.load_module('CatboostModel', fp, pathname, description)
    model = CatboostModel.CatboostModel
    catboost_tree = CatBoostTree(CatboostModel.CatboostModel)
    result = 0
    for tree_id in range(model.tree_count):
        current_tree_depth, node_feature_list, node_feature_borders, tree_leaf_values = catboost_tree.get_tree_info(tree_id)
        route = []
        for depth in range(current_tree_depth):
            route.append(1 if float_features[node_feature_list[depth]] > node_feature_borders[depth] else 0)
        index = 0
        for depth in range(current_tree_depth):
            index |= route[depth] << depth
        tree_value = tree_leaf_values[index]
        print(route, index, tree_value)
        result += tree_value
    return result
For instance, after drawing the image of tree 11, trace the test sample through the image by hand to obtain a value A. test_tree prints one line per tree; the line printed for tree 11 gives a value B. If A equals B, the test passes.
We should also verify that the sum over all trees equals the output of the apply_catboost_model function in the exported file; a short script can run the whole training set through both.
from catboost_model_file import apply_catboost_model
from CatBoostModelInfo import test_tree
from sklearn.datasets import load_iris

def main():
    iris = load_iris()
    # print(iris.data)
    # print(iris.target)
    for feature in iris.data:
        if apply_catboost_model(feature) != test_tree('catboost_model_file', feature):
            print(False)
    print('End.')

if __name__ == '__main__':
    main()
With that, the CatBoost model visualization is complete.
Common visualization problems
1. ModuleNotFoundError: No module named 'graphviz'
Cause: the graphviz Python package is not installed.
2. graphviz.backend.ExecutableNotFound: failed to execute ['dot', '-Tpng', '-O', 'tmp'], make sure the Graphviz executables are on your systems' PATH
Cause: the Graphviz executables' location is not added to the system PATH environment variable.
Solution: https://blog.csdn.net/az9996/article/details/86552979
Summary
scikit-learn, XGBoost, and LightGBM all provide built-in Graphviz-based tree visualization APIs; CatBoost does not, but its trees can be reconstructed from the model file exported with save_model(format="python") and drawn with graphviz by hand.