當前位置：首頁 > 编程语言 > python >内容正文

python

python画决策树如何避免太多内容重叠在一起_Python手写决策树并应对过度拟合问题...

發布時間：2025/3/20 python 30 豆豆

生活随笔收集整理的這篇文章主要介紹了 python画决策树如何避免太多内容重叠在一起_Python手写决策树并应对过度拟合问题... 小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

介紹

決策樹是一種用于監督學習的算法。它使用樹結構，其中包含兩種類型的節點：決策節點和葉節點。決策節點通過在要素上詢問布爾值將數據分為兩個分支。葉節點代表一個類。訓練過程是關于在具有特定特征的特定特征中找到“最佳”分割。預測過程是通過沿著路徑的每個決策節點回答問題來從根到達葉節點。

基尼不純度和熵

術語“最佳”拆分是指拆分之后，兩個分支比任何其他可能的拆分更“有序”。我們如何定義更多有序的？這取決于我們選擇哪種指標。通常，度量有兩種類型：基尼不純度和熵。這些指標越小，數據集就越“有序”。

這兩個指標之間的差異非常微妙。但在大多數應用中，兩個指標的行為類似。以下是用于計算每個指標的代碼。

def gini_impurity(y):

# calculate gini_impurity given labels/classes of each example

m = y.shape[0]

cnts = dict(zip(*np.unique(y, return_counts = True)))

impurity = 1 - sum((cnt/m)**2 for cnt in cnts.values())

return impurity

def entropy(y):

# calculate entropy given labels/classes of each example

m = y.shape[0]

cnts = dict(zip(*np.unique(y, return_counts = True)))

disorder = - sum((cnt/m)*log(cnt/m) for cnt in cnts.values())

return disorder

構建樹

訓練過程實質上是在樹上建樹。關鍵步驟是確定“最佳”分配。過程如下：我們嘗試按每個功能中的每個唯一值分割數據，然后選擇混亂程度最小的最佳數據。現在我們可以將此過程轉換為python代碼。

def get_split(X, y):

# loop through features and values to find best combination with the most information gain

best_gain, best_index, best_value = 0, None, None

cur_gini = gini_impurity(y)

n_features = X.shape[1]

for index in range(n_features):

values = np.unique(X[:, index], return_counts = False)

for value in values:

left, right = test_split(index, value, X, y)

if left['y'].shape[0] == 0 or right['y'].shape[0] == 0:

continue

gain = info_gain(left['y'], right['y'], cur_gini)

if gain > best_gain:

best_gain, best_index, best_value = gain, index, value

best_split = {'gain': best_gain, 'index': best_index, 'value': best_value}

return best_split

def test_split(index, value, X, y):

# split a group of examples based on given index (feature) and value

mask = X[:, index] < value

left = {'X': X[mask, :], 'y': y[mask]}

right = {'X': X[~mask, :], 'y': y[~mask]}

return left, right

def info_gain(l_y, r_y, cur_gini):

# calculate the information gain for a certain split

m, n = l_y.shape[0], r_y.shape[0]

p = m / (m + n)

return cur_gini - p * gini_impurity(l_y) - (1 - p) * gini_impurity(r_y)

在構建樹之前，我們先定義決策節點和葉節點。決策節點指定將在其上拆分的特征和值。它還指向左，右子項。葉節點包括類似于Counter對象的字典，該字典顯示每個類有多少訓練示例。這對于計算訓練的準確性很有用。另外，它導致到達該葉子的每個示例的結果預測。

class Decision_Node:

# define a decision node

def __init__(self, index, value, left, right):

self.index, self.value = index, value

self.left, self.right = left, right

class Leaf:

# define a leaf node

def __init__(self, y):

self.counts = dict(zip(*np.unique(y, return_counts = True)))

self.prediction = max(self.counts.keys(), key = lambda x: self.counts[x])

鑒于其結構，通過遞歸構造樹是最方便的。遞歸的出口是葉節點。當我們無法通過拆分提高數據純度時，就會發生這種情況。如果我們可以找到“最佳”拆分，則這將成為決策節點。然后，我們對其左，右子級遞歸執行相同的操作。

def decision_tree(X, y):

# train the decision tree model with a dataset

correct_prediction = 0

def build_tree(X, y):

# recursively build the tree

split = get_split(X, y)

if split['gain'] == 0:

nonlocal correct_prediction

leaf = Leaf(y)

correct_prediction += leaf.counts[leaf.prediction]

return leaf

left, right = test_split(split['index'], split['value'], X, y)

left_node = build_tree(left['X'], left['y'])

right_node = build_tree(right['X'], right['y'])

return Decision_Node(split['index'], split['value'], left_node, right_node)

root = build_tree(X, y)

return correct_prediction/y.shape[0], root

預測

現在我們可以遍歷樹直到葉節點來預測一個示例。

def predict(x, node):

if isinstance(node, Leaf):

return node.prediction

if x[node.index] < node.value:

return predict(x, node.left)

else:

return predict(x, node.right)

事實證明，訓練精度為100％，決策邊界看起來很奇怪！顯然，該模型過度擬合了訓練數據。好吧，如果考慮到這一點，如果我們繼續拆分直到數據集變得更純凈，決策樹將過度適合數據。換句話說，如果我們不停止分裂，該模型將正確分類每個示例！訓練準確性為100％(除非具有完全相同功能的不同類別的示例)，這絲毫不令人驚訝。

如何應對過度擬合？

從上一節中，我們知道決策樹過擬合的幕后原因。為了防止過度擬合，我們需要在某個時候停止拆分樹。因此，我們需要引入兩個用于訓練的超參數。它們是：樹的最大深度和葉子的最小尺寸。讓我們重寫樹的構建部分。

def decision_tree(X, y, max_dep = 5, min_size = 10):

# train the decision tree model with a dataset

correct_prediction = 0

def build_tree(X, y, dep, max_dep = max_dep, min_size = min_size):

# recursively build the tree

split = get_split(X, y)

if split['gain'] == 0 or dep >= max_dep or y.shape[0] <= min_size:

nonlocal correct_prediction

leaf = Leaf(y)

correct_prediction += leaf.counts[leaf.prediction]

return leaf

left, right = test_split(split['index'], split['value'], X, y)

left_node = build_tree(left['X'], left['y'], dep + 1)

right_node = build_tree(right['X'], right['y'], dep + 1)

return Decision_Node(split['index'], split['value'], left_node, right_node)

root = build_tree(X, y, 0)

return correct_prediction/y.shape[0], root

現在我們可以重新訓練數據并繪制決策邊界。

樹的可視化

接下來，我們將通過打印出決策樹的節點來可視化決策樹。節點的壓痕與其深度成正比。

def print_tree(node, indent = "|---"):

# print the tree

if isinstance(node, Leaf):

print(indent + 'Class:', node.prediction)

return

print(indent + 'feature_' + str(node.index) +

' <= ' + str(round(node.value, 2)))

print_tree(node.left, '| ' + indent)

print(indent + 'feature_' + str(node.index) +

' > ' + str(round(node.value, 2)))

print_tree(node.right, '| ' + indent)

結果如下：

|---feature_1 <= 1.87

| |---feature_1 <= -0.74

| | |---feature_1 <= -1.79

| | | |---feature_1 <= -2.1

| | | |---feature_1 > -2.1

| | |---feature_1 > -1.79

| | | |---feature_0 <= 1.62

| | | |---feature_0 > 1.62

| |---feature_1 > -0.74

| | |---feature_1 <= 0.76

| | | |---feature_0 <= 0.89

| | | |---feature_0 > 0.89

| | |---feature_1 > 0.76

| | | |---feature_0 <= -1.6

| | | |---feature_0 > -1.6

|---feature_1 > 1.87

| |---Class: 2

總結

與其他回歸模型不同，決策樹不使用正則化來對抗過度擬合。相反，它使用樹修剪。選擇正確的超參數(樹的深度和葉子的大小)還需要進行實驗，例如使用超參數矩陣進行交叉驗證。

作者：Jun M

deephub翻譯組

總結

以上是生活随笔為你收集整理的python画决策树如何避免太多内容重叠在一起_Python手写决策树并应对过度拟合问题...的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇： airpods pro是按压还是触摸_A
下一篇： vb不能插入png图片_第16节-图片