
《机器学习实战》(Machine Learning in Action) Notes (03): Decision Trees

Published 2023/12/13 on 生活随笔 (编程问答).

Decision Trees

The kNN algorithm can handle many classification tasks, but its biggest drawback is that it cannot reveal the underlying meaning of the data. The main advantage of decision trees is that the form of the data they produce is very easy to understand.

Constructing a Decision Tree

  • Pros: low computational cost; results are easy to interpret; insensitive to missing intermediate values; can handle irrelevant features.
  • Cons: prone to overfitting.

Applicable data types: numeric and nominal.

Pseudocode for the branch-creating function createBranch():

Check if every item in the dataset is in the same class:
    If so, return the class label
    Else
        find the best feature to split the data
        split the dataset
        create a branch node
        for each split
            call createBranch() and add the result to the branch node
        return branch node

Example data

Marine animal data (the five instances returned by createDataSet() below):

    Can survive without surfacing?  Has flippers?  Is a fish?
1   yes                             yes            yes
2   yes                             yes            yes
3   yes                             no             no
4   no                              yes            no
5   no                              yes            no

Information Gain

The guiding principle for splitting a dataset: make disordered data more ordered.

One way to organize messy data is to use information theory to measure information.

The change in information before and after splitting a dataset is called the information gain.

Once we know how to compute information gain, we can compute the gain obtained by splitting the dataset on each feature; the feature with the highest information gain is the best one to split on.

(John von Neumann suggested using the term "entropy".)

Information gain is the reduction in entropy (the disorder of the data); it is usually easier to think of it as a decrease in measured disorder.

The measure of information in a set is called Shannon entropy, or simply entropy. (For more on entropy, see What Is Information Entropy.)

Entropy is defined as the expected value of information.

Definition of information

If the items being classified can fall into multiple classes, the information of the symbol x_i is defined as

    l(x_i) = -log2 p(x_i)

where p(x_i) is the probability of choosing that class.

To compute the entropy, we take the expected value of this information over all possible values of all classes:

    H = -sum_{i=1}^{n} p(x_i) * log2 p(x_i)
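As a quick worked check of this formula (my own example, not from the book): the label set used below has two 'yes' and three 'no', so

```python
from math import log2

labels = ['yes', 'yes', 'no', 'no', 'no']
p_yes, p_no = 2 / 5, 3 / 5
# plug the two class probabilities into the entropy formula
H = -(p_yes * log2(p_yes) + p_no * log2(p_no))
print(round(H, 6))  # 0.970951
```

which matches the 0.970950594455 printed by calcShannonEnt later.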

trees.py

Compute the Shannon entropy of a given dataset:

from math import log

def calcShannonEnt(dataSet):
    # total number of instances
    numEntries = len(dataSet)
    # count how often each value of the target variable occurs
    labelCounts = {}
    for featVec in dataSet:
        # the last item of each instance is the target variable
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    # apply the entropy formula above
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt

Create the dataset:

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

Run

testTree.py

# -*- coding: utf-8 -*-
import trees

dataSet, labels = trees.createDataSet()
print(dataSet)  # [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(labels)   # ['no surfacing', 'flippers']

# compute the entropy
print(trees.calcShannonEnt(dataSet))  # 0.970950594455

# change one of the labels
dataSet[0][-1] = 'maybe'
print(dataSet)  # [[1, 1, 'maybe'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(trees.calcShannonEnt(dataSet))  # 1.37095059445

The higher the entropy, the more mixed the data.

Extension: another way to measure the disorder of a set is Gini impurity. Simply put, pick an item from the dataset at random and measure the probability that it would be classified into the wrong group.
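A minimal sketch of that Gini measure (my own illustration; the book does not implement it): for label proportions p_i, Gini impurity is 1 - sum(p_i^2).

```python
from collections import Counter

def gini(labels):
    # probability of drawing an item and then mislabeling it
    # according to the label distribution: 1 - sum(p_i^2)
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(round(gini(['yes', 'yes', 'no', 'no', 'no']), 4))  # 0.48
print(gini(['yes', 'yes']))                              # 0.0 (a pure set)
```

Like entropy, it is zero for a pure set and grows as the classes mix.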

Splitting the dataset

# axis is the column index
# returns the matching instances with that column removed
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # remove column `axis`
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

Run

testTree.py

print(dataSet)  # [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

# split the dataset
# instances whose column 0 equals 0
print(trees.splitDataSet(dataSet, 0, 0))  # [[1, 'no'], [1, 'no']]
# instances whose column 0 equals 1
print(trees.splitDataSet(dataSet, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]

The difference between append and extend:

>>> a = [1, 2, 3]
>>> b = [4, 5, 6]
>>> a.append(b)
>>> a
[1, 2, 3, [4, 5, 6]]
>>> a = [1, 2, 3]
>>> a.extend(b)
>>> a
[1, 2, 3, 4, 5, 6]

Choosing the best way to split the dataset:

def chooseBestFeatureToSplit(dataSet):
    # number of features; the last column is the target variable
    numFeatures = len(dataSet[0]) - 1
    # baseline Shannon entropy of the target variable
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    # iterate over the features; i is the column index
    for i in range(numFeatures):
        # all values of this feature, via a list comprehension
        featList = [example[i] for example in dataSet]
        # deduplicate the feature values
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            # the instances with this value, column i removed
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            # weighted (conditional) entropy after the split
            newEntropy += prob * calcShannonEnt(subDataSet)
        # information gain for this feature
        infoGain = baseEntropy - newEntropy
        # keep the feature with the largest gain; the larger the gain,
        # the better the feature separates the classes
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

Run

print(trees.chooseBestFeatureToSplit(dataSet))  # 0
# This tells us that feature 0 is the best feature to split the dataset on.

# Some intermediate values inside chooseBestFeatureToSplit(dataSet):
# baseEntropy: 0.970950594455
# column 0
#   value: 0
#   value: 1
#   newEntropy: 0.550977500433
# column 1
#   value: 0
#   value: 1
#   newEntropy: 0.8
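To make the newEntropy numbers above concrete, here is a hand check (my own, using the same formula): splitting on feature 0 sends the instances with value 0 to the labels ['no', 'no'] and those with value 1 to ['yes', 'yes', 'no'].

```python
from math import log2
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

base = entropy(['yes', 'yes', 'no', 'no', 'no'])
# weighted entropy of the two branches produced by feature 0
cond = 2/5 * entropy(['no', 'no']) + 3/5 * entropy(['yes', 'yes', 'no'])
print(round(cond, 6))         # 0.550978 -> matches newEntropy for column 0
print(round(base - cond, 6))  # 0.419973 -> information gain of feature 0
```

Feature 1's conditional entropy is 0.8, giving the smaller gain 0.170951, which is why feature 0 wins.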

Building the tree recursively

Return the class name that occurs most often:

import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

The tree-building function:

def createTree(dataSet, labels):
    # values of the target variable
    classList = [example[-1] for example in dataSet]
    # stop splitting when all of the classes are equal
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # stop splitting when there are no more features in dataSet
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        # copy the labels so recursion doesn't mess up the existing list
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

Run result

print("---createTree---")
print(trees.createTree(dataSet, labels))
"""
---createTree---
classList: ['yes', 'yes', 'no', 'no', 'no']
baseEntropy: 0.970950594455
value: 0
value: 1
newEntropy: 0.550977500433
value: 0
value: 1
newEntropy: 0.8
---
classList: ['no', 'no']
---
classList: ['yes', 'yes', 'no']
baseEntropy: 0.918295834054
value: 0
value: 1
newEntropy: 0.0
---
classList: ['no']
---
classList: ['yes', 'yes']

--- final result ---
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
"""

Drawing the tree with Matplotlib annotations in Python

The Matplotlib annotate function

testCreatePlot.py

import matplotlib.pyplot as plt

decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

# draw a single node
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType,
                            arrowprops=arrow_args)

def createPlot():
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    createPlot.ax1 = plt.subplot(111, frameon=False)  # ticks left on for demo purposes
    plotNode('a decision node', (0.5, 0.1), (0.1, 0.5), decisionNode)
    plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
    plt.show()

print(createPlot)
# <function createPlot at 0x0000000007636F98>

createPlot()
print(createPlot.ax1)
# AxesSubplot(0.125,0.11;0.775x0.77)

Note: the red coordinate marks were added afterwards; they are not produced by the program above.

Constructing the annotation tree

Getting the number of leaf nodes and the depth of the tree

testPlotTree.py

def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = next(iter(myTree))  # myTree.keys()[0] in Python 2
    secondDict = myTree[firstStr]
    for key in secondDict:
        # a dict value is a subtree; anything else is a leaf node
        if isinstance(secondDict[key], dict):
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = next(iter(myTree))
    secondDict = myTree[firstStr]
    for key in secondDict:
        if isinstance(secondDict[key], dict):
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth

def retrieveTree(i):
    listOfTrees = [
        {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
        {'no surfacing': {0: 'no', 1: {'flippers':
            {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}},
    ]
    return listOfTrees[i]

myTree = retrieveTree(0)
print("myTree:")
print(myTree)
print("getNumLeafs(myTree):")
print(getNumLeafs(myTree))
print("getTreeDepth(myTree):")
print(getTreeDepth(myTree))

# myTree:
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
# getNumLeafs(myTree):
# 3
# getTreeDepth(myTree):
# 2

treePlotter.py

# place text between a parent and a child node
def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

def plotTree(myTree, parentPt, nodeTxt):
    # the number of leaves determines the x width of this subtree
    numLeafs = getNumLeafs(myTree)
    depth = getTreeDepth(myTree)
    # the first key is the feature this node splits on; it labels the node
    firstStr = next(iter(myTree))
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW,
              plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDict:
        if isinstance(secondDict[key], dict):
            # a dict value is a subtree: recurse
            plotTree(secondDict[key], cntrPt, str(key))
        else:
            # otherwise it's a leaf node: draw it
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    # restore yOff before returning to the caller
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD

def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    # **axprops suppresses the tick marks
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    # draw the root node
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()

Usage

testPlotTree2.py

# -*- coding: utf-8 -*-
import treePlotter

myTree = treePlotter.retrieveTree(0)
print(myTree)
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

# draw the decision tree
treePlotter.createPlot(myTree)

Testing the algorithm: classifying with the decision tree

def classify(inputTree, featLabels, testVec):
    firstStr = next(iter(inputTree))  # inputTree.keys()[0] in Python 2
    secondDict = inputTree[firstStr]
    # map the feature name back to its column index
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):
        # still a subtree: keep descending
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else:
        # a leaf: this is the predicted class
        classLabel = valueOfFeat
    return classLabel

Run

testClassify.py

import treePlotter
import trees

dataSet, labels = trees.createDataSet()
myTree = treePlotter.retrieveTree(0)
print(myTree)

print(trees.classify(myTree, labels, [1, 0]))  # no
print(trees.classify(myTree, labels, [1, 1]))  # yes

Storing the decision tree on disk

To save computation time, it is best to reuse an already constructed decision tree whenever we classify.

import pickle

def storeTree(inputTree, filename):
    # pickle requires binary mode in Python 3
    with open(filename, 'wb') as fw:
        pickle.dump(inputTree, fw)

def grabTree(filename):
    with open(filename, 'rb') as fr:
        return pickle.load(fr)

Usage

testStoreTree.py

import trees
import treePlotter

myTree = treePlotter.retrieveTree(0)
# store the tree in 'classifierStorage.txt'
trees.storeTree(myTree, 'classifierStorage.txt')

# read it back
print(trees.grabTree('classifierStorage.txt'))
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
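One caveat worth noting (my own addition, not from the book): the tree is a plain nested dict, so json would also round-trip it, but JSON turns the integer feature values used as keys into strings, which would break classify(). pickle preserves the key types:

```python
import json

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
restored = json.loads(json.dumps(tree))
print(restored)
# the int keys 0 and 1 come back as the strings '0' and '1'
```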

Example: predicting contact lens type with a decision tree

The dataset lenses.txt

age         prescript  astigmatic  tearRate  class
young       myope      no          reduced   no lenses
young       myope      no          normal    soft
young       myope      yes         reduced   no lenses
young       myope      yes         normal    hard
young       hyper      no          reduced   no lenses
young       hyper      no          normal    soft
young       hyper      yes         reduced   no lenses
young       hyper      yes         normal    hard
pre         myope      no          reduced   no lenses
pre         myope      no          normal    soft
pre         myope      yes         reduced   no lenses
pre         myope      yes         normal    hard
pre         hyper      no          reduced   no lenses
pre         hyper      no          normal    soft
pre         hyper      yes         reduced   no lenses
pre         hyper      yes         normal    no lenses
presbyopic  myope      no          reduced   no lenses
presbyopic  myope      no          normal    no lenses
presbyopic  myope      yes         reduced   no lenses
presbyopic  myope      yes         normal    hard
presbyopic  hyper      no          reduced   no lenses
presbyopic  hyper      no          normal    soft
presbyopic  hyper      yes         reduced   no lenses
presbyopic  hyper      yes         normal    no lenses
  • prescript — the spectacle prescription
  • astigmatic — whether the patient is astigmatic
  • tearRate — tear production rate
  • pre — pre-presbyopic
  • presbyopic — presbyopic (age-related farsightedness)
  • myope — nearsighted
  • hyper — hypermetrope (farsighted)

Usage

testLenses.py

import trees
import treePlotter

fr = open('lenses.txt')
lenses = [inst.strip().split('\t') for inst in fr.readlines()]
lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
lensesTree = trees.createTree(lenses, lensesLabels)

print("lensesTree:")
print(lensesTree)
# {'tearRate': {'reduced': 'no lenses', 'normal': {'astigmatic': {'yes':
#   {'prescript': {'hyper': {'age': {'pre': 'no lenses', 'presbyopic':
#   'no lenses', 'young': 'hard'}}, 'myope': 'hard'}}, 'no': {'age':
#   {'pre': 'soft', 'presbyopic': {'prescript': {'hyper': 'soft',
#   'myope': 'no lenses'}}, 'young': 'soft'}}}}}}

treePlotter.createPlot(lensesTree)

The resulting decision tree plot

Summary

The decision tree above matches the experimental data very well, but there may be too many of these matching branches. We call this problem overfitting.

To reduce overfitting, we can prune the decision tree, cutting away unnecessary leaf nodes.

If a leaf node adds only a little information, it can be deleted and merged into a neighboring leaf node.

The algorithm described above is ID3, a good algorithm despite its flaws.

ID3 cannot directly handle numeric (int, double) features. Although numeric features can be converted to nominal ones by quantizing (binning) them, ID3 still runs into other problems when there are too many candidate feature splits.
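A minimal sketch of the quantization idea just mentioned (illustrative only; the threshold values are invented): map a numeric feature into nominal bins before handing the data to ID3.

```python
def discretize(value, thresholds, labels):
    # assign `value` to the first bin whose upper threshold it does not exceed;
    # labels must have one more entry than thresholds
    for t, label in zip(thresholds, labels):
        if value <= t:
            return label
    return labels[-1]

# e.g. turn a numeric age into the nominal values used by the lenses dataset
# (the cut-offs 30 and 50 are made up for illustration)
ages = [22, 45, 63]
print([discretize(a, [30, 50], ['young', 'pre', 'presbyopic']) for a in ages])
# ['young', 'pre', 'presbyopic']
```

With too many distinct bins, though, information gain starts to favor many-valued features, which is one of the "other problems" alluded to above (C4.5's gain ratio addresses it).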
