Tree Regression Source Code Analysis (1)

發(fā)布時(shí)間:2024/9/20 编程问答 25 豆豆
生活随笔 收集整理的這篇文章主要介紹了 树回归源码分析(1) 小編覺(jué)得挺不錯(cuò)的,現(xiàn)在分享給大家,幫大家做個(gè)參考.

線性回歸包含了強(qiáng)大的方法,但是需要擬合所有的數(shù)據(jù)集(局部加權(quán)線性回歸除外),但是當(dāng)數(shù)據(jù)特征復(fù)雜時(shí),構(gòu)建全局模型就難了,況且實(shí)際生活很多問(wèn)題都是非線性的,不可能使用全局線性模型來(lái)擬合所有的數(shù)據(jù)。
現(xiàn)有可以將數(shù)據(jù)集切分成很多易建模的數(shù)據(jù),然后再利用線性回歸技術(shù)建模,以就得到了CART——Classification And Regression Tree(分類(lèi)回歸樹(shù))的樹(shù)構(gòu)建算法,該算法既可以用于分類(lèi)還可以用于回歸。

Binary splitting: each split cuts the dataset into exactly two parts. If a sample's value for the chosen feature equals the value required by the split, the sample goes into the left subtree; otherwise it goes into the right subtree.

Binary splitting also makes it easy to adapt the tree-building process to continuous features. The rule becomes: if the feature value is greater than the given split value, the sample goes to the left subtree; otherwise it goes to the right subtree.

Because CART only makes binary splits, the tree's data structure can be fixed in advance. Each node is a dictionary with a 'left' key and a 'right' key, each holding either another subtree or a single leaf value, plus two more keys recording the feature and the feature value that define the split; a small example follows.
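For example, a small regression tree in this dictionary form might look like the following (the numbers are invented for illustration; by the convention of the code below, 'left' holds the branch where the feature value is greater than the split value):

# A hypothetical tree in the dictionary format described above:
# 'spInd' is the index of the split feature and 'spVal' the split value;
# 'left' and 'right' each hold either another such dict or a single leaf value.
tree = {'spInd': 0, 'spVal': 0.5,
        'left': {'spInd': 0, 'spVal': 0.75,   # samples with feature 0 > 0.5
                 'left': 1.2,                 # leaf: mean target where feature 0 > 0.75
                 'right': 0.9},               # leaf: mean target where 0.5 < feature 0 <= 0.75
        'right': -0.1}                        # leaf: samples with feature 0 <= 0.5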

1. CART算法用于回歸

幾點(diǎn)要明確的地方:

  • ID3算法會(huì)在給定節(jié)點(diǎn)時(shí)計(jì)算數(shù)據(jù)的混亂度,而連續(xù)數(shù)值的混亂度的度量用方差(平方誤差的均值),這里用總方差(平方誤差的總值),總方差可以通過(guò)均方差乘以數(shù)據(jù)集中的樣本點(diǎn)的個(gè)數(shù)來(lái)得到。
  • 源代碼有錯(cuò)誤的地方,解決方法:源代碼錯(cuò)誤修正
# -*- coding: utf-8 -*-
"""Created on Fri Nov 03 10:35:00 2017"""
from numpy import *

# Load the data
def loadDataSet(fileName):
    dataMat = []
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')   # the file is tab-delimited
        fltLine = map(float, curLine)        # map every field to float (Python 2: returns a list)
        dataMat.append(fltLine)              # collect all rows
    return dataMat

# Binary split of the dataset
def binSplitDataSet(dataSet, feature, value):
    # three arguments: the dataset, the feature to split on, and a value of that feature
    mat0 = dataSet[nonzero(dataSet[:, feature] > value)[0], :]   # rows where the feature is greater than value
    mat1 = dataSet[nonzero(dataSet[:, feature] <= value)[0], :]  # the remaining rows
    return mat0, mat1   # the two subsets produced by splitting on this feature column

# Generate a leaf node
def regLeaf(dataSet):
    return mean(dataSet[:, -1])   # in a regression tree, a leaf is the mean of the target values

# Error estimate: how disordered the continuous target values are
def regErr(dataSet):
    # var() returns the variance (mean squared error); multiply by the
    # number of samples in the dataset to get the total squared error
    return var(dataSet[:, -1]) * shape(dataSet)[0]

# Find the best split of the dataset, or generate a leaf when splitting
# is not worthwhile. leafType and errType are references to functions.
def chooseBestSplit(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    tolS = ops[0]; tolN = ops[1]   # tolS: minimum error reduction; tolN: minimum samples per split
    if len(set(dataSet[:, -1].T.tolist()[0])) == 1:   # all target values are equal: stop splitting
        print 'back from here 1 ..'
        return None, leafType(dataSet)
    m, n = shape(dataSet)   # size of the current dataset
    S = errType(dataSet)    # error of the unsplit data, compared against each candidate split
    bestS = inf; bestIndex = 0; bestValue = 0
    for featIndex in range(n - 1):   # loop over every feature except the last column (the target)
        for splitVal in set(dataSet[:, featIndex].T.A.tolist()[0]):   # every distinct value of this feature
            mat0, mat1 = binSplitDataSet(dataSet, featIndex, splitVal)   # cut the dataset in two
            if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN):   # enforce the minimum sample count
                continue
            newS = errType(mat0) + errType(mat1)   # total squared error of this split
            if newS < bestS:
                print 'featIndex,splitVal:', featIndex, 'and', splitVal
                bestIndex = featIndex   # remember the feature index and value of the best split so far
                bestValue = splitVal
                bestS = newS
    if (S - bestS) < tolS:   # the best split barely reduces the error: stop and create a leaf
        print 'back from here 2 ..'
        return None, leafType(dataSet)
    mat0, mat1 = binSplitDataSet(dataSet, bestIndex, bestValue)
    if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN):   # one side is too small: stop and create a leaf
        print 'back from here 3 ..'
        return None, leafType(dataSet)
    return bestIndex, bestValue   # no early-exit condition fired: return the split feature and value

# Build the tree by repeatedly finding the best binary split
def createTree(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
    # ops is a tuple of the parameters tree construction needs.
    # chooseBestSplit either returns a split, or (None, model value) when a stop condition is met.
    feat, val = chooseBestSplit(dataSet, leafType, errType, ops)
    print 'feat, val :', feat, 'and', val
    if feat == None:
        print 'back creatTree..'
        return val   # regression tree: the model is a constant; model tree: a linear equation
    retTree = {}
    retTree['spInd'] = feat
    retTree['spVal'] = val
    lSet, rSet = binSplitDataSet(dataSet, feat, val)   # no stop condition fired: split into two datasets
    retTree['left'] = createTree(lSet, leafType, errType, ops)    # recursive call on each half
    retTree['right'] = createTree(rSet, leafType, errType, ops)
    return retTree

# Main
testMat = mat(eye(4))
mat0, mat1 = binSplitDataSet(testMat, 1, 0.5)
print 'mat0:', mat0
print 'mat1:', mat1

myDat = loadDataSet('ex00.txt')
myMat = mat(myDat)
print createTree(myMat)

運(yùn)行結(jié)果:

mat0: [[ 0.  1.  0.  0.]]
mat1: [[ 1.  0.  0.  0.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  0.  1.]]
featIndex,splitVal: 0 and 0.302001
featIndex,splitVal: 0 and 0.55299
featIndex,splitVal: 0 and 0.378595
featIndex,splitVal: 0 and 0.406649
featIndex,splitVal: 0 and 0.475976
featIndex,splitVal: 0 and 0.48813
feat, val : 0 and 0.48813
featIndex,splitVal: 0 and 0.936783
featIndex,splitVal: 0 and 0.727098
featIndex,splitVal: 0 and 0.72312
featIndex,splitVal: 0 and 0.645762
featIndex,splitVal: 0 and 0.648675
featIndex,splitVal: 0 and 0.625336
featIndex,splitVal: 0 and 0.622398
featIndex,splitVal: 0 and 0.620599
back from here 2 ..
feat, val : None and 1.01809676724
back creatTree..
featIndex,splitVal: 0 and 0.302001
featIndex,splitVal: 0 and 0.347837
featIndex,splitVal: 0 and 0.346986
featIndex,splitVal: 0 and 0.188218
featIndex,splitVal: 0 and 0.048014
featIndex,splitVal: 0 and 0.343479
back from here 2 ..
feat, val : None and -0.0446502857143
back creatTree..
{'spInd': 0, 'spVal': 0.48813, 'right': -0.044650285714285719, 'left': 1.0180967672413792}

可以由運(yùn)行結(jié)果看出代碼的具體運(yùn)行過(guò)程:

  • 葉節(jié)點(diǎn)是相應(yīng)的目標(biāo)數(shù)據(jù)集的均值
  • 注意幾個(gè)切分停止得條件和返回葉節(jié)點(diǎn)

2. 樹(shù)剪枝

通過(guò)降低決策樹(shù)的復(fù)雜度來(lái)避免過(guò)擬合的過(guò)程稱(chēng)為剪枝,在上面的提前終止條件實(shí)際是一種預(yù)剪枝的操作。另一種是使用測(cè)試集和訓(xùn)練集,稱(chēng)為后剪枝。

  • 樹(shù)構(gòu)建算法其實(shí)對(duì)輸入的tolS和tolN非常敏感,也就是對(duì)提前終止的人為輸入?yún)?shù),其中tolS對(duì)誤差的數(shù)量級(jí)十分敏感,所以需要我們手動(dòng)調(diào)節(jié)參數(shù),但是通過(guò)不斷修改停止條件來(lái)得到合理的結(jié)果并不是很好的辦法,甚至有時(shí)候我們不確定到底我們需要什么樣的結(jié)果,于是有了通過(guò)測(cè)試集來(lái)對(duì)樹(shù)進(jìn)行剪枝,也就避免了用戶(hù)指定參數(shù)。

后剪枝

函數(shù)prune()的偽代碼如下:

基于已有的樹(shù)切分測(cè)試數(shù)據(jù):

  • If either subset is itself a tree, recurse the pruning process on that subset
  • Compute the error after merging the current two leaf nodes
  • Compute the error without merging
  • If merging lowers the error, merge the leaf nodes
# The functions below extend the previous listing (loadDataSet, binSplitDataSet,
# regLeaf, regErr, chooseBestSplit, createTree), which is unchanged and omitted here.

# Regression-tree pruning
def isTree(obj):
    return (type(obj).__name__ == 'dict')   # a node is a subtree iff it is stored as a dict

def getMean(tree):
    # Leaf-collapsing helper: a recursive function that walks the tree top-down
    # to the leaves, replacing every subtree by the average of its two children.
    if isTree(tree['right']): tree['right'] = getMean(tree['right'])
    if isTree(tree['left']): tree['left'] = getMean(tree['left'])
    return (tree['left'] + tree['right']) / 2.0   # the average value of the whole tree

def prune(tree, testData):
    # tree was built from the training set; testData is the test set
    if shape(testData)[0] == 0:   # no test data reaches this node: collapse the subtree
        return getMean(tree)
    if (isTree(tree['right']) or isTree(tree['left'])):   # at least one branch is still a subtree
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
    if isTree(tree['left']):    # prune the left subtree
        tree['left'] = prune(tree['left'], lSet)    # recursive call on the split test data
    if isTree(tree['right']):   # prune the right subtree
        tree['right'] = prune(tree['right'], rSet)
    # if neither child is a subtree any more, consider merging them
    if not isTree(tree['left']) and not isTree(tree['right']):
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
        errorNoMerge = sum(power(lSet[:, -1] - tree['left'], 2)) \
            + sum(power(rSet[:, -1] - tree['right'], 2))   # squared error without merging
        treeMean = (tree['left'] + tree['right']) / 2.0
        errorMerge = sum(power(testData[:, -1] - treeMean, 2))
        if errorMerge < errorNoMerge:   # compare the error before and after merging
            print "merging"
            return treeMean
        else:
            return tree
    else:
        return tree

# Main
myDat2 = loadDataSet('ex2.txt')
myMat2 = mat(myDat2)
myTree = createTree(myMat2, ops=(0, 1))   # createTree() returns a dict
myDatTest = loadDataSet('ex2test.txt')
myMat2Test = mat(myDatTest)
print '..............................'
print prune(myTree, myMat2Test)

運(yùn)行結(jié)果:

merging
merging
merging
merging
merging
merging
merging
merging
merging
merging
merging
merging
merging
...
... 'spVal': 0.965969, 'right': {'spInd': 0, 'spVal': 0.956951, 'right': 111.2013225, 'left': {'spInd': 0, 'spVal': 0.958512, 'right': 135.83701300000001, 'left': {'spInd': 0, 'spVal': 0.960398, 'right': 123.559747, 'left': 112.386764}}}, 'left': 92.523991499999994}}}}

可以看出大量的節(jié)點(diǎn)已經(jīng)被剪枝掉了,雖然看著還是那么多的節(jié)點(diǎn),但是確實(shí)已經(jīng)減少了很多了,一般情況下為了尋求最佳模型可以同時(shí)使用預(yù)剪枝和后剪枝兩種技術(shù)。

Notes:

  • The collapse step (getMean) walks the tree from the top down to the leaves; whenever it finds two leaf nodes it replaces them with their average, finally returning the average value of the whole subtree.
  • Pay attention to the recursive calls in the pruning code; following them requires a clear grasp of the tree data structure.

3. Model trees

簡(jiǎn)單來(lái)說(shuō)就是把原來(lái)的葉節(jié)點(diǎn)由常數(shù)值變?yōu)榉侄尉€性函數(shù),所謂的分段線性就是指模型由多個(gè)線性片段組成。也就是在某些情況下,分段線性要比很多節(jié)點(diǎn)組成的一顆大樹(shù)更容易解釋。

  • 模型樹(shù)的可解釋性?xún)?yōu)于回歸樹(shù)的,另外模型樹(shù)也具有更高的預(yù)測(cè)準(zhǔn)確度。
  • 前面用于回歸樹(shù)的誤差計(jì)算方法這里不能再用。稍加變化,對(duì)于給定的數(shù)據(jù)集,應(yīng)該先用線性的模型來(lái)對(duì)它進(jìn)行擬合,然后計(jì)算真實(shí)的目標(biāo)值與模型預(yù)測(cè)值間的差值。最后將這些差值的平方求和就得到了所需的誤差

在CART算法用于回歸代碼中加入下面的函數(shù),并且把主函數(shù)改為如下:

# 模型樹(shù)的葉節(jié)點(diǎn)生成函數(shù) def linearSolve(dataSet): m,n = shape(dataSet)X = mat(ones((m,n))); Y = mat(ones((m,1))) X[:,1:n] = dataSet[:,0:n-1]; Y = dataSet[:,-1] # 將數(shù)據(jù)集格式化成目標(biāo)變量和自變量 xTx = X.T*Xif linalg.det(xTx) == 0.0: # 這個(gè)矩陣是奇異的,不能求逆raise NameError('This matrix is singular, cannot do inverse,\n\try increasing the second value of ops')ws = xTx.I * (X.T * Y) # 線性回歸的系數(shù)return ws,X,Ydef modelLeaf(dataSet): # 生成葉節(jié)點(diǎn)模型ws,X,Y = linearSolve(dataSet)return ws # 返回回歸系數(shù)def modelErr(dataSet):ws,X,Y = linearSolve(dataSet)yHat = X * wsreturn sum(power(Y - yHat,2)) # 在給定數(shù)據(jù)集上計(jì)算誤差,返回平方誤差# 主函數(shù)# 模型樹(shù) myMat2=mat(loadDataSet('exp2.txt')) modelTree=createTree(myMat2, modelLeaf,modelErr,(1,10)) print '模型樹(shù):',modelTree

運(yùn)行結(jié)果:

模型樹(shù): {'spInd': 0, 'spVal': 0.285477, 'right': matrix([[ 3.46877936],[ 1.18521743]]), 'left': matrix([[ 1.69855694e-03],[ 1.19647739e+01]])}

由運(yùn)行的結(jié)果可以看出:
分段線性生成的模型:
y=3.468+1.18521743x
y=0.0016985+11.96477x
而數(shù)據(jù)是由模型:
y=3.5+1.0x
y=0.0+12x再加上高斯噪聲生成的。

兩個(gè)模型已經(jīng)非常接近了。

4. 樹(shù)回歸的比較

模型樹(shù)、回歸樹(shù)以及第8章里的其他模型,哪一種模型更好呢?一個(gè)比較客觀的方法是計(jì)算相關(guān)系數(shù),也稱(chēng)為R2值。該相關(guān)系數(shù)可以通過(guò)調(diào)用Numpy庫(kù)中的命令corrcoef(yHat,y,rowvar)來(lái)求解。

# The functions below extend the previous listings (loadDataSet, binSplitDataSet,
# regLeaf, regErr, chooseBestSplit, createTree, isTree, linearSolve, modelLeaf,
# modelErr), which are unchanged and omitted here.

# Prediction at a regression-tree leaf
def regTreeEval(model, inDat):
    return float(model)   # the leaf already stores the predicted value

# Prediction at a model-tree leaf
def modelTreeEval(model, inDat):
    n = shape(inDat)[1]        # format the input
    X = mat(ones((1, n + 1)))  # prepend a column of 1s (the bias term) to the data
    X[:, 1:n+1] = inDat
    return float(X * model)

# Walk the tree top-down until a leaf node is hit
def treeForeCast(tree, inData, modelEval=regTreeEval):
    if not isTree(tree):   # not a tree dict: evaluate the leaf and return
        return modelEval(tree, inData)
    if inData[tree['spInd']] > tree['spVal']:
        if isTree(tree['left']):
            return treeForeCast(tree['left'], inData, modelEval)
        else:
            return modelEval(tree['left'], inData)
    else:
        if isTree(tree['right']):
            return treeForeCast(tree['right'], inData, modelEval)
        else:
            return modelEval(tree['right'], inData)

def createForeCast(tree, testData, modelEval=regTreeEval):
    m = len(testData)
    yHat = mat(zeros((m, 1)))
    for i in range(m):
        yHat[i, 0] = treeForeCast(tree, mat(testData[i]), modelEval)
    return yHat

# Build a regression tree from the training data
trainMat = mat(loadDataSet('bikeSpeedVsIq_train.txt'))
testMat = mat(loadDataSet('bikeSpeedVsIq_test.txt'))
myTree = createTree(trainMat, ops=(1, 20))
yHat = createForeCast(myTree, testMat[:, 0])
coefficient_regtree = corrcoef(yHat, testMat[:, 1], rowvar=0)[0, 1]

# Build a model tree from the same training data
myTree = createTree(trainMat, modelLeaf, modelErr, ops=(1, 20))
yHat = createForeCast(myTree, testMat[:, 0], modelTreeEval)
coefficient_modeltree = corrcoef(yHat, testMat[:, 1], rowvar=0)[0, 1]

print 'regtree coefficient:', coefficient_regtree
print 'modeltree coefficient:', coefficient_modeltree

運(yùn)行結(jié)果:

...
regtree coefficient: 0.964085231822
modeltree coefficient: 0.976041219138

Since an R² value closer to 1.0 is better, the results above show that the model tree outperforms the regression tree here, while ordinary linear regression does even worse than the regression tree on this data.
