Model Development: Code for a GBDT Decision Tree Model

發(fā)布時(shí)間:2025/3/21 编程问答 43 豆豆
生活随笔 收集整理的這篇文章主要介紹了 模型开发-GBDT决策树模型开发代码 小編覺(jué)得挺不錯(cuò)的,現(xiàn)在分享給大家,幫大家做個(gè)參考.


GBDT (Gradient Boosting Decision Tree), also known as MART (Multiple Additive Regression Tree), is an iterative decision tree algorithm in which the model is an ensemble of many decision trees. When it was first proposed it was regarded, alongside SVM, as one of the algorithms with the strongest generalization ability, and in recent years it has drawn wide attention through its use in major data competitions. This article walks through basic, debugged model development code for learning and reference.

The files can be downloaded here for testing: https://download.csdn.net/download/iqdutao/12676802

The required data files are the training set 000000_train and the test set 000000_test: plain comma-separated text files in which the first two columns are not used as features, columns 2 through the second-to-last hold the feature values, and the last column is the 0/1 class label.

I. GBDT Decision Tree Model Training

1. Import the required packages

from sklearn.ensemble import GradientBoostingClassifier  # classification model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import confusion_matrix
from sklearn.externals import joblib  # removed in scikit-learn >= 0.23; use "import joblib" there instead
from sklearn import metrics
import numpy as np
import random
import os

2. Reading the external data

def sampleDataFunction(fileName, sampleLines, seed):
    numFeat = len(open(fileName).readline().split(','))  # count the number of columns
    labelMat = []
    countSum = 0
    print("Number of columns: %.4f" % numFeat)
    with open(fileName, 'r') as finput:
        for line in finput:  # iterate over every line of the raw dataset
            countSum = countSum + 1
            # print(countSum)
            labelMat.append(line)  # collect each raw line
    random.seed(seed)
    if sampleLines == all:  # the built-in all doubles as a "take every line" sentinel
        sampleLines = countSum
    sampleData = random.sample(labelMat, sampleLines)
    return sampleData

sampleData = sampleDataFunction('000000_train', all, 8)

The external data reader is sampleDataFunction(fileName, sampleLines, seed); a usage sketch follows below.
fileName is the path of the file to read. CSV and TXT both work, as long as the values are separated by ","; on Windows, replace the backslashes in the path with forward slashes (or escape them).
sampleLines is how many lines to take from the file: sampleLines=all (the Python built-in, used as a sentinel) takes the whole file, while sampleLines=10 samples 10 lines at random.
seed is the random seed.
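For instance, a quick spot-check might draw only a handful of rows; a minimal sketch reusing the function above (the line count of 10 is illustrative):

sample10 = sampleDataFunction('000000_train', 10, 8)  # 10 random lines, seed 8
print(len(sample10))  # -> 10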

3. Data extraction

def loadDataSet(slice):  # note: the parameter name shadows the built-in slice
    numFeat = len(slice[1].split(','))  # count the number of columns
    dataMat = []
    labelMat = []
    countSum = 1
    print("Number of columns: %.4f" % numFeat)
    print("Number of rows: %.4f" % len(slice))
    for line in range(len(slice)):
        countSum = countSum + 1
        # print(countSum)
        lineArr = []
        curLine = slice[line].split(',')  # the fields of this row, as a list
        for i in range(2, numFeat - 1):  # numFeat - 1 because the last column is the class label, not a feature; the features run from column 2 to the second-to-last column
            lineArr.append(float(curLine[i]))  # append each feature value to lineArr
        dataMat.append(lineArr)  # then append the row to dataMat
        labelMat.append(float(curLine[-1]))  # append to the label list; the last column (-1) is the label
    return dataMat, labelMat

dataMat, labelMat = loadDataSet(sampleData)

The data extraction call is dataMat, labelMat = loadDataSet(sampleData); a worked row example follows below.
sampleData is the output of the function above; dataMat and labelMat are the feature part and the label part of the data.
for i in range(2, numFeat - 1) reads the features of each row; you must specify which columns are the features. Here they run from column 2 through the second-to-last column.
In labelMat.append(float(curLine[-1])), the index -1 is the position of the label within each row, i.e. the last column.
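As a concrete illustration of this slicing, take a made-up row with two leading id columns, three features, and a trailing label (all values are hypothetical):

row = "1001,20200801,0.5,1.2,3.4,1"
fields = row.split(',')
features = [float(v) for v in fields[2:-1]]  # columns 2 .. second-to-last -> [0.5, 1.2, 3.4]
label = float(fields[-1])                    # last column -> 1.0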

4. Splitting the dataset

# x is the feature part of the dataset, y is the label.
x_train, x_test, y_train, y_test = train_test_split(dataMat, labelMat, test_size=0.3, random_state=1)

test_size = 0.3 means 70% of the data goes to the training set and 30% to the validation set; a stratified variant is sketched below.
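Because the classes here are imbalanced (3,990 positives vs. 6,010 negatives in the run below), one optional refinement is a stratified split so that both subsets keep the same class ratio; a minimal sketch:

x_train, x_test, y_train, y_test = train_test_split(
    dataMat, labelMat,
    test_size=0.3,
    random_state=1,
    stratify=labelMat)  # preserve the positive/negative ratio in both splits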

5. Model training parameters

# configure (and tune) the GBDT parameters
gbdt = GradientBoostingClassifier(
    loss='deviance',
    learning_rate=0.01,   # learning rate
    n_estimators=80,      # number of boosting iterations
    subsample=0.6,        # each iteration trains on a random 60% of the samples
    max_features='sqrt',
    max_depth=6,
    verbose=2)

gbdt_model = gbdt.fit(x_train, y_train)  # start training
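The comment above mentions tuning. One common way to tune these parameters is a cross-validated grid search; a minimal sketch, where the grid values are illustrative assumptions rather than tuned results:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [40, 80, 120],
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.05],
}
search = GridSearchCV(
    GradientBoostingClassifier(subsample=0.6, max_features='sqrt'),
    param_grid, scoring='roc_auc', cv=3)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)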

6. Ranking the feature importances

# rank the features by importance weight
# feat_labels = x_train.columns[2:]  # feature column names (if x_train were a DataFrame)
importances = gbdt_model.feature_importances_  # each feature's share of the model's importance
indices = np.argsort(importances)[::-1]  # reverse the ascending sort: most important feature first
print(indices)
print(len(indices))

for f in range(len(indices)):
    print("%2d) %-*s %f" % (f, 30, indices[f], importances[indices[f]]))

After training, gbdt_model.feature_importances_ gives each variable's weight in the model; these weights can be used to drop uninformative variables, as sketched below.
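One simple way to act on the weights is to keep only the features above a small cutoff; a minimal sketch, where the 0.001 threshold is an illustrative assumption:

keep = importances > 0.001                   # boolean mask over the feature columns
x_train_kept = np.array(x_train)[:, keep]    # drop the near-zero-importance features
x_test_kept = np.array(x_test)[:, keep]
print("kept %d of %d features" % (keep.sum(), len(keep)))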

7. Saving the model

# save the model, removing any stale copy first
if os.path.exists("model_train.m"):
    os.remove("model_train.m")
    print("model_train.m existed and was deleted")
joblib.dump(gbdt_model, "model_train.m")

8. Evaluation metrics

# evaluate the trained model on the training set
prediction_train = gbdt_model.predict(x_train)
prediction_train_predprob = gbdt_model.predict_proba(x_train)[:, 1]

# evaluate the trained model on the validation set
prediction_test = gbdt_model.predict(x_test)
prediction_test_predprob = gbdt_model.predict_proba(x_test)[:, 1]

print("Total samples: %.4f" % len(dataMat))
print("Training samples: %.4f" % len(x_train))
print("Validation samples: %.4f" % len(x_test))
print("Positive samples: %.4f" % labelMat.count(1))
print("Negative samples: %.4f" % labelMat.count(0))

print("------------------------------------------training-set metrics")
testAccuracy = metrics.accuracy_score(y_train, prediction_train)
print("Training accuracy: %.4f" % testAccuracy)

average_precision = metrics.average_precision_score(y_train, prediction_train)
print("average_precision: %.4f" % average_precision)

# precision
averagePrecisionScore = metrics.precision_score(y_train, prediction_train)
print("Precision: %.4f" % averagePrecisionScore)

# recall
returnResultScore = metrics.recall_score(y_train, prediction_train)
print("Recall: %.4f" % returnResultScore)

# F1 score
F1Score = metrics.f1_score(y_train, prediction_train)
print("F1-Score: %.4f" % F1Score)

# AUC from the predicted probabilities
roc_auc_predprob = metrics.roc_auc_score(y_train, prediction_train_predprob)
print("AUC-Score_predprob: %.4f" % roc_auc_predprob)

print("------------------------------------------validation-set metrics")
testAccuracy = metrics.accuracy_score(y_test, prediction_test)
print("Validation accuracy: %.4f" % testAccuracy)

# precision
averagePrecisionScore = metrics.precision_score(y_test, prediction_test)
print("Precision: %.4f" % averagePrecisionScore)

# recall
returnResultScore = metrics.recall_score(y_test, prediction_test)
print("Recall: %.4f" % returnResultScore)

# F1 score
F1Score = metrics.f1_score(y_test, prediction_test)
print("F1-Score: %.4f" % F1Score)

# AUC from the predicted probabilities
roc_auc = metrics.roc_auc_score(y_test, prediction_test_predprob)
print("AUC-Score: %.4f" % roc_auc)

The evaluation metrics are computed by calling the corresponding functions in sklearn.metrics directly; a confusion-matrix sketch follows below.
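confusion_matrix is imported above but never used; a minimal sketch of how it could round out the metrics, assuming the variables from step 8 are still in scope:

cm = confusion_matrix(y_test, prediction_test)
# rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(cm)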

9. Output

Number of columns: 150.0000
Number of columns: 150.0000
Number of rows: 10000.0000
      Iter       Train Loss      OOB Improve   Remaining Time
         1           1.3449           0.0093           19.39s
         2           1.3363           0.0081           14.51s
         3           1.3314           0.0087           12.54s
         4           1.3184           0.0079           10.92s
         5           1.3138           0.0064            9.71s
         6           1.3052           0.0076            9.34s
         7           1.2981           0.0083            8.92s
         8           1.2913           0.0077            8.38s
         9           1.2828           0.0084            7.97s
        10           1.2724           0.0082            7.67s
        11           1.2650           0.0070            7.34s
        12           1.2606           0.0070            7.20s
*****************************************************
        76           0.9596           0.0026            0.38s
        77           0.9539           0.0023            0.28s
        78           0.9500           0.0029            0.19s
        79           0.9472           0.0029            0.09s
        80           0.9477           0.0025            0.00s
 0) 44                             0.127963
 1) 42                             0.096159
 2) 30                             0.089436
 3) 28                             0.075085
 4) 16                             0.067578
 5) 35                             0.046242
 6) 14                             0.041790
*****************************************
142) 12                            0.000008
143) 108                           0.000006
144) 6                             0.000000
145) 26                            0.000000
146) 40                            0.000000
Total samples: 10000.0000
Training samples: 7000.0000
Validation samples: 3000.0000
Positive samples: 3990.0000
Negative samples: 6010.0000
------------------------------------------training-set metrics
Training accuracy: 0.8774
average_precision: 0.7996
Precision: 0.9364
Recall: 0.7447
F1-Score: 0.8296
AUC-Score_predprob: 0.9438
------------------------------------------validation-set metrics
Validation accuracy: 0.8553
Precision: 0.9033
Recall: 0.7097
F1-Score: 0.7949
AUC-Score: 0.9327


II. GBDT Decision Tree Model Prediction

1. Model prediction code

The data loading and data extraction code is unchanged; what is added is loading the saved model and evaluating it on the test set.

from sklearn.ensemble import GradientBoostingClassifier  # classification model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import confusion_matrix
from sklearn.externals import joblib  # removed in scikit-learn >= 0.23; use "import joblib" there instead
from sklearn import metrics
import numpy as np
import random
import os

def sampleDataFunction(fileName, sampleLines, seed):
    numFeat = len(open(fileName).readline().split(','))  # count the number of columns
    labelMat = []
    countSum = 0
    print("Number of columns: %.4f" % numFeat)
    with open(fileName, 'r') as finput:
        for line in finput:  # iterate over every line of the raw dataset
            countSum = countSum + 1
            # print(countSum)
            labelMat.append(line)  # collect each raw line
    random.seed(seed)
    if sampleLines == all:  # the built-in all doubles as a "take every line" sentinel
        sampleLines = countSum
    sampleData = random.sample(labelMat, sampleLines)
    return sampleData

sampleData = sampleDataFunction('000000_test', all, 8)

def loadDataSet(slice):
    numFeat = len(slice[1].split(','))  # count the number of columns
    dataMat = []
    labelMat = []
    countSum = 1
    print("Number of columns: %.4f" % numFeat)
    print("Number of rows: %.4f" % len(slice))
    for line in range(len(slice)):
        countSum = countSum + 1
        # print(countSum)
        lineArr = []
        curLine = slice[line].split(',')  # the fields of this row, as a list
        for i in range(2, numFeat - 1):  # numFeat - 1 because the last column is the class label, not a feature
            lineArr.append(float(curLine[i]))  # append each feature value
        dataMat.append(lineArr)  # then append the row to dataMat
        labelMat.append(float(curLine[-1]))  # the last column is the label
    return dataMat, labelMat

dataMat, labelMat = loadDataSet(sampleData)

# load the saved model
gbdt_model = joblib.load("model_train.m")
print(gbdt_model)

# evaluate the loaded model on the test set
prediction_test = gbdt_model.predict(dataMat)
prediction_test_predprob = gbdt_model.predict_proba(dataMat)[:, 1]

print("------------------------------------------test-set metrics")
testAccuracy = metrics.accuracy_score(labelMat, prediction_test)
print("Test accuracy: %.4f" % testAccuracy)

# precision
averagePrecisionScore = metrics.precision_score(labelMat, prediction_test)
print("Precision: %.4f" % averagePrecisionScore)

# recall
returnResultScore = metrics.recall_score(labelMat, prediction_test)
print("Recall: %.4f" % returnResultScore)

# F1 score
F1Score = metrics.f1_score(labelMat, prediction_test)
print("F1-Score: %.4f" % F1Score)

# AUC
roc_auc = metrics.roc_auc_score(labelMat, prediction_test_predprob)
print("AUC-Score: %.4f" % roc_auc)
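To score new, unlabeled rows instead of evaluating labeled test data, the same loaded model can emit probabilities directly; a minimal sketch (0.5 is the default decision threshold of a binary classifier):

scores = gbdt_model.predict_proba(dataMat)[:, 1]   # probability of the positive class
flags = (scores >= 0.5).astype(int)                # same labels predict() would assign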

2. Evaluation output

Number of columns: 150.0000
Number of columns: 150.0000
Number of rows: 1000.0000
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.01, loss='deviance', max_depth=6,
              max_features='sqrt', max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=80,
              n_iter_no_change=None, presort='auto', random_state=None,
              subsample=0.6, tol=0.0001, validation_fraction=0.1,
              verbose=2, warm_start=False)
------------------------------------------test-set metrics
Test accuracy: 0.7920
Precision: 0.8955
Recall: 0.5714
F1-Score: 0.6977
AUC-Score: 0.9055

