Classifying News Data with Adaboost

Using the Adaboost method, weak classifiers are built from one-level decision-tree stumps of the form (feature, threshold, positive/negative).
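As a minimal illustration of that form (the helper name stump_predict is mine, not from the post, but the logic mirrors classifydoclist in the full code further below): a stump checks whether its feature word appears in the document and compares that 0/1 value against the threshold, with the symbol deciding which side is labeled positive.

def stump_predict(stump, doc_words, featurelist):
    # stump = (featureindex, threshold, symbol); symbol is +1 or -1.
    featureindex, threshold, symbol = stump
    present = 1 if featurelist[featureindex] in doc_words else 0
    # If the presence value matches the threshold, output symbol; otherwise the opposite label.
    return symbol if present == threshold else -symbol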

The maximum number of rounds was set to 20; in the end only 16 rounds were needed to bring the total error down to 0, giving a classifier with 100% accuracy on the training set.

The data covers three categories: business, sports, and auto.

Since this Adaboost implementation is a binary classifier, the corpus is split into business vs. non-business, labeled 1 and -1 respectively. Features come from the word segmentation results, keeping only words of two or more characters. The threshold is likewise just a 0/1 indicator of whether the word appears in the document.
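A rough sketch of that preprocessing, assuming (as the full code below does) that each file name contains its category; the helper names label_docs and build_vocabulary are illustrative, not part of the original program.

def label_docs(doc_names):
    # business -> +1, everything else (sports, auto, ...) -> -1
    return [1 if 'business' in name else -1 for name in doc_names]

def build_vocabulary(segmented_docs):
    # segmented_docs: one list of already-segmented words per document;
    # keep only words of two or more characters as features.
    vocab = set()
    for words in segmented_docs:
        vocab |= set(w for w in words if len(w) > 1)
    return sorted(vocab)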

The final results are shown below. Notice that some features are selected more than once: the feature 體育 (sports), for example, was chosen several times, and each time the resulting stump was identical. During training, the overall classification accuracy also fluctuated up and down rather than improving monotonically.

Looking at the selected feature words, the earlier ones are fairly sensible, but the later ones are less so, e.g. 編輯 (editor), 第一 (first), 來源 (source), which suggests the model overfits to some extent.

result stumplist is: [(2404, 0, -1), (32590, 0, -1), (19569, 0, 1), (12171, 0, 1), (29965, 0, -1), (15667, 0, 1), (12171, 0, 1), (32687, 0, -1), (25944, 0, 1), (12171, 0, 1), (32890, 0, -1), (2404, 0, -1), (4840, 0, -1), (15667, 0, 1), (9642, 0, 1), (8630, 0, -1)]
result features are: 財經 股票 汽車 體育 銀行 編輯 體育 作者 責任 體育 教育 財經 指出 編輯 第一 來源

The output from training a single stump within one round looks like this:

-------------------- Train stump round 9 --------------------
>>train featureindex is 0
get new min stump (0, 0, -1) feature is 石塊 , error is 0.408032946166
get new min stump (1, 0, -1) feature is 基建 , error is 0.402227800978
get new min stump (12, 0, -1) feature is 律師 , error is 0.397832761677
get new min stump (25, 0, -1) feature is 合理 , error is 0.385203075673
get new min stump (108, 0, -1) feature is 首席 , error is 0.382699443012
get new min stump (316, 0, -1) feature is 證券 , error is 0.344348559772
>>train featureindex is 1000
>>train featureindex is 2000
get new min stump (2258, 0, -1) feature is 政府 , error is 0.341484530448
get new min stump (2404, 0, -1) feature is 財經 , error is 0.27536370978
>>train featureindex is 3000
>>train featureindex is 4000
>>train featureindex is 5000
>>train featureindex is 6000
>>train featureindex is 7000
>>train featureindex is 8000
>>train featureindex is 9000
>>train featureindex is 10000
>>train featureindex is 11000
>>train featureindex is 12000
get new min stump (12171, 0, 1) feature is 體育 , error is 0.261865947444
>>train featureindex is 13000
>>train featureindex is 14000
>>train featureindex is 15000
>>train featureindex is 16000
>>train featureindex is 17000
>>train featureindex is 18000
>>train featureindex is 19000
>>train featureindex is 20000
>>train featureindex is 21000
>>train featureindex is 22000
>>train featureindex is 23000
>>train featureindex is 24000
>>train featureindex is 25000
>>train featureindex is 26000
>>train featureindex is 27000
>>train featureindex is 28000
>>train featureindex is 29000
>>train featureindex is 30000
>>train featureindex is 31000
>>train featureindex is 32000
>>train featureindex is 33000
>>train featureindex is 34000
>>train featureindex is 35000
>>train featureindex is 36000
>>train featureindex is 37000
>>train featureindex is 38000
this round stump is (12171, 0, 1)
totallabelerror is 0.00293542074364
One thing to watch out for when writing the Adaboost program: within each round, the single stump is trained against the new weightlist, and the weightlist-weighted overall error of each candidate stump is what decides which stump is best. After the round's stump training ends (i.e., the best stump for this round has been chosen, just before moving on to the next round), you also need to compute the classification error of the whole Adaboost model. For that, the predictions of all previous stumps, each scaled by its weight (the cm value, called alpha in some references), are added to the current stump's weighted predictions, and the sign function of that sum gives the ensemble's labels (the previous running total can be kept around, so each round only needs to add the current stump's contribution instead of recomputing from scratch); from those labels you get the overall classification accuracy.

These two error rates are different things: the first, which only needs the weighting by the new weightlist, selects the best stump; the second, which must take the predictions of all previously chosen stumps into account, drives the training of the whole Adaboost model. I had mixed the two up at first, which is why my predictions kept coming out wrong.
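To make the distinction concrete, here is a small sketch of the two computations; the helper names are hypothetical, and the real code below keeps a running totallabelpredict sum instead of re-summing every round.

def weighted_stump_error(true_labels, stump_labels, weights):
    # Used inside each round to pick the best stump: the sum of the current
    # weights of the documents this candidate stump misclassifies.
    return sum(w for t, p, w in zip(true_labels, stump_labels, weights) if t != p)

def ensemble_error(true_labels, stump_label_lists, cms):
    # Used after each round to decide whether to stop: the plain 0/1 error of
    # sign(sum_m cm_m * h_m(x)) over all stumps chosen so far.
    wrong = 0
    for i, t in enumerate(true_labels):
        score = sum(cm * labels[i] for cm, labels in zip(cms, stump_label_lists))
        pred = 1 if score > 0 else -1
        if pred != t:
            wrong += 1
    return float(wrong) / len(true_labels)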

The Python implementation of the Adaboost algorithm is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 implementation of Adaboost with decision-stump weak classifiers.
from numpy import *
import os

def getwordset(doclist):
    '''Build the global vocabulary and the per-document word sets.'''
    wordset = set()
    docwordset = []
    for doc in doclist:
        f = open(os.path.join(os.getcwd(), DIR, doc))
        content = f.read()
        words = content.split(' ')
        words = [word.strip() for word in words if len(word) > 4]
        wordset |= set(words)
        docwordset.append(list(set(words)))
        f.close()
    return list(wordset), docwordset

def savefeaturelist(featurelist):
    f = open('featurelist', 'w')
    for feature in featurelist:
        f.write(feature + ' ')
    f.close()

def classifydoclist(stump, doclist, weightlist):
    '''Label every document with a single stump (featureindex, threshold, symbol).'''
    labellist = []
    for i in range(len(doclist)):
        exist = 0
        words = docfeaturelist[i]
        if featurelist[stump[0]] in words:  # Notice: 'feature in words', NOT 'featureindex in words'
            exist = 1
        # append classify doc label
        if exist == stump[1]:
            labellist.append(stump[2])
        else:
            labellist.append(-1 * stump[2])
    return labellist

def trainstump(doclist, featurelist, weightlist):
    '''Pick the stump with the lowest weighted error under the current weightlist.'''
    # stump is (featureindex, threshold, positive/negative)
    minstump = (0, 0, 0)
    minerror = 1.0
    minlabellist = []
    for featureindex in range(len(featurelist)):
        if featureindex % 1000 == 0:
            print '>>train featureindex is', featureindex
        for threshold in [0, 1]:
            for symbol in [-1, 1]:
                stump = (featureindex, threshold, symbol)
                labellist = classifydoclist(stump, doclist, weightlist)
                error = float(abs(array(doclabellist) - array(labellist)) / 2 * mat(weightlist).T)
                if error < minerror:
                    minstump = stump
                    minerror = error
                    minlabellist = labellist
                    print 'get new min stump', stump, 'feature is', featurelist[minstump[0]], ', error is', error
        if minerror == 0.0:
            break
    return minstump, minerror, minlabellist

def getcm(error):
    '''Stump weight cm (alpha). Sometimes this is multiplied by 0.5.'''
    error = max(error, 1e-16)
    return log((1.0 - error) / error)

def updateweightlist(weightlist, cm, labellist):
    '''Increase the weight of misclassified documents, then renormalize.'''
    minus = getclassifydiff(labellist)
    weightlist = [weightlist[i] * exp(cm * minus[i]) for i in range(len(weightlist))]
    weightlist = [weight / sum(weightlist) for weight in weightlist]
    return weightlist

def sign(plist):
    '''Map a list of real-valued scores to +1/-1 labels.'''
    result = [-1 for i in range(len(plist))]
    for i in range(len(plist)):
        if plist[i] > 0:
            result[i] = 1
    return result

def getclassifydiff(plabellist):
    '''1 for a misclassified document, 0 for a correct one.'''
    return list(abs(array(doclabellist) - array(plabellist)) / 2)

def getclassifyerror(plabellist):
    '''Plain (unweighted) 0/1 error rate of a predicted label list.'''
    minus = getclassifydiff(plabellist)
    return 1.0 * minus.count(1) / len(plabellist)

def traindata(doclist, featurelist):
    stumplist = []
    cmlist = []
    max_k = 20
    totallabelpredict = array([0.0 for i in range(len(doclabellist))])
    weightlist = [1.0 / len(doclist) for i in range(len(doclist))]
    for i in range(max_k):
        print '\n', '-' * 20, 'Train stump round', i, '-' * 20
        stump, error, labellist = trainstump(doclist, featurelist, weightlist)
        print 'this round stump is', stump
        cm = getcm(error)
        stumplist.append(stump)
        cmlist.append(cm)
        # check total predict result error.
        totallabelpredict += cm * array(labellist)
        totallabelerror = getclassifyerror(sign(totallabelpredict))
        print 'totallabelerror is', totallabelerror
        if totallabelerror == 0.0:
            break
        # update weight list.
        weightlist = updateweightlist(weightlist, cm, labellist)
    print '\n\nTrain data done!'
    # save model to file
    model = open('Adaboostmodel', 'w')
    model.write('cmlist:\n')
    model.write(str(cmlist) + '\n')
    model.write('stumplist:\n')
    model.write(str(stumplist) + '\n')
    model.write('stump features are:\n')
    print 'result stumplist is:', stumplist
    print 'result features are:'
    for s in stumplist:
        print featurelist[s[0]]
        model.write(str(featurelist[s[0]]) + '\n')
    model.close()
    print 'save model to file done!'

def getdoclabellist(doclist):
    '''sports is -1, business is 1. Two-class classify (business and not-business).'''
    labellist = [-1 for i in range(len(doclist))]
    for i in range(len(doclist)):
        if 'business' in doclist[i]:
            labellist[i] = 1
    return labellist

def adaboost():
    global DIR
    global doclist, featurelist, docfeaturelist, doclabellist
    DIR = 'news'
    print 'Arthur adaboost test begin...'
    print 'doc path DIR is:', DIR
    doclist = os.listdir(os.path.join(os.getcwd(), DIR))
    doclist.sort()
    print 'total doc size:', len(doclist)
    # Get doc real label. train stump with this list!
    doclabellist = getdoclabellist(doclist)
    featurelist, docfeaturelist = getwordset(doclist)
    print 'total feature size:', len(featurelist)
    # train data to get stumps.
    traindata(doclist, featurelist)

if __name__ == '__main__':
    adaboost()
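To make the stump weighting concrete with numbers from the round-9 log above: the chosen stump's weighted error was 0.261865947444, so getcm returns cm = ln((1 - 0.2619) / 0.2619) ≈ 1.04, and updateweightlist then multiplies the weight of every misclassified document by exp(cm) = (1 - error) / error ≈ 2.82 before renormalizing, while correctly classified documents keep their weight unchanged (exp(0) = 1). Note that this getcm omits the 0.5 factor used in some AdaBoost formulations, as its own comment mentions.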

Once the Adaboost stumps are trained, prediction simply uses each stump's cm value as its weight and applies the sign function to the weighted sum of the stump outputs to get the final class.
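The code above stops at training, so the following is only a hedged sketch of what that prediction step could look like; the function name predict and its arguments are assumptions, but it reuses the (featureindex, threshold, symbol) stump format and the cm weights saved by traindata.

def predict(doc_words, stumplist, cmlist, featurelist):
    # Weighted vote of all trained stumps; cm is the per-stump weight (alpha).
    score = 0.0
    for (featureindex, threshold, symbol), cm in zip(stumplist, cmlist):
        present = 1 if featurelist[featureindex] in doc_words else 0
        label = symbol if present == threshold else -symbol
        score += cm * label
    return 1 if score > 0 else -1  # +1 = business, -1 = non-business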

