

Naive Bayes (NaiveBayes) for Chinese text classification on a small dataset

Published: 2023/12/20

This article on Naive Bayes (NaiveBayes) Chinese text classification for small datasets was collected and organized by 生活随笔 and is shared here for reference.

轉(zhuǎn)自相國(guó)大人的博客,

http://blog.csdn.net/github_36326955/article/details/54891204

These are my notes on it.

Run the scripts in the order 1, 2, 3, 4:

1.py(corpus_segment.py)


```python
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@version: python2.7.8
@author: XiangguoSun
@contact: sunxiangguodut@qq.com
@file: corpus_segment.py
@time: 2017/2/5 15:28
@software: PyCharm
"""
import sys
import os
import jieba

# Set up a UTF-8 output environment
reload(sys)
sys.setdefaultencoding('utf-8')


def savefile(savepath, content):
    """Save content to a file.

    The with-statement (Python 2.6+; 2.5 needs
    from __future__ import with_statement) replaces the usual
    close()/try boilerplate. Beginners can read this introduction:
    http://zhoutall.com/archives/325
    """
    with open(savepath, "wb") as fp:
        fp.write(content)


def readfile(path):
    """Read a file and return its raw content."""
    with open(path, "rb") as fp:
        content = fp.read()
    return content


def corpus_segment(corpus_path, seg_path):
    """
    corpus_path: path of the unsegmented corpus
    seg_path:    path where the segmented corpus is written
    """
    catelist = os.listdir(corpus_path)  # all subdirectories of corpus_path
    # Each subdirectory name is a category name: in train_corpus/art/21.txt,
    # 'train_corpus/' is corpus_path and 'art' is one member of catelist.
    for mydir in catelist:
        class_path = corpus_path + mydir + "/"  # e.g. train_corpus/art/
        seg_dir = seg_path + mydir + "/"        # e.g. train_corpus_seg/art/
        if not os.path.exists(seg_dir):         # create the output dir if missing
            os.makedirs(seg_dir)
        file_list = os.listdir(class_path)      # all texts of this category,
                                                # e.g. ['21.txt', '22.txt', ...]
        for file_path in file_list:
            fullname = class_path + file_path   # e.g. train_corpus/art/21.txt
            content = readfile(fullname)
            # content still holds every character of the raw text, including
            # extra spaces, blank lines and line breaks; strip those so that
            # only punctuation separates the compacted text.
            content = content.replace("\r\n", "")  # drop line breaks
            content = content.replace(" ", "")     # drop blank lines / extra spaces
            content_seg = jieba.cut(content)       # segment the text
            savefile(seg_dir + file_path, " ".join(content_seg))
    print "Chinese corpus segmentation finished!"


if __name__ == "__main__":
    # Segment the training set
    corpus_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train/"  # input: unsegmented corpus
    seg_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_corpus_seg/"  # output of this program
    corpus_segment(corpus_path, seg_path)
    # Segment the test set
    corpus_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/answer/"  # input: unsegmented corpus
    seg_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_corpus_seg/"  # output of this program
    corpus_segment(corpus_path, seg_path)
```
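The script above is Python 2 (`reload(sys)`, print statements). As a minimal Python 3 sketch of the same walk-segment-save loop: `fake_cut` is a made-up stand-in tokenizer so the sketch runs without jieba installed; swap in `jieba.cut` for real Chinese segmentation, and the temp-directory corpus is purely illustrative.

```python
import os
import tempfile

def fake_cut(text):
    # Stand-in for jieba.cut: one token per character.
    return list(text)

def corpus_segment(corpus_path, seg_path, cut=fake_cut):
    for category in os.listdir(corpus_path):           # folder name = class label
        class_dir = os.path.join(corpus_path, category)
        seg_dir = os.path.join(seg_path, category)
        os.makedirs(seg_dir, exist_ok=True)
        for name in os.listdir(class_dir):
            with open(os.path.join(class_dir, name), encoding="utf-8") as fp:
                content = fp.read().replace("\r\n", "").replace(" ", "")
            with open(os.path.join(seg_dir, name), "w", encoding="utf-8") as fp:
                fp.write(" ".join(cut(content)))       # tokens joined by spaces

# Build a tiny throwaway corpus and run the loop over it.
root = tempfile.mkdtemp()
corpus = os.path.join(root, "train")
seg = os.path.join(root, "train_corpus_seg")
os.makedirs(os.path.join(corpus, "art"))
with open(os.path.join(corpus, "art", "21.txt"), "w", encoding="utf-8") as fp:
    fp.write("ab cd")
corpus_segment(corpus, seg)
with open(os.path.join(seg, "art", "21.txt"), encoding="utf-8") as fp:
    result = fp.read()
print(result)  # spaces stripped first, then re-joined per token: "a b c d"
```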


2.py(corpus2Bunch.py)


```python
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@version: python2.7.8
@author: XiangguoSun
@contact: sunxiangguodut@qq.com
@file: corpus2Bunch.py
@time: 2017/2/7 7:41
@software: PyCharm
"""
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import os  # stdlib module for file/directory work; we need os.listdir
import cPickle as pickle  # import cPickle under the alias pickle
# Python also ships a pure-Python module literally named pickle; cPickle is
# the faster C implementation. See the author's other post for details:
# http://blog.csdn.net/github_36326955/article/details/54882506
# Below we use cPickle.dump.
from sklearn.datasets.base import Bunch
# No need to study Bunch in depth for now; just remember this import line.


def _readfile(path):
    """Read a file.

    The leading underscore marks the function as private by convention
    only; it does not prevent outside calls, it just aids readability.
    """
    with open(path, "rb") as fp:
        content = fp.read()
    return content


def corpus2Bunch(wordbag_path, seg_path):
    catelist = os.listdir(seg_path)  # subdirectories of seg_path = the categories
    # Build a Bunch instance
    bunch = Bunch(target_name=[], label=[], filenames=[], contents=[])
    bunch.target_name.extend(catelist)
    # extend(addlist) appends every element of addlist to the existing list
    for mydir in catelist:
        class_path = seg_path + mydir + "/"  # path of this category's folder
        file_list = os.listdir(class_path)   # all files under class_path
        for file_path in file_list:
            fullname = class_path + file_path
            bunch.label.append(mydir)
            bunch.filenames.append(fullname)
            bunch.contents.append(_readfile(fullname))
            # append(element) adds one element; note the difference from extend()
    # Store the bunch at wordbag_path
    with open(wordbag_path, "wb") as file_obj:
        pickle.dump(bunch, file_obj)
    print "Text objects built!"


if __name__ == "__main__":
    # Bunch the training set:
    wordbag_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/train_set.dat"  # output: Bunch storage path
    seg_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_corpus_seg/"  # input: segmented corpus
    corpus2Bunch(wordbag_path, seg_path)
    # Bunch the test set:
    wordbag_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_word_bag/test_set.dat"  # output: Bunch storage path
    seg_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_corpus_seg/"  # input: segmented corpus
    corpus2Bunch(wordbag_path, seg_path)
```


3.py(TFIDF_space.py)


```python
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@version: python2.7.8
@author: XiangguoSun
@contact: sunxiangguodut@qq.com
@file: TFIDF_space.py
@time: 2017/2/8 11:39
@software: PyCharm
"""
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

from sklearn.datasets.base import Bunch
import cPickle as pickle
from sklearn.feature_extraction.text import TfidfVectorizer


def _readfile(path):
    with open(path, "rb") as fp:
        content = fp.read()
    return content


def _readbunchobj(path):
    with open(path, "rb") as file_obj:
        bunch = pickle.load(file_obj)
    return bunch


def _writebunchobj(path, bunchobj):
    with open(path, "wb") as file_obj:
        pickle.dump(bunchobj, file_obj)


def vector_space(stopword_path, bunch_path, space_path, train_tfidf_path=None):
    stpwrdlst = _readfile(stopword_path).splitlines()
    bunch = _readbunchobj(bunch_path)
    tfidfspace = Bunch(target_name=bunch.target_name, label=bunch.label,
                       filenames=bunch.filenames, tdm=[], vocabulary={})
    if train_tfidf_path is not None:
        # Test set: reuse the training vocabulary so both sets share one
        # feature space.
        trainbunch = _readbunchobj(train_tfidf_path)
        tfidfspace.vocabulary = trainbunch.vocabulary
        vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True,
                                     max_df=0.5, vocabulary=trainbunch.vocabulary)
        tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
    else:
        vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True,
                                     max_df=0.5)
        tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
        tfidfspace.vocabulary = vectorizer.vocabulary_
    _writebunchobj(space_path, tfidfspace)
    print "tf-idf vector space built!"


if __name__ == '__main__':
    stopword_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/hlt_stop_words.txt"  # input
    train_bunch_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/train_set.dat"  # input
    space_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/tfidfspace.dat"  # output
    vector_space(stopword_path, train_bunch_path, space_path)

    train_tfidf_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/tfidfspace.dat"  # input, produced above
    test_bunch_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_word_bag/test_set.dat"  # input
    test_space_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_word_bag/testspace.dat"  # output
    vector_space(stopword_path, test_bunch_path, test_space_path, train_tfidf_path)
```
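To make the weighting concrete, TfidfVectorizer's default formula (smooth idf, then L2 normalisation of each document vector) can be reproduced by hand on a toy corpus. The `sublinear_tf=True` option used in the script (tf replaced by 1 + ln(tf)) is deliberately left out here to keep the arithmetic short, and the two documents are invented:

```python
import math

def tfidf(docs):
    # idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1, sklearn's smooth_idf default
    vocab = sorted({t for d in docs for t in d.split()})
    n = len(docs)
    df = {t: sum(t in d.split() for d in docs) for t in vocab}
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for d in docs:
        toks = d.split()
        row = [toks.count(t) * idf[t] for t in vocab]   # raw tf * idf
        norm = math.sqrt(sum(x * x for x in row))       # L2-normalise the doc
        rows.append([x / norm for x in row])
    return vocab, rows

vocab, rows = tfidf(["a b a", "a c"])
# 'a' appears in every document, so its idf is the minimum, ln(1) + 1 = 1,
# and in the second document it is outweighed by the rarer term 'c'.
print(vocab)                                     # ['a', 'b', 'c']
print(round(sum(x * x for x in rows[0]), 6))     # 1.0 (unit L2 norm)
```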


4.py(NBayes_Predict.py)


```python
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@version: python2.7.8
@author: XiangguoSun
@contact: sunxiangguodut@qq.com
@file: NBayes_Predict.py
@time: 2017/2/8 12:21
@software: PyCharm
"""
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import cPickle as pickle
from sklearn.naive_bayes import MultinomialNB  # multinomial Naive Bayes


def _readbunchobj(path):
    """Read a pickled bunch object."""
    with open(path, "rb") as file_obj:
        bunch = pickle.load(file_obj)
    return bunch

# Load the training set
trainpath = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/tfidfspace.dat"
train_set = _readbunchobj(trainpath)

# Load the test set
testpath = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_word_bag/testspace.dat"
test_set = _readbunchobj(testpath)

# Train the classifier on the tf-idf vectors and class labels.
# alpha=0.01 is the additive (Laplace) smoothing parameter, not an iteration
# count: smaller values smooth the per-class word counts less.
clf = MultinomialNB(alpha=0.01).fit(train_set.tdm, train_set.label)

# Predict the class of each test document
predicted = clf.predict(test_set.tdm)

for flabel, file_name, expct_cate in zip(test_set.label, test_set.filenames, predicted):
    if flabel != expct_cate:
        print file_name, ": actual:", flabel, " --> predicted:", expct_cate

print "Prediction finished!"

# Classification metrics:
from sklearn import metrics

def metrics_result(actual, predict):
    print 'precision:{0:.3f}'.format(metrics.precision_score(actual, predict, average='weighted'))
    print 'recall:{0:0.3f}'.format(metrics.recall_score(actual, predict, average='weighted'))
    print 'f1-score:{0:.3f}'.format(metrics.f1_score(actual, predict, average='weighted'))

metrics_result(test_set.label, predicted)
```
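To see what `alpha` actually does, here is a toy multinomial Naive Bayes written from scratch with the stdlib only: alpha is additive (Laplace) smoothing on the per-class word counts, P(w|c) = (count(w,c) + alpha) / (total_c + alpha * |V|). The documents and labels below are invented for illustration.

```python
import math
from collections import Counter

def train(docs, labels, alpha=0.01):
    vocab = sorted({t for d in docs for t in d.split()})
    classes = sorted(set(labels))
    # Log class priors from label frequencies
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d.split())
    loglik = {}
    for c in classes:
        total = sum(counts[c].values())
        # Smoothed log-likelihood of each vocabulary word given the class
        loglik[c] = {w: math.log((counts[c][w] + alpha) / (total + alpha * len(vocab)))
                     for w in vocab}
    return prior, loglik

def predict(doc, prior, loglik):
    # argmax over classes of log P(c) + sum of log P(w|c); out-of-vocabulary
    # words contribute nothing in this simple sketch
    scores = {c: prior[c] + sum(loglik[c].get(w, 0.0) for w in doc.split())
              for c in prior}
    return max(scores, key=scores.get)

prior, loglik = train(["ball goal goal", "vote law", "goal match"],
                      ["sports", "politics", "sports"])
print(predict("goal ball", prior, loglik))  # 'sports'
```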


A few usage notes:

1. Run the four scripts above in order.

2. The data must be laid out the same way as in the original blog: each folder's name is its category label, and the code picks the labels up automatically.

3. After each complete run, delete the train_corpus_seg and test_corpus_seg folders before the next run; otherwise leftover output from the previous run will contaminate the next prediction.

The same applies when switching to a different Chinese dataset: delete both folders. In short, the first step before running this code is to check that those two folders are empty. (On the very first run they have not been created yet, so there is nothing to check.)
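This cleanup step can be automated. A small sketch: `reset_dirs` is a hypothetical helper, and the paths here use a throwaway temp directory for illustration rather than the real project paths.

```python
import os
import shutil
import tempfile

def reset_dirs(*dirs):
    """Delete each directory tree that exists, leftovers included."""
    for d in dirs:
        if os.path.exists(d):
            shutil.rmtree(d)

# Simulate a stale output folder from a previous run, then clear it.
root = tempfile.mkdtemp()
stale = os.path.join(root, "train_corpus_seg")
os.makedirs(os.path.join(stale, "art"))
reset_dirs(stale, os.path.join(root, "test_corpus_seg"))
print(os.path.exists(stale))  # False
```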

A strength of the original post is that even on a small dataset (fewer than 1,000 samples, evaluated with ten-fold cross-validation), prediction accuracy reaches 60%–70%.
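The ten-fold split mentioned above can be sketched with the stdlib alone: shuffle the sample indices once, then assign every k-th index to fold i as its test part. `kfold_indices` is a hypothetical helper, not part of the original code (scikit-learn's KFold does the same job).

```python
import random

def kfold_indices(n_samples, k=10, seed=0):
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)       # one fixed shuffle for reproducibility
    folds = []
    for i in range(k):
        test = idx[i::k]                   # every k-th shuffled index -> fold i
        held_out = set(test)
        train = [j for j in idx if j not in held_out]
        folds.append((train, test))
    return folds

folds = kfold_indices(950, k=10)
print(len(folds))                          # 10
print(sum(len(t) for _, t in folds))       # 950: each sample tested exactly once
```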


Input/output relationships between the programs (the original diagram image is lost; reconstructed from the paths in the code):

    train/  --corpus_segment.py-->  train_corpus_seg/  --corpus2Bunch.py-->  train_word_bag/train_set.dat  --TFIDF_space.py-->  train_word_bag/tfidfspace.dat
    answer/ --corpus_segment.py-->  test_corpus_seg/   --corpus2Bunch.py-->  test_word_bag/test_set.dat    --TFIDF_space.py-->  test_word_bag/testspace.dat
    tfidfspace.dat + testspace.dat  --NBayes_Predict.py-->  predicted labels, precision/recall/F1
