
LogisticRegression for Chinese Text Classification


Adapted from this blog post:

http://blog.csdn.net/github_36326955/article/details/54891204


These are my working notes.

Run the four scripts in order: 1, 2, 3, 4. (A single-entry runner sketch follows.)
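If you prefer one command instead of four manual runs, a minimal runner sketch is below. It assumes the scripts are saved in the current directory under the module names from their docstrings (an assumption; adjust the filenames to however you saved 1.py through 4.py):

import subprocess

# Run the pipeline steps in order; check_call stops at the first failure.
for script in ["corpus_segment.py", "corpus2Bunch.py",
               "TFIDF_space.py", "NBayes_Predict.py"]:
    subprocess.check_call(["python", script])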

1.py (corpus_segment.py)

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@version: python2.7.8
@author: XiangguoSun
@contact: sunxiangguodut@qq.com
@file: corpus_segment.py
@time: 2017/2/5 15:28
@software: PyCharm
"""
import sys
import os
import jieba

# Configure a UTF-8 default encoding (Python 2 only)
reload(sys)
sys.setdefaultencoding('utf-8')

def savefile(savepath, content):
    """Save content to a file."""
    with open(savepath, "wb") as fp:
        fp.write(content)
    # The with statement (Python 2.6+; Python 2.5 needs
    # from __future__ import with_statement) spares us the verbose
    # close() and try/finally bookkeeping. Beginners can consult
    # http://zhoutall.com/archives/325

def readfile(path):
    """Read a file's raw content."""
    with open(path, "rb") as fp:
        content = fp.read()
    return content

def corpus_segment(corpus_path, seg_path):
    """
    corpus_path: path of the unsegmented corpus
    seg_path: path where the segmented corpus is stored
    """
    catelist = os.listdir(corpus_path)  # all subdirectories of corpus_path
    # Each subdirectory name is a category name. For example, in
    # train_corpus/art/21.txt, 'train_corpus/' is corpus_path and
    # 'art' is one member of catelist.

    # Walk every file in every category directory
    for mydir in catelist:
        # mydir is the 'art' part of train_corpus/art/21.txt
        class_path = corpus_path + mydir + "/"  # category subdirectory, e.g. train_corpus/art/
        seg_dir = seg_path + mydir + "/"  # matching output directory, e.g. train_corpus_seg/art/

        if not os.path.exists(seg_dir):  # create the output directory if missing
            os.makedirs(seg_dir)

        file_list = os.listdir(class_path)  # all texts of one category,
        # e.g. file_list = ['21.txt', '22.txt', '23.txt', ...]
        for file_path in file_list:
            fullname = class_path + file_path  # full path, e.g. train_corpus/art/21.txt
            content = readfile(fullname)
            # content holds every character of the raw text, including
            # redundant spaces, blank lines, carriage returns, etc. Strip
            # these so only punctuation separates the compact text.
            content = content.replace("\r\n", "")  # remove line breaks
            content = content.replace(" ", "")  # remove blank lines and extra spaces
            content_seg = jieba.cut(content)  # segment the text
            savefile(seg_dir + file_path, " ".join(content_seg))  # write the segmented file

    print "Chinese corpus segmentation finished!"

# The block below runs only when this file is executed directly (from the
# command line or an IDE such as PyCharm); it is skipped when the file is
# imported as a module, so it effectively serves as a functional test.
# See http://imoyao.lofter.com/post/3492bc_bd0c4ce for a primer on
# if __name__ == "__main__".
if __name__ == "__main__":
    # Segment the training set
    corpus_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train/"  # input: unsegmented corpus
    seg_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_corpus_seg/"  # output: segmented corpus
    corpus_segment(corpus_path, seg_path)

    # Segment the test set
    corpus_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/answer/"  # input: unsegmented corpus
    seg_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_corpus_seg/"  # output: segmented corpus
    corpus_segment(corpus_path, seg_path)
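For reference, jieba.cut returns a generator of tokens, and joining them with spaces, as savefile does above, produces the space-delimited text the TF-IDF step later expects. A tiny illustration with an arbitrary sample sentence:

# -*- coding: UTF-8 -*-
import jieba

# jieba.cut yields the tokens lazily as a generator.
tokens = jieba.cut(u"小明硕士毕业于中国科学院计算所")
print u" ".join(tokens)
# Typical output: 小明 硕士 毕业 于 中国科学院 计算所
# (the exact segmentation may vary with the jieba version and dictionary)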

2.py (corpus2Bunch.py)
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@version: python2.7.8
@author: XiangguoSun
@contact: sunxiangguodut@qq.com
@file: corpus2Bunch.py
@time: 2017/2/7 7:41
@software: PyCharm
"""
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import os  # built-in module for directory operations; we need os.listdir
import cPickle as pickle  # import cPickle under the alias pickle
# Python also ships a pure-Python module named pickle; cPickle is the
# faster C implementation with the same interface. For details see:
# http://blog.csdn.net/github_36326955/article/details/54882506
# We use cPickle.dump below.
from sklearn.datasets.base import Bunch
# For now it is enough to remember that this is how the Bunch data
# structure is imported.

def _readfile(path):
    """Read a file. The leading underscore marks the function as private
    by convention only: it improves readability but does not stop
    outside code from calling it."""
    with open(path, "rb") as fp:
        content = fp.read()
    return content

def corpus2Bunch(wordbag_path, seg_path):
    catelist = os.listdir(seg_path)  # subdirectories of seg_path, i.e. the categories
    # Create a Bunch instance
    bunch = Bunch(target_name=[], label=[], filenames=[], contents=[])
    bunch.target_name.extend(catelist)
    # extend(addlist) appends every element of addlist to the original
    # list; note the difference from append() used below.

    # Walk every file in every category directory
    for mydir in catelist:
        class_path = seg_path + mydir + "/"  # path of the category subdirectory
        file_list = os.listdir(class_path)  # all files under class_path
        for file_path in file_list:
            fullname = class_path + file_path  # full path of the file
            bunch.label.append(mydir)
            bunch.filenames.append(fullname)
            bunch.contents.append(_readfile(fullname))  # read the file content
            # append(element) adds a single element to a list

    # Serialize the bunch to wordbag_path
    with open(wordbag_path, "wb") as file_obj:
        pickle.dump(bunch, file_obj)
    print "Finished building the text objects!"

if __name__ == "__main__":
    # Build the Bunch for the training set:
    wordbag_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/train_set.dat"  # output: Bunch storage path
    seg_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_corpus_seg/"  # input: segmented corpus
    corpus2Bunch(wordbag_path, seg_path)

    # Build the Bunch for the test set:
    wordbag_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_word_bag/test_set.dat"  # output: Bunch storage path
    seg_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_corpus_seg/"  # input: segmented corpus
    corpus2Bunch(wordbag_path, seg_path)
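A Bunch is essentially a dict whose keys are also accessible as attributes, which is why the code above can write bunch.label.append(...). A small sketch for inspecting what was pickled (the relative path is shorthand for the absolute training-set output path above):

import cPickle as pickle

# Load the Bunch written by corpus2Bunch and look inside it.
with open("train_word_bag/train_set.dat", "rb") as f:
    bunch = pickle.load(f)

print bunch.target_name               # list of category names
print len(bunch.contents)             # number of documents
print bunch.label[0], bunch.filenames[0]  # category and path of the first document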


3.py (TFIDF_space.py)

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@version: python2.7.8
@author: XiangguoSun
@contact: sunxiangguodut@qq.com
@file: TFIDF_space.py
@time: 2017/2/8 11:39
@software: PyCharm
"""
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

from sklearn.datasets.base import Bunch
import cPickle as pickle
from sklearn.feature_extraction.text import TfidfVectorizer

def _readfile(path):
    with open(path, "rb") as fp:
        content = fp.read()
    return content

def _readbunchobj(path):
    with open(path, "rb") as file_obj:
        bunch = pickle.load(file_obj)
    return bunch

def _writebunchobj(path, bunchobj):
    with open(path, "wb") as file_obj:
        pickle.dump(bunchobj, file_obj)

def vector_space(stopword_path, bunch_path, space_path, train_tfidf_path=None):
    stpwrdlst = _readfile(stopword_path).splitlines()
    bunch = _readbunchobj(bunch_path)
    tfidfspace = Bunch(target_name=bunch.target_name, label=bunch.label,
                       filenames=bunch.filenames, tdm=[], vocabulary={})

    if train_tfidf_path is not None:
        # Test set: reuse the training vocabulary so the train and test
        # matrices live in the same feature space.
        trainbunch = _readbunchobj(train_tfidf_path)
        tfidfspace.vocabulary = trainbunch.vocabulary
        vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True,
                                     max_df=0.5, vocabulary=trainbunch.vocabulary)
        tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
    else:
        # Training set: learn the vocabulary from the corpus itself.
        vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5)
        tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
        tfidfspace.vocabulary = vectorizer.vocabulary_

    _writebunchobj(space_path, tfidfspace)
    print "TF-IDF vector space created!"

if __name__ == '__main__':
    # (An earlier variant of this block used relative paths such as
    # "train_word_bag/train_set.dat"; the absolute paths below are equivalent.)
    stopword_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/hlt_stop_words.txt"  # input

    train_bunch_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/train_set.dat"  # input
    space_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/tfidfspace.dat"  # output
    vector_space(stopword_path, train_bunch_path, space_path)

    train_tfidf_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/tfidfspace.dat"  # input, produced above
    test_bunch_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_word_bag/test_set.dat"  # input
    test_space_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_word_bag/testspace.dat"  # output

    vector_space(stopword_path, test_bunch_path, test_space_path, train_tfidf_path)
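A side note on the design: re-fitting a TfidfVectorizer on the test set with a frozen vocabulary keeps the feature space aligned, but it recomputes the IDF weights from the test corpus. The more common sklearn pattern is to fit once on the training texts and call transform() on the test texts, which reuses both the vocabulary and the training IDF weights. A minimal sketch, with hypothetical stand-ins for the bunch.contents lists:

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for bunch.contents of the train/test Bunch objects.
train_texts = ["market economy growth policy", "museum art painting exhibit"]
test_texts = ["economy policy reform"]

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5)
train_tdm = vectorizer.fit_transform(train_texts)  # learn vocabulary and IDF on train
test_tdm = vectorizer.transform(test_texts)        # reuse both on the test texts
print train_tdm.shape, test_tdm.shape              # same number of columns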
4.py (NBayes_Predict.py)

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@version: python2.7.8
@author: XiangguoSun
@contact: sunxiangguodut@qq.com
@file: NBayes_Predict.py
@time: 2017/2/8 12:21
@software: PyCharm
"""
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import cPickle as pickle
from sklearn.naive_bayes import MultinomialNB  # multinomial Naive Bayes (left over from the original version; unused here)

# Read a pickled Bunch object
def _readbunchobj(path):
    with open(path, "rb") as file_obj:
        bunch = pickle.load(file_obj)
    return bunch

# Load the training set
trainpath = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204/chinese_text_classification-master/train_word_bag/tfidfspace.dat"
train_set = _readbunchobj(trainpath)

# Load the test set
testpath = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204/chinese_text_classification-master/test_word_bag/testspace.dat"
test_set = _readbunchobj(testpath)

from sklearn.linear_model.logistic import LogisticRegression
clf = LogisticRegression()
clf.fit(train_set.tdm, train_set.label)

# Predict the class of each test document
predicted = clf.predict(test_set.tdm)

for flabel, file_name, expct_cate in zip(test_set.label, test_set.filenames, predicted):
    if flabel != expct_cate:
        print file_name, ": actual:", flabel, " --> predicted:", expct_cate

print "Prediction finished!"

# Compute classification metrics:
from sklearn import metrics
def metrics_result(actual, predict):
    print 'precision: {0:.3f}'.format(metrics.precision_score(actual, predict, average='weighted'))
    print 'recall: {0:.3f}'.format(metrics.recall_score(actual, predict, average='weighted'))
    print 'f1-score: {0:.3f}'.format(metrics.f1_score(actual, predict, average='weighted'))

metrics_result(test_set.label, predicted)
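Beyond the per-file dump above, a per-class breakdown makes the weighted averages easier to interpret, and it shows which classes trigger the warning in the output below. A short sketch, reusing test_set and predicted from 4.py (classification_report and accuracy_score are standard sklearn.metrics functions):

from sklearn import metrics

# Per-class precision/recall/F1; classes that never appear in `predicted`
# are reported with 0.0 precision.
print metrics.classification_report(test_set.label, predicted)
print 'accuracy: {0:.3f}'.format(metrics.accuracy_score(test_set.label, predicted))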
As before, this uses the Fudan University news corpus.

Run results (a partial copy):


/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204/chinese_text_classification-master/test_corpus_seg/C16-Electronics/C16-Electronics13.txt : actual: C16-Electronics  --> predicted: C19-Computer
/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204/chinese_text_classification-master/test_corpus_seg/C16-Electronics/C16-Electronics50.txt : actual: C16-Electronics  --> predicted: C34-Economy
/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204/chinese_text_classification-master/test_corpus_seg/C16-Electronics/C16-Electronics48.txt : actual: C16-Electronics  --> predicted: C34-Economy
Prediction finished!
/home/appleyuchi/.virtualenvs/python2.7/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
precision: 0.860
recall: 0.885
f1-score: 0.862
/home/appleyuchi/.virtualenvs/python2.7/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

Process finished with exit code 0
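The UndefinedMetricWarning is not a bug in the script: with weighted averaging, any class that never shows up among the predictions has an ill-defined precision, which sklearn sets to 0.0 before averaging. A quick sketch to list those classes, reusing predicted and test_set from 4.py:

# Classes present in the test labels but absent from the predictions --
# the ones whose precision (and hence F-score) is ill-defined.
missing = sorted(set(test_set.label) - set(predicted))
print missing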




