Using NLPIR for Chinese Word Segmentation and POS Tagging
Background
In many cases, segmenting a text and removing stopwords is not enough for proper analysis. Beyond extracting keywords and new words, we often need extra information about each token, such as its part of speech. In Python, NLPIR handles this task well. If you do not have NLPIR set up yet, see the earlier post NLPIR快速搭建 (a quick NLPIR setup guide), or download the Chinese NLP bundle I have already prepared (NLP源碼集合).
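To give a feel for the interface before the full script below, here is a minimal sketch (Python 2, like the rest of this post), assuming the same `nlpir` wrapper module used throughout this article, whose `seg()` returns a list of (word, tag) pairs:

```python
# -*- coding: utf-8 -*-
# Minimal sketch: assumes the `nlpir` wrapper used in this post,
# whose seg() returns (word, pos_tag) pairs.
import nlpir

for word, tag in nlpir.seg(u'今天天氣真好'):
    print word, '/', tag
```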
The Code, Which Doubles as My Notes
```python
# -*- coding: utf-8 -*-
#
# Author: 田豐 (FontTian)
# Created: 2017/7/3
# Email: fonttian@Gmaill.com
# CSDN: http://blog.csdn.net/fontthrone
import nltk
import sys
import nlpir

sys.path.append("../")
reload(sys)
sys.setdefaultencoding('utf-8')

import jieba
from jieba import posseg


def cutstrpos(txt):
    # segmentation + POS tagging with jieba
    cutstr = posseg.cut(txt)
    result = ""
    for word, flag in cutstr:
        result += word + "/" + flag + ' '
    return result


def cutstring(txt):
    # segmentation only
    cutstr = jieba.cut(txt)
    result = " ".join(cutstr)
    return result


# read the input file
txtfileobject = open('txt/nltest1.txt')
textstr = ""
try:
    filestr = txtfileobject.read()
finally:
    txtfileobject.close()


# segmentation with NLPIR2016
def ChineseWordsSegmentationByNLPIR2016(text):
    txt = nlpir.seg(text)
    seg_list = []
    for t in txt:
        seg_list.append(t[0].encode('utf-8'))
    return seg_list


stopwords_path = 'stopwords\stopwords1893.txt'  # stopword list


# remove stopwords
def ClearStopWordsWithListByNLPIR2016(seg_list):
    mywordlist = []
    liststr = "/ ".join(seg_list)
    f_stop = open(stopwords_path)
    try:
        f_stop_text = f_stop.read()
        f_stop_text = unicode(f_stop_text, 'utf-8')
    finally:
        f_stop.close()
    f_stop_seg_list = f_stop_text.split('\n')
    for myword in liststr.split('/'):
        if not (myword.strip() in f_stop_seg_list) and len(myword.strip()) > 1:
            mywordlist.append(myword)
    return ''.join(mywordlist)


# print filestr
filestr2 = ClearStopWordsWithListByNLPIR2016(ChineseWordsSegmentationByNLPIR2016(filestr)).replace(' ', '')

# Chinese segmentation with POS tagging
posstr = cutstrpos(filestr2)

print '**** show is end ****'
print ' '
print 'This is posstr'
print posstr

strtag = [nltk.tag.str2tuple(word) for word in posstr.split()]
# for item in strtag:
#     print item

strsBySeg = nlpir.seg(filestr)
strsBySeg2 = nlpir.seg(filestr2)
strsByParagraphProcess = nlpir.ParagraphProcess(filestr, 1)
strsByParagraphProcessA = nlpir.ParagraphProcessA(filestr, ChineseWordsSegmentationByNLPIR2016(filestr)[0], 1)

print ' '
print ' '
print '**** strtag ****'
for word, tag in strtag:
    print word, "/", tag, "|",

print ' '
print ' '
print '**** strsBySeg ****'
for word, tag in strsBySeg:
    print word, "/", tag, "|",

print ' '
print ' '
print '**** strsBySeg2 ****'
for word, tag in strsBySeg2:
    print word, "/", tag, "|",

print ' '
print ' '
print '**** strsByParagraphProcess ****'
print strsByParagraphProcess

# print ' '
# print ' '
# print '**** strsByParagraphProcessA ****'
#
# for item in strsByParagraphProcessA:
#     print item,

print ' '
print ' '
print '**** show is end ****'
```
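As a quick sanity check of the round trip the script performs: `cutstrpos` joins jieba's (word, flag) pairs into `word/tag` tokens, and `nltk.tag.str2tuple` turns them back into tuples. A minimal, self-contained version (Python 2; note that `str2tuple` upper-cases the tag part):

```python
# -*- coding: utf-8 -*-
import nltk
from jieba import posseg

# tag a short sentence and join it into "word/tag" tokens ...
tagged = ' '.join(word + '/' + flag for word, flag in posseg.cut(u'我愛自然語言處理'))
# ... then recover (word, tag) tuples, as the script above does;
# note: str2tuple upper-cases the tag ('r' becomes 'R')
strtag = [nltk.tag.str2tuple(tok) for tok in tagged.split()]
for word, tag in strtag:
    print word, tag
```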
Practical Examples
NLPIR automatically segments and tags person names (the `nr` tag). With this we can collect custom new words, or pull out the sentences related to a particular group of people. Below is test code I wrote a little while ago for a project demo.
```python
# -*- coding: utf-8 -*-
#
# Author: 田豐 (FontTian)
# Created: 2017/7/11
# Email: fonttian@Gmaill.com
# CSDN: http://blog.csdn.net/fontthrone
from os import path
from scipy.misc import imread  # wordcloud-related imports left over from the larger demo
import matplotlib.pyplot as plt
import jieba
from nlpir import *
from wordcloud import WordCloud, ImageColorGenerator
import sys

reload(sys)
sys.setdefaultencoding('utf-8')
d = path.dirname(__file__)

text = '接待鐘世鎮院士,籌備楊東奇部長接待事宜。'
stopwords_path = 'stopwords\CNENstopwords.txt'  # stopword list
number = 10


def ShowByItem(List):
    print '********* show ', str(List), ' end *********'
    for item in List:
        print item,
    print
    print '********* show ', str(List), ' end *********'


# use NLPIR2016 to find academician names
def FindAcademicianNameByNLPIR2016(text, isAddYuanShi):
    txt = seg(text)
    seg_list = []
    for i in range(len(txt) - 1):  # stop one token early so txt[i + 1] cannot go out of range
        if txt[i][1] == 'nr' and txt[i + 1][0] == '院士':
            if isAddYuanShi == 1:
                seg_list.append(txt[i][0].encode('utf-8') + '院士')
            else:
                seg_list.append(txt[i][0].encode('utf-8'))
    return seg_list


str2 = FindAcademicianNameByNLPIR2016(text, 1)
ShowByItem(str2)

# output:
# ********* show ['\xe9\x92\x9f\xe4\xb8\x96\xe9\x95\x87\xe9\x99\xa2\xe5\xa3\xab'] end *********
# 鐘世鎮院士
# ********* show ['\xe9\x92\x9f\xe4\xb8\x96\xe9\x95\x87\xe9\x99\xa2\xe5\xa3\xab'] end *********
```
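The same pattern generalizes beyond 院士: any title can follow an `nr`-tagged name. Here is a hypothetical variant with the title as a parameter (`find_names_by_title` is my name for it, not part of the original demo; same assumption that `seg()` returns (word, tag) pairs):

```python
# -*- coding: utf-8 -*-
from nlpir import seg

# hypothetical generalization of FindAcademicianNameByNLPIR2016:
# collect every person name ('nr') immediately followed by the given title
def find_names_by_title(text, title, append_title=True):
    tokens = seg(text)
    names = []
    for i in range(len(tokens) - 1):
        if tokens[i][1] == 'nr' and tokens[i + 1][0] == title:
            name = tokens[i][0].encode('utf-8')
            if append_title:
                name += title.encode('utf-8')
            names.append(name)
    return names

# the test sentence above also contains a 部長 (minister)
for name in find_names_by_title(u'接待鐘世鎮院士,籌備楊東奇部長接待事宜。', u'部長'):
    print name
```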
As used in the actual demo (an excerpt, so it assumes `import pandas as pd`, the full text in `fullContent`, and the DataFrames `df` / `df1` from the surrounding project):

```python
# use NLPIR2016 to find academician names
def FindAcademicianNameByNLPIR2016(text, isAddYuanShi):
    txt = seg(text)
    seg_list = []
    for i in range(len(txt) - 1):
        if txt[i][1] == 'nr' and txt[i + 1][0] == '院士':
            if isAddYuanShi == 1:
                seg_list.append(txt[i][0].encode('utf-8') + '院士')
            else:
                seg_list.append(txt[i][0].encode('utf-8'))
    return seg_list


strAcademicianName = FindAcademicianNameByNLPIR2016(fullContent, 1)
strAcademicianName = list(set(strAcademicianName))  # deduplicate

# store with pandas
dfAcademicianName = pd.DataFrame(strAcademicianName)
dfAcademicianName.columns = ['AcademicianName']
dfAcademicianName.to_csv('csv/dfAcademicianName')

# read back with pandas
dfNewWords = pd.read_csv("csv/dfNewWords")
dfAcademicianName = pd.read_csv("csv/dfAcademicianName")

# you can also add these to the user dictionary
# add_word(dfAcademicianName['AcademicianName'])


# extract every report that mentions an academician
def GetAcademicianCSV(df, strColumn, df1):
    dfAcademicianName = pd.read_csv("csv/dfAcademicianName")
    listAcademicianName = list(dfAcademicianName['AcademicianName'])
    print type(listAcademicianName)
    mywordlistAcademicianName = []
    mywordlisttime = []
    mywordAca = []
    df1 = df1.copy()
    numlen = len(df1.index)
    for i in range(numlen):
        for myword in df1.loc[i, strColumn].split():
            if (myword in listAcademicianName) and len(myword) > 1:
                print myword
                mywordlistAcademicianName.append(df.loc[i, strColumn])
                mywordAca.append(myword)
                mywordlisttime.append(df.loc[i, 'time'])
    return mywordlistAcademicianName, mywordlisttime, mywordAca


# the returned information
mywordlistAcademicianName, mywordlisttime, mywordAca = GetAcademicianCSV(df, 'content', df1)
```
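One note on the commented-out `add_word` line above: it passes a whole pandas column at once. If the wrapper's `add_word` registers one word per call (an assumption; check your `nlpir` wrapper), looping is the safer sketch:

```python
# -*- coding: utf-8 -*-
import pandas as pd
from nlpir import add_word

# assumption: add_word(word) registers a single user-defined word,
# as the commented-out call in the excerpt above suggests
dfAcademicianName = pd.read_csv('csv/dfAcademicianName')
for name in dfAcademicianName['AcademicianName']:
    add_word(name)
```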
Summary
That is everything on using NLPIR for Chinese word segmentation and POS tagging; I hope this article helps you solve the problems you run into.