NLP Chinese Text Summary Extraction: Quickly Extracting the Main Idea of a Text

Published: 2024/9/30


Text Summary Extraction

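I previously wrote a version of text summary extraction, but that version was far from perfect and had some flaws (even so, it was bookmarked a few dozen times).
Chinese text summary extraction (with code), based on Python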

Today I am writing an improved version.

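Text summarization aims to convert a text, or a collection of texts, into a short summary that contains the key information. By input type it is divided into single-document and multi-document summarization: single-document summarization generates a summary from one given document, while multi-document summarization generates a summary from a set of topically related documents. By output type it is divided into extractive and abstractive summarization.
Summary: put simply, it means using a few sentences to capture what a passage says. The method in this article is extractive: it selects existing clauses from the text.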

Sample text

花果園中央商務區F4棟樓B33城，時尚星廊，市民來電反映在此店消費被欺騙，該店承諾消費不用錢但后來需要收錢，并且此處工作人員表示自己是著名發型師，市民表示向花果園工商局反映后工作人員不作為，并且自己有相關證據，但花果園工商局叫其自己進行協商，市民希望職能部門能夠盡快處理此問題，對此店的行為進行處罰并且將自己的錢退還，請職能部門及時處理。商家聯系方式：11111111111。

Final result

{'mean_scoredsenteces': ['花果園中央商務區F4棟樓B33城，', '市民來電反映在此店消費被欺騙，', '該店承諾消費不用錢但后來需要收錢，', '市民表示向花果園工商局反映后工作人員不作為，']}
{'topnsenteces': ['花果園中央商務區F4棟樓B33城，', '該店承諾消費不用錢但后來需要收錢，', '市民表示向花果園工商局反映后工作人員不作為，']}

Implementation

Step 1: Split the text into clauses

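# coding: utf-8
texts = """花果園中央商務區F4棟樓B33城，時尚星廊，市民來電反映在此店消費被欺騙，該店承諾消費不用錢但后來需要收錢，并且此處工作人員表示自己是著名發型師，市民表示向花果園工商局反映后工作人員不作為，并且自己有相關證據，但花果園工商局叫其自己進行協商，市民希望職能部門能夠盡快處理此問題，對此店的行為進行處罰并且將自己的錢退還，請職能部門及時處理。商家聯系方式：11111111111。"""


def sent_tokenizer(texts):
    """Split the text into clauses at punctuation marks."""
    punt_list = ',.!?:;~，。！？：；～'  # punctuation that can end a clause
    sentences = []
    start = 0  # index where the current clause begins
    for i, ch in enumerate(texts):  # walk through every character
        # Look one character ahead: if the next character is also punctuation,
        # do not cut yet, so a run of punctuation stays with its clause.
        next_ch = texts[i + 1] if i + 1 < len(texts) else ''
        if ch in punt_list and next_ch != '' and next_ch not in punt_list:
            sentences.append(texts[start:i + 1])  # the clause ends at this punctuation mark
            start = i + 1  # start now points to the beginning of the next clause
    if start < len(texts):
        sentences.append(texts[start:])  # remaining text, e.g. when the text does not end with a clean cut
    return sentences


sentence = sent_tokenizer(str(texts))
print(sentence)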
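The splitting rule above (cut after a punctuation mark unless the next character is also punctuation) can also be written as a single regular expression. The following is only a sketch under that reading of the rule; `split_clauses` and `PUNCT` are names made up for this illustration and are not part of the original code.

```python
import re

# Same punctuation set as punt_list in sent_tokenizer.
PUNCT = ',.!?:;~，。！？：；～'


def split_clauses(text):
    """Split text into clauses, keeping each run of punctuation with the clause before it."""
    # Either: some non-punctuation characters followed by any trailing punctuation,
    # or: a run of punctuation on its own (only possible at the very start of the text).
    pattern = '[^{0}]+[{0}]*|[{0}]+'.format(re.escape(PUNCT))
    return re.findall(pattern, text)


print(split_clauses('你好，世界！！再見'))  # ['你好，', '世界！！', '再見']
```

Joining the returned pieces gives back the input unchanged, which is a quick way to check that no character is lost during splitting.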

Step 2: Load the stop words
Stop words: see "NLP Chinese stop-word dataset"

# Stop words
def stopwordslist(filepath):
    # The stop-word file is read as GBK, one stop word per line.
    with open(filepath, 'r', encoding='gbk') as f:
        stopwords = [line.strip() for line in f]
    return stopwords


# Load the stop words
stopwords = stopwordslist("停用詞.txt")

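Step 3: Extract the high-frequency words
Here I take the 20 most frequent words.
The number can be adjusted to the length of the text: the longer the text, the more high-frequency words are needed.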

import nltk
import jieba
import numpy

stopwords = stopwordslist("停用詞.txt")
sentence = sent_tokenizer(texts)  # split into clauses
# Keep words that are not stop words, not single characters and not symbols.
words = [w for s in sentence for w in jieba.cut(s)
         if w not in stopwords if len(w) > 1 and w != '\t']
wordfre = nltk.FreqDist(words)  # count word frequencies
# Take the 20 most frequent words.
topn_words = [w[0] for w in sorted(wordfre.items(), key=lambda d: d[1], reverse=True)][:20]
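The same frequency count can be sketched with the standard library alone, which makes it easy to try different values for the number of high-frequency words. This is an illustration, not the code used in the article: `top_words` is a hypothetical helper, and the token list is made up (in the article the tokens come from `jieba.cut`).

```python
from collections import Counter


def top_words(tokens, stopwords, n=20):
    """Return the n most frequent tokens that are longer than one character and not stop words."""
    kept = [t for t in tokens if len(t) > 1 and t not in stopwords]
    return [word for word, _ in Counter(kept).most_common(n)]


# Hypothetical, already segmented tokens.
tokens = ['市民', '反映', '消費', '的', '市民', '工商局', '反映', '市民', '，']
print(top_words(tokens, stopwords={'的'}, n=2))  # ['市民', '反映']
```

For a longer text, a larger `n` keeps more candidate words, in line with the advice above.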

Step 4: Core code: scoring the clauses
Approach

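1. Input parameters: sentences and topn_words. sentences is the list of clauses; topn_words is the list of high-frequency words.

2. Prepare the list that records the score of each clause, scores=[], and initialize the clause index to -1: sentence_idx=-1

3. Segment every clause into words, which gives one token list per clause.

4. Iterate over the token lists: for s in [list(jieba.cut(s)) for s in sentences]

5. Increment the clause index, so that 0 denotes the first clause: sentence_idx+=1

6. Prepare word_idx=[] to hold the positions of the high-frequency words in the current clause.

7. Iterate over every high-frequency word.

8. Record the position at which the high-frequency word occurs in the current clause.

9. Sort word_idx. The result looks like [1, 2, 3, 4, 5] or [0, 1], where each number is the position of a high-frequency word in the current clause.

10. Cluster word_idx. Initialize clusters=[] and cluster=[word_idx[0]]. clusters holds all clusters of the current clause; cluster holds a single cluster.

11. Walk through the positions in word_idx. If the difference between two neighbouring positions is smaller than the threshold (I use 2; for longer texts the threshold should be increased), the two words belong to the same cluster and the position is appended to cluster.

12. Append cluster to clusters. The resulting clusters looks like [[0, 1, 2], [4, 5], [7]] (when the positions in the current clause are 0, 1, 2, 4, 5, 7).

13. Iterate over every cluster in clusters and score it. The score is: (number of high-frequency words in the cluster) * (number of high-frequency words in the cluster) / (distance from the first to the last high-frequency word of the cluster, counted inclusively, i.e. c[-1]-c[0]+1)

14. Keep the highest cluster score of the current clause.

15. Record the score of the current clause in the format (clause index, highest cluster score).

16. Repeat steps 4-15 to score the next clause.

17. Return scores.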
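The clustering and scoring steps above can be tried in isolation on the example positions 0, 1, 2, 4, 5, 7. The sketch below assumes the same threshold and formula as described in the steps; `cluster_indices` and `cluster_score` are hypothetical helper names introduced only for this illustration.

```python
def cluster_indices(word_idx, threshold=2):
    """Group sorted positions into clusters; neighbours closer than threshold share a cluster."""
    if not word_idx:
        return []
    clusters = [[word_idx[0]]]
    for prev, cur in zip(word_idx, word_idx[1:]):
        if cur - prev < threshold:
            clusters[-1].append(cur)  # close enough: same cluster
        else:
            clusters.append([cur])    # too far apart: start a new cluster
    return clusters


def cluster_score(cluster):
    """(number of high-frequency words)^2 / inclusive span of the cluster in tokens."""
    span = cluster[-1] - cluster[0] + 1
    return len(cluster) ** 2 / span


clusters = cluster_indices([0, 1, 2, 4, 5, 7])
print(clusters)                                 # [[0, 1, 2], [4, 5], [7]]
print([cluster_score(c) for c in clusters])     # [3.0, 2.0, 1.0]
print(max(cluster_score(c) for c in clusters))  # 3.0, the score of the clause
```

With a threshold of 3 the same positions merge into a single cluster that scores 6 * 6 / 8 = 4.5, which shows how a larger threshold favours clauses whose high-frequency words are spread out.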

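def _score_sentences(sentences, topn_words):
    # sentences: list of clauses (the split text); topn_words: list of high-frequency words
    scores = []
    sentence_idx = -1  # clause index, starts at -1
    # Iterate over the clauses; each one is a token list, e.g. clause 1 is
    # ['花', '果園', '中央商務區', 'F4', '棟樓', 'B33', '城', '，']
    for s in [list(jieba.cut(s)) for s in sentences]:
        sentence_idx += 1  # 0 denotes the first clause
        # Positions of the high-frequency words in this clause,
        # e.g. [1, 2, 3, 4, 5], [0, 1] or [0, 1, 2, 4, 5, 7]
        word_idx = []
        for w in topn_words:  # iterate over every high-frequency word
            try:
                word_idx.append(s.index(w))  # position of the first occurrence of w in the clause
            except ValueError:  # w does not occur in this clause
                pass
        word_idx.sort()
        if len(word_idx) == 0:
            continue  # no high-frequency word: the clause gets no score
        # Group neighbouring positions into clusters using a distance threshold.
        clusters = []  # all clusters of this clause, e.g. [[0, 1, 2], [4, 5], [7]]
        cluster = [word_idx[0]]  # a single cluster, e.g. [0, 1, 2]
        CLUSTER_THRESHOLD = 2  # distance threshold; I use 2
        i = 1
        while i < len(word_idx):  # walk through the positions in this clause
            # If the position differs from the previous one by less than the threshold,
            # both words belong to the same cluster.
            if word_idx[i] - word_idx[i - 1] < CLUSTER_THRESHOLD:
                cluster.append(word_idx[i])
            else:
                clusters.append(cluster[:])  # store the finished cluster
                cluster = [word_idx[i]]  # start a new cluster
            i += 1
        clusters.append(cluster)
        # Score every cluster; the highest cluster score is the score of the clause.
        max_cluster_score = 0
        for c in clusters:
            significant_words_in_cluster = len(c)  # number of high-frequency words in the cluster
            total_words_in_cluster = c[-1] - c[0] + 1  # inclusive distance from first to last word
            score = 1.0 * significant_words_in_cluster * significant_words_in_cluster / total_words_in_cluster
            if score > max_cluster_score:
                max_cluster_score = score
        # A clause may contain several clusters; store (clause index, highest cluster score).
        scores.append((sentence_idx, max_cluster_score))
    return scores


scored_sentences = _score_sentences(sentence, topn_words)
print(scored_sentences)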

Step 5: Extract the summary
Method 1: use the mean and standard deviation to filter out unimportant clauses

avg = numpy.mean([s[1] for s in scored_sentences])  # mean score
std = numpy.std([s[1] for s in scored_sentences])  # standard deviation
# Keep clauses whose score exceeds mean + 0.5 * std; sent_idx is the clause index.
mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences if score > (avg + 0.5 * std)]
c = dict(mean_scored_summary=[sentence[idx] for (idx, score) in mean_scored])
print(c)

Method 2: return the highest-scoring clauses

top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-3:]  # sort by score and take the best 3 clauses
top_n_scored = sorted(top_n_scored, key=lambda s: s[0])  # put the selected clauses back in text order
c = dict(top_n_summary=[sentence[idx] for (idx, score) in top_n_scored])
print(c)
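To see how the two selection methods differ, they can be run on a small set of made-up scores. This is a sketch using only the standard library: the `scored` list is hypothetical data in the (clause index, score) format produced in step 4, and `select_by_threshold` and `select_top_n` are illustrative names. `statistics.pstdev` is used because `numpy.std` computes the population standard deviation by default.

```python
from statistics import mean, pstdev

# Hypothetical (clause index, score) pairs.
scored = [(0, 3.0), (1, 1.0), (2, 2.0), (3, 1.0), (4, 3.0)]


def select_by_threshold(scored, k=0.5):
    """Method 1: keep clauses scoring above mean + k * population standard deviation."""
    values = [score for _, score in scored]
    cutoff = mean(values) + k * pstdev(values)
    return [idx for idx, score in scored if score > cutoff]


def select_top_n(scored, n=3):
    """Method 2: keep the n best-scoring clauses, returned in text order."""
    best = sorted(scored, key=lambda s: s[1])[-n:]
    return sorted(idx for idx, _ in best)


print(select_by_threshold(scored))  # [0, 4]
print(select_top_n(scored, n=3))    # [0, 2, 4]
```

Method 1 lets the data decide how many clauses are returned (here the cutoff is about 2.45, so only the two clauses scoring 3.0 pass), while method 2 always returns a fixed number.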

Step 6: The complete, consolidated code

# coding: utf-8
import nltk
import jieba
import numpy


# Split the text into clauses
def sent_tokenizer(texts):
    punt_list = ',.!?:;~，。！？：；～'  # punctuation that can end a clause
    sentences = []
    start = 0  # index where the current clause begins
    for i, ch in enumerate(texts):  # walk through every character
        # Look one character ahead: if the next character is also punctuation, do not cut yet.
        next_ch = texts[i + 1] if i + 1 < len(texts) else ''
        if ch in punt_list and next_ch != '' and next_ch not in punt_list:
            sentences.append(texts[start:i + 1])  # the clause ends at this punctuation mark
            start = i + 1  # start now points to the beginning of the next clause
    if start < len(texts):
        sentences.append(texts[start:])  # remaining text at the end
    return sentences


# Load the stop words
def stopwordslist(filepath):
    with open(filepath, 'r', encoding='gbk') as f:
        stopwords = [line.strip() for line in f]
    return stopwords


# Score the clauses
def score_sentences(sentences, topn_words):
    # sentences: list of clauses (the split text); topn_words: list of high-frequency words
    scores = []
    sentence_idx = -1  # clause index, starts at -1
    # Each clause is a token list, e.g. clause 1 is
    # ['花', '果園', '中央商務區', 'F4', '棟樓', 'B33', '城', '，']
    for s in [list(jieba.cut(s)) for s in sentences]:
        sentence_idx += 1  # 0 denotes the first clause
        # Positions of the keywords in this clause, e.g. [1, 2, 3, 4, 5], [0, 1] or [0, 1, 2, 4, 5, 7]
        word_idx = []
        for w in topn_words:  # iterate over every high-frequency word
            try:
                word_idx.append(s.index(w))  # position of the first occurrence of w in the clause
            except ValueError:  # w does not occur in this clause
                pass
        word_idx.sort()
        if len(word_idx) == 0:
            continue
        # Group neighbouring positions into clusters using a distance threshold.
        clusters = []  # all clusters of this clause, e.g. [[0, 1, 2], [4, 5], [7]]
        cluster = [word_idx[0]]  # a single cluster, e.g. [0, 1, 2]
        CLUSTER_THRESHOLD = 2  # distance threshold; I use 2
        i = 1
        while i < len(word_idx):
            # Same cluster if the gap to the previous position is below the threshold.
            if word_idx[i] - word_idx[i - 1] < CLUSTER_THRESHOLD:
                cluster.append(word_idx[i])
            else:
                clusters.append(cluster[:])  # store the finished cluster
                cluster = [word_idx[i]]  # start a new cluster
            i += 1
        clusters.append(cluster)
        # Score every cluster; the highest cluster score is the score of the clause.
        max_cluster_score = 0
        for c in clusters:
            significant_words_in_cluster = len(c)  # number of high-frequency words in the cluster
            total_words_in_cluster = c[-1] - c[0] + 1  # inclusive distance from first to last word
            score = 1.0 * significant_words_in_cluster * significant_words_in_cluster / total_words_in_cluster
            if score > max_cluster_score:
                max_cluster_score = score
        # A clause may contain several clusters; store (clause index, highest cluster score).
        scores.append((sentence_idx, max_cluster_score))
    return scores


# Produce the results
def results(texts, topn_wordnum, n):
    # texts: the text; topn_wordnum: number of high-frequency words; n: number of clauses to return
    stopwords = stopwordslist("停用詞.txt")  # load the stop words
    sentence = sent_tokenizer(texts)  # split into clauses
    # Keep words that are not stop words, not single characters and not symbols.
    words = [w for s in sentence for w in jieba.cut(s)
             if w not in stopwords if len(w) > 1 and w != '\t']
    wordfre = nltk.FreqDist(words)  # count word frequencies
    # Take the topn_wordnum most frequent words.
    topn_words = [w[0] for w in sorted(wordfre.items(), key=lambda d: d[1], reverse=True)][:topn_wordnum]
    scored_sentences = score_sentences(sentence, topn_words)  # score the clauses
    # 1. Use the mean and standard deviation to filter out unimportant clauses.
    avg = numpy.mean([s[1] for s in scored_sentences])  # mean score
    std = numpy.std([s[1] for s in scored_sentences])  # standard deviation
    mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
                   if score > (avg + 0.5 * std)]  # sent_idx: clause index; score: its score
    # 2. Return the top n clauses.
    top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-n:]  # sort by score and take n clauses
    top_n_scored = sorted(top_n_scored, key=lambda s: s[0])  # put the selected clauses back in text order
    c = dict(mean_scoredsenteces=[sentence[idx] for (idx, score) in mean_scored])
    c1 = dict(topnsenteces=[sentence[idx] for (idx, score) in top_n_scored])
    return c, c1


if __name__ == '__main__':
    texts = str(input('Enter the text: '))
    topn_wordnum = int(input('Enter the number of high-frequency words: '))
    n = int(input('Enter the number of clauses to return: '))
    c, c1 = results(texts, topn_wordnum, n)
    print(c)
    print(c1)

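A Simple GUI Program

Wrapping the code in a simple interface

This is a very simple GUI program; I could not be bothered to write a more elaborate interface.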

# coding: utf-8
import sys

import nltk
import jieba
import numpy
from PyQt5.QtWidgets import (QApplication, QWidget, QTextEdit, QVBoxLayout,
                             QPushButton, QLabel, QLineEdit, QFormLayout)


# Split the text into clauses
def sent_tokenizer(texts):
    punt_list = ',.!?:;~，。！？：；～'  # punctuation that can end a clause
    sentences = []
    start = 0  # index where the current clause begins
    for i, ch in enumerate(texts):  # walk through every character
        # Look one character ahead: if the next character is also punctuation, do not cut yet.
        next_ch = texts[i + 1] if i + 1 < len(texts) else ''
        if ch in punt_list and next_ch != '' and next_ch not in punt_list:
            sentences.append(texts[start:i + 1])  # the clause ends at this punctuation mark
            start = i + 1  # start now points to the beginning of the next clause
    if start < len(texts):
        sentences.append(texts[start:])  # remaining text at the end
    return sentences


# Load the stop words
def stopwordslist(filepath):
    with open(filepath, 'r', encoding='gbk') as f:
        stopwords = [line.strip() for line in f]
    return stopwords


# Score the clauses
def score_sentences(sentences, topn_words):
    # sentences: list of clauses (the split text); topn_words: list of high-frequency words
    scores = []
    sentence_idx = -1  # clause index, starts at -1
    # Each clause is a token list, e.g. clause 1 is
    # ['花', '果園', '中央商務區', 'F4', '棟樓', 'B33', '城', '，']
    for s in [list(jieba.cut(s)) for s in sentences]:
        sentence_idx += 1  # 0 denotes the first clause
        # Positions of the keywords in this clause, e.g. [1, 2, 3, 4, 5], [0, 1] or [0, 1, 2, 4, 5, 7]
        word_idx = []
        for w in topn_words:  # iterate over every high-frequency word
            try:
                word_idx.append(s.index(w))  # position of the first occurrence of w in the clause
            except ValueError:  # w does not occur in this clause
                pass
        word_idx.sort()
        if len(word_idx) == 0:
            continue
        # Group neighbouring positions into clusters using a distance threshold.
        clusters = []  # all clusters of this clause, e.g. [[0, 1, 2], [4, 5], [7]]
        cluster = [word_idx[0]]  # a single cluster, e.g. [0, 1, 2]
        CLUSTER_THRESHOLD = 2  # distance threshold; I use 2
        i = 1
        while i < len(word_idx):
            # Same cluster if the gap to the previous position is below the threshold.
            if word_idx[i] - word_idx[i - 1] < CLUSTER_THRESHOLD:
                cluster.append(word_idx[i])
            else:
                clusters.append(cluster[:])  # store the finished cluster
                cluster = [word_idx[i]]  # start a new cluster
            i += 1
        clusters.append(cluster)
        # Score every cluster; the highest cluster score is the score of the clause.
        max_cluster_score = 0
        for c in clusters:
            significant_words_in_cluster = len(c)  # number of high-frequency words in the cluster
            total_words_in_cluster = c[-1] - c[0] + 1  # inclusive distance from first to last word
            score = 1.0 * significant_words_in_cluster * significant_words_in_cluster / total_words_in_cluster
            if score > max_cluster_score:
                max_cluster_score = score
        # A clause may contain several clusters; store (clause index, highest cluster score).
        scores.append((sentence_idx, max_cluster_score))
    return scores


# Produce the results
def results(texts, topn_wordnum, n):
    # texts: the text; topn_wordnum: number of high-frequency words; n: number of clauses to return
    stopwords = stopwordslist("停用詞.txt")  # load the stop words
    sentence = sent_tokenizer(texts)  # split into clauses
    # Keep words that are not stop words, not single characters and not symbols.
    words = [w for s in sentence for w in jieba.cut(s)
             if w not in stopwords if len(w) > 1 and w != '\t']
    wordfre = nltk.FreqDist(words)  # count word frequencies
    # Take the topn_wordnum most frequent words.
    topn_words = [w[0] for w in sorted(wordfre.items(), key=lambda d: d[1], reverse=True)][:topn_wordnum]
    scored_sentences = score_sentences(sentence, topn_words)  # score the clauses
    # 1. Use the mean and standard deviation to filter out unimportant clauses.
    avg = numpy.mean([s[1] for s in scored_sentences])  # mean score
    std = numpy.std([s[1] for s in scored_sentences])  # standard deviation
    mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
                   if score > (avg + 0.5 * std)]  # sent_idx: clause index; score: its score
    # 2. Return the top n clauses.
    top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-n:]  # sort by score and take n clauses
    top_n_scored = sorted(top_n_scored, key=lambda s: s[0])  # put the selected clauses back in text order
    c = dict(mean_scoredsenteces=[sentence[idx] for (idx, score) in mean_scored])
    c1 = dict(topnsenteces=[sentence[idx] for (idx, score) in top_n_scored])
    return c, c1


class TextEditDemo(QWidget):
    def __init__(self, parent=None):
        super(TextEditDemo, self).__init__(parent)
        self.setWindowTitle("Chinese Summary Extraction")
        self.resize(500, 570)
        self.label1 = QLabel('Input text')
        self.textEdit1 = QTextEdit()
        self.lineedit1 = QLineEdit()  # number of high-frequency words
        self.lineedit2 = QLineEdit()  # number of clauses to return
        self.btnPress1 = QPushButton("Run")
        self.textEdit2 = QTextEdit()  # shows the result of method 1
        self.textEdit3 = QTextEdit()  # shows the result of method 2
        flo = QFormLayout()  # form layout
        flo.addRow("Number of high-frequency words:", self.lineedit1)
        flo.addRow("Number of clauses to return:", self.lineedit2)
        layout = QVBoxLayout()
        layout.addWidget(self.label1)
        layout.addWidget(self.textEdit1)
        layout.addLayout(flo)
        layout.addWidget(self.btnPress1)
        layout.addWidget(self.textEdit2)
        layout.addWidget(self.textEdit3)
        self.setLayout(layout)
        self.btnPress1.clicked.connect(self.btnPress1_Clicked)

    def btnPress1_Clicked(self):
        try:
            text = self.textEdit1.toPlainText()  # the text that was entered
            topn_wordnum = int(self.lineedit1.text())  # number of high-frequency words, e.g. 20
            n = int(self.lineedit2.text())  # number of clauses to return, e.g. 3
            c, c1 = results(str(text), topn_wordnum, n)
            self.textEdit2.setPlainText(str(c))
            self.textEdit2.setStyleSheet("font:10pt '楷體';border-width:5px;border-style: inset;border-color:gray")
            self.textEdit3.setPlainText(str(c1))
            self.textEdit3.setStyleSheet("font:10pt '楷體';border-width:5px;border-style: inset;border-color:red")
        except Exception:
            # Any failure (for example a non-integer input) is reported in the widgets.
            self.textEdit2.setPlainText('Invalid input')
            self.lineedit1.setText('Invalid input, please enter an integer')
            self.lineedit2.setText('Invalid input, please enter an integer')


if __name__ == "__main__":
    app = QApplication(sys.argv)
    win = TextEditDemo()
    win.show()
    sys.exit(app.exec_())

Author: 電氣余登武
