Dream of the Red Chamber (紅樓夢) in NLP

Published: 2024/3/7 · Programming Q&A · 豆豆

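After a month of studying NLP in fits and starts, I have still only scratched the surface. Today I want to use NLP to work through a novel I love, Dream of the Red Chamber.

The goal of this article is to build a knowledge base for Dream of the Red Chamber:

1. Extract keywords from the speech of the main characters

2.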

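I. Building the Corpus

The corpus is the foundation for the word segmentation and modeling that come later. We build it by writing out the text of each chapter of Dream of the Red Chamber with one sentence per line.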

└─data              # root directory
    ├─chapters      # source chapter texts
    │   01.txt
    │   02.txt
    │   03.txt
    │   04.txt
    │   05.txt
    │   06.txt
    │   07.txt
    │
    └─corpus        # generated corpus files
        01.txt
        02.txt
        03.txt
        04.txt
        05.txt
        06.txt
        07.txt

```python
# construct_corpus.py
import os
import re
from itertools import chain
from string import punctuation

import matplotlib.pyplot as plt
import pandas

# Punctuation and other characters to delete
add_punc = '，。、【】 “”：；（）《》‘’{}？！⑦()、%^>℃：.”“^-——=&#@￥『』'
all_punc = punctuation + add_punc

os.chdir('D:/good_study/NLP/紅樓夢/')
chapters_path = 'D:/good_study/NLP/紅樓夢/data/chapters/'
corpus_path = 'D:/good_study/NLP/紅樓夢/data/corpus/'

# -----------------------------------------------
# 1. Build the corpus chapter by chapter, one sentence per line
# -----------------------------------------------
# List of all chapter files, sorted so the whole-book corpus keeps chapter order
listdir = sorted(os.listdir(chapters_path))
# listdir = listdir[:9]
# Sentence lists of all chapters
sentences_all_list = []
for filename in listdir:
    print("Processing chapter file {}".format(filename))
    chapters_root_path = chapters_path + str(filename)
    # Sentence list of the current chapter
    sentences_list = []
    with open(chapters_root_path, 'r', encoding='utf8') as f:
        for line in f.readlines():
            # Split the line on ，。！；？ to get sentences
            line_split = re.split(r'[，。！；？]', line.strip())
            # Drop leftover delimiters and fragments of one character or less
            line_split = [s.strip() for s in line_split
                          if s.strip() not in ['。', '！', '？', '；'] and len(s.strip()) > 1]
            # Remove English letters and digits (the original pattern had '/d+' instead of '\d+')
            line_split = [re.sub(r'[A-Za-z]|\d+', '', s) for s in line_split]
            # Remove punctuation
            line_split = [''.join(ch for ch in s if ch not in all_punc) for s in line_split]
            # Skip fragments that became empty after cleaning
            sentences_list.append([s for s in line_split if s])
    # chain.from_iterable flattens the nested lists into a single list
    sentences_list = list(chain.from_iterable(sentences_list))
    sentences_all_list.append(sentences_list)
    corpus_root_path = corpus_path + str(filename)
    with open(corpus_root_path, "w", encoding='utf8') as f:
        for line in sentences_list:
            f.write(line)
            f.write('\n')

# Build the whole-book corpus
sentences_all_list = list(chain.from_iterable(sentences_all_list))
corpus_root_path = corpus_path + 'whole_book.txt'
with open(corpus_root_path, "w", encoding='utf8') as f:
    for line in sentences_all_list:
        f.write(line)
        f.write('\n')

# -----------------------------------------------
# 2. Count the characters in each chapter
# -----------------------------------------------
listdir = os.listdir(corpus_path)
line_words_list = []
chapter_list = []
# listdir = listdir[:9]
for filename in listdir:
    # Extract the chapter number; skip files without one, such as whole_book.txt
    match = re.search(r'\d+', filename)
    if match is None:
        continue
    num = int(match.group())
    chapter_list.append(num)
    corpus_root_path = corpus_path + str(filename)
    with open(corpus_root_path, "r", encoding='utf8') as f:
        line_words = 0
        for line in f.readlines():
            # Note: len(line) also counts the trailing newline of each line;
            # use len(line.rstrip('\n')) to count the characters only
            line_words += len(line)
    line_words_list.append(line_words)
    print("File {}: {} characters, parsed chapter number {}".format(filename, line_words, num))

chapter_words = pandas.DataFrame({'chapter': chapter_list, 'chapter_words': line_words_list})
chapter_words.sort_values(by='chapter', ascending=True, inplace=True)
chapter_words = chapter_words.set_index(keys=['chapter'])
chapter_words['chapter_words'].plot(kind='bar', color='g', alpha=0.5, figsize=(20, 15))
plt.show()
```
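To see what the cleaning steps do to a single line of text, similar splitting and filtering can be wrapped in a small function and tried on an in-memory string. This is only a minimal sketch: the function name `split_sentences`, the reduced punctuation set, and the sample line are illustrative assumptions, not part of the script above.

```python
import re
from string import punctuation

# Reduced punctuation set for illustration (the full script uses a longer list)
ALL_PUNC = punctuation + '，。、“”：；（）《》‘’？！'


def split_sentences(line):
    """Split one line of text into cleaned sentence fragments."""
    parts = re.split(r'[，。！；？]', line.strip())
    cleaned = []
    for part in parts:
        # Remove English letters and digits, then any remaining punctuation
        part = re.sub(r'[A-Za-z]|\d+', '', part.strip())
        part = ''.join(ch for ch in part if ch not in ALL_PUNC)
        # Keep only fragments longer than one character
        if len(part) > 1:
            cleaned.append(part)
    return cleaned


sample = '滿紙荒唐言，一把辛酸淚！都云作者癡，誰解其中味？'
for sentence in split_sentences(sample):
    print(sentence)
```

Each printed line corresponds to one line of the corpus file for that chapter.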

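After processing the corpus, the whole book comes to about 820,000 characters. The bar chart produced by the script shows the count for each chapter, with an average of around 7,000 characters per chapter. Like the plot itself, the chapter lengths rise and fall with a certain rhythm. There is a slight peak around chapters 57-78, which is also where the love story of Baoyu and Daiyu reaches its height and the conflicts among the characters pile up one after another.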
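The total, the average, and the chapters that sit above the average can all be derived from the per-chapter counts. A minimal sketch follows; the names `summarize` and `chapter_counts` are hypothetical, and the counts are placeholder values, not the real chapter lengths.

```python
from statistics import mean


def summarize(chapter_counts):
    """Return the total, the mean, and the chapters whose count exceeds the mean."""
    total = sum(chapter_counts.values())
    average = mean(chapter_counts.values())
    above = sorted(ch for ch, n in chapter_counts.items() if n > average)
    return total, average, above


# Placeholder data: chapter number -> character count (NOT the real figures)
chapter_counts = {1: 6000, 2: 5000, 3: 8000, 4: 9000}
total, average, above = summarize(chapter_counts)
print(total, average, above)
```

With the real counts from `chapter_words`, the `above` list would point at the chapters that form the peak in the chart.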

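References: see this link

Chinese-English parallel corpus of Dream of the Red Chamber: http://corpus.usx.edu.cn/hongloumeng/images/shiyongshuoming.htm

Online retrieval system for the modern Chinese + classical Chinese corpus: http://ccl.pku.edu.cn:8080/ccl_corpus/index.jsp?dir=xiandai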

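II. Word Segmentation: Building a Lexicon for Dream of the Red Chamber

Word segmentation methods fall into two groups: rule-based and statistical. We do not yet have a lexicon for Dream of the Red Chamber, and the general-purpose Chinese NLP tools are built on modern Chinese corpora, so they handle classical Chinese poorly. Searching online, I found the 甲言 (Jiayan) package. The name 甲言 is short for 「甲骨文言」, and it is an NLP toolkit dedicated to processing classical Chinese.

The current version supports five features: lexicon construction, automatic word segmentation, part-of-speech tagging, sentence segmentation of classical text, and punctuation. More features are under development.

Fixing the pip install kenlm error on Windows: see this link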
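To contrast with statistical tools such as 甲言, rule-based segmentation can be illustrated with forward maximum matching, which repeatedly takes the longest lexicon word that starts at the current position. The sketch below is a generic illustration and not 甲言's algorithm; the tiny lexicon and the function name `fmm_segment` are assumptions made for the example.

```python
def fmm_segment(text, lexicon, max_len=4):
    """Forward maximum matching: greedily take the longest lexicon word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and fall back to a single character
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical mini-lexicon of names; a real lexicon would be far larger
lexicon = {'賈寶玉', '林黛玉', '寶玉', '黛玉', '賈母'}
print(fmm_segment('寶玉見了黛玉', lexicon))
```

The quality of this approach depends entirely on the lexicon, which is why building a lexicon for the novel comes first.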

2.1 HMM
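
As a rough illustration of the idea, HMM segmentation tags every character as B (begin), M (middle), E (end), or S (single-character word) and picks the most probable tag sequence with Viterbi decoding. The sketch below uses hand-set toy probabilities purely for demonstration. A real model would estimate them from a segmented corpus, and none of the names or numbers here come from 甲言 or from the original article.

```python
import math

STATES = 'BMES'
NEG_INF = float('-inf')
UNSEEN = 1e-6  # toy emission probability for characters a state has never seen

# Toy parameters (assumptions): start, transition, and emission probabilities
START = {'B': math.log(0.6), 'M': NEG_INF, 'E': NEG_INF, 'S': math.log(0.4)}
TRANS = {
    'B': {'M': math.log(0.2), 'E': math.log(0.8)},
    'M': {'M': math.log(0.3), 'E': math.log(0.7)},
    'E': {'B': math.log(0.5), 'S': math.log(0.5)},
    'S': {'B': math.log(0.5), 'S': math.log(0.5)},
}
EMIT = {
    'B': {'寶': 0.5, '黛': 0.5},
    'M': {},
    'E': {'玉': 1.0},
    'S': {'笑': 0.5, '道': 0.5},
}


def emit(state, ch):
    """Log emission probability of character ch in the given state."""
    return math.log(EMIT[state].get(ch, UNSEEN))


def viterbi(text):
    """Return the most probable B/M/E/S tag string for text."""
    if not text:
        return ''
    # scores[t][s]: best log-probability of a tag path ending in state s at position t
    scores = [{s: START[s] + emit(s, text[0]) for s in STATES}]
    back = [{}]
    for ch in text[1:]:
        prev = scores[-1]
        cur, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + TRANS[p].get(s, NEG_INF))
            cur[s] = prev[best] + TRANS[best].get(s, NEG_INF) + emit(s, ch)
            ptr[s] = best
        scores.append(cur)
        back.append(ptr)
    # A complete segmentation has to end with E or S
    tags = [max('ES', key=lambda s: scores[-1][s])]
    for ptr in reversed(back[1:]):
        tags.append(ptr[tags[-1]])
    return ''.join(reversed(tags))


def segment(text):
    """Cut text after every character tagged E or S."""
    tokens, start = [], 0
    for i, tag in enumerate(viterbi(text)):
        if tag in 'ES':
            tokens.append(text[start:i + 1])
            start = i + 1
    return tokens


print(segment('寶玉笑道'))
```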

2.2 CRF

2.3 Measuring Segmentation Consistency
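
The article does not say which measure is intended here. One common way to compare two segmentations of the same text is to turn each into a set of character spans and compute precision, recall, and F1 over the spans they share. The sketch below assumes that approach; `to_spans` and `agreement` are hypothetical helper names.

```python
def to_spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for tok in tokens:
        spans.add((start, start + len(tok)))
        start += len(tok)
    return spans


def agreement(reference, candidate):
    """Precision, recall, and F1 of the candidate spans against the reference spans."""
    ref, cand = to_spans(reference), to_spans(candidate)
    common = len(ref & cand)
    precision = common / len(cand) if cand else 0.0
    recall = common / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if common else 0.0
    return precision, recall, f1


# Two illustrative segmentations of the same six characters
seg_a = ['寶玉', '見', '了', '黛玉']
seg_b = ['寶', '玉', '見了', '黛玉']
print(agreement(seg_a, seg_b))
```

Identical segmentations give 1.0 on all three values, so the score can also be used to compare the output of two tools on the same chapter.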

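III. Named Entity Recognition
IV. Chapter Summaries
V. Chapter Overviews
VI. Chapter Tags
VII. The Social Network of Dream of the Red Chamber
VIII. Chapter Overviews
IX. Chapter Overviews
X. Chapter Overviews

To be continued...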
