Published: 2024/3/7

Python: Character and Word Frequency Statistics for Dream of the Red Chamber (紅樓夢)

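The text is segmented with the jieba word segmentation tool, and stop words are removed using a stop-word list (punctuation included) downloaded from the web.

The word cloud is generated with wordcloud.

Strictly speaking, degree adverbs and negation words should be removed along with the stop words, but I could not find a suitable word list for them.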
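If suitable degree and negation lists do turn up, they could be merged with the stop-word list before filtering. The following is only a minimal sketch of that idea: the helper names `build_exclusion_set` and `drop_excluded` and the tiny sample lists are hypothetical, and real lists would be read from one-word-per-line files.

```python
from typing import Iterable, List, Set


def build_exclusion_set(*word_lists: Iterable[str]) -> Set[str]:
    """Merge several word lists into one set of words to drop."""
    excluded: Set[str] = set()
    for words in word_lists:
        for raw in words:
            word = raw.strip()
            if word:  # ignore blank lines
                excluded.add(word)
    return excluded


def drop_excluded(tokens: Iterable[str], excluded: Set[str]) -> List[str]:
    """Keep tokens that are not blank and not in the exclusion set."""
    return [t for t in tokens if t.strip() and t not in excluded]


# Hypothetical miniature lists, for illustration only.
stop_words = ["的", "了", "，"]
degree_words = ["很", "非常"]
negation_words = ["不", "沒"]

excluded = build_exclusion_set(stop_words, degree_words, negation_words)
tokens = ["寶玉", "很", "不", "高興", "了", "，", " "]
filtered = drop_excluded(tokens, excluded)
print(filtered)
```

Using a set keeps each membership test cheap even when the combined list is long.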

Character frequency:

Character  Count

了 13993

的 9862

不 9439

一 8160

道 7271

來 7189

人 6626

我 6204

是 6145

說 6108

這 5204

他 5193

你 5004

兒 4756

著 4105

去 4006

個 3983

也 3933

子 3896

玉 3877

有 3738

寶 3733

又 3677

們 3321

笑 3285

里 3084

賈 3005

只 2962

那 2932

見 2930

上 2594

好 2555

姐 2504

太 2466

家 2447

在 2401

就 2368

么 2363

便 2247

大 2219

頭 2149

日 2113

聽 2054

出 2051

得 2019

老 1994

下 1945

要 1866

都 1844

回 1800

知 1762

二 1755

兩 1660

過 1635

還 1618

事 1580

自 1536

如 1536

話 1516

心 1502

小 1491

起 1483

看 1473

沒 1448

因 1440

母 1425

鳳 1407

忙 1400

之 1382

呢 1370

叫 1366

今 1353

面 1350

時 1344

些 1308

娘 1300

可 1292

等 1259

問 1242

奶 1195

什 1175

到 1165

才 1158

中 1135

吃 1123

和 1118

罷 1075

此 1034

打 1023

想 1008

夫 992

倒 989

正 950

別 949

給 937

黛 931

進 927

幾 913
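A table like the one above can also be produced with `collections.Counter` from the standard library. This is a small sketch rather than the article's own implementation: the function name `char_frequencies` and the sample string are made up for illustration, and the punctuation string is the one used in the article's code.

```python
from collections import Counter

# Punctuation to ignore (the same string the article's code excludes).
PUNCTUATION = "，。！？、()【】<>《》=：+-*—“”…"


def char_frequencies(text: str, exclude: str = PUNCTUATION) -> Counter:
    """Count every character except whitespace and excluded punctuation."""
    return Counter(ch for ch in text if not ch.isspace() and ch not in exclude)


# Made-up sample text; a real run would read the whole novel from a file.
sample = "寶玉笑道：“你來了。”\n黛玉道：“我來了。”"
counts = char_frequencies(sample)
for char, freq in counts.most_common(3):
    print("%s\t%d" % (char, freq))
```

`Counter.most_common(n)` returns the n most frequent entries, so no separate sort step is needed.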

Word frequency:

The DataFrame is not reproduced here.

First 80 chapters:

Last 40 chapters:
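The article does not show how the text was divided at chapter 80. One possible approach is sketched below, assuming every chapter starts with a heading line of the form 第…回 written in Chinese numerals. That layout is an assumption about the txt file, and `split_chapters` and `split_at` are hypothetical helpers.

```python
import re
from typing import List, Tuple

# Assumed heading format: a line that starts with 第 + Chinese numerals + 回.
CHAPTER_HEADING = re.compile(r"^第[零〇一二三四五六七八九十百]+回", re.MULTILINE)


def split_chapters(text: str) -> List[str]:
    """Split the text into chapters, each beginning at its heading line.

    Anything before the first heading (a preface, say) is dropped.
    """
    starts = [m.start() for m in CHAPTER_HEADING.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]


def split_at(text: str, boundary: int = 80) -> Tuple[str, str]:
    """Return (first `boundary` chapters, remaining chapters) as two strings."""
    chapters = split_chapters(text)
    return "".join(chapters[:boundary]), "".join(chapters[boundary:])


# Made-up three-chapter sample, split after chapter 2 for demonstration.
sample = "第一回 甲\n內容一\n第二回 乙\n內容二\n第三回 丙\n內容三\n"
first, rest = split_at(sample, boundary=2)
print(len(split_chapters(sample)))
```

Each half can then be written to its own file and analysed separately. A body line that happens to start with the same pattern would be mistaken for a heading, so the chapter count is worth checking before trusting the split.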

# To use this code, point os.chdir at the directory that contains the txt file.
# Zipin needs the txt file to be in an encoding that open() can decode (UTF-8 is assumed here).
# Cipin needs the jieba library.
# The word cloud function (Ciyun) needs the wordcloud and matplotlib libraries.

import os

import jieba
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image
from wordcloud import WordCloud

os.chdir("/Users/Zhaohaibo/Desktop")


class Hlm(object):

    # ---------------------------------------------
    # Zipin(self, readdoc, writedoc)
    # readdoc: name of the file to read
    # writedoc: name of the file to write
    # output:
    # prints the 100 most frequent characters and writes them to writedoc
    # ---------------------------------------------
    def Zipin(self, readdoc, writedoc):
        word_dict = {}
        exclude_str = "，。！？、()【】<>《》=：+-*—“”…"
        with open(readdoc, "r", encoding="utf-8") as fileIn, \
                open(writedoc, "w", encoding="utf-8") as fileOut:
            # Count each character with a dictionary
            for line in fileIn:
                for char in line:
                    # Skip punctuation and every kind of whitespace
                    if char in exclude_str or not char.strip():
                        continue
                    word_dict[char] = word_dict.get(char, 0) + 1
            # Sorting on x[1] orders by frequency; x[0] would order by character
            lstWords = sorted(word_dict.items(), key=lambda x: x[1], reverse=True)
            # Output the result (top 100)
            print('字符\t字頻')  # header: character, frequency
            print('=============')
            for e in lstWords[:100]:
                print('%s\t%d' % e)
                fileOut.write('%s, %d\n' % e)

    # ---------------------------------------------
    # Cipin(self, doc)
    # doc: name of the file to read
    # return:
    # word frequency table (as a DataFrame)
    # ---------------------------------------------
    def Cipin(self, doc):
        wdict = {}
        with open(doc, "r", encoding="utf-8") as f:
            for line in f:
                for w in jieba.cut(line):
                    wdict[w] = wdict.get(w, 0) + 1
        # Load the stop-word list (one word per line)
        with open('stoplist.txt', 'r', encoding='utf-8') as f:
            stop = [line.rstrip('\r\n') for line in f if line.rstrip('\r\n')]
        # A plain space also has to be removed, so add it explicitly
        stop = [' '] + stop
        for w in stop:
            wdict.pop(w, None)
        # Columns: 詞 (word) and 詞頻 (word frequency)
        data = pd.DataFrame({'詞': list(wdict.keys()), '詞頻': list(wdict.values())})
        return data

    # ---------------------------------------------
    # Ciyun(self, doc)
    # doc: name of the file to read
    # output:
    # word cloud image
    # ---------------------------------------------
    def Ciyun(self, doc):
        with open(doc, "r", encoding="utf-8") as f:
            g = f.read()
        back_pic = np.array(Image.open("aixin.jpg"))  # mask image
        wc = WordCloud(
            font_path='/System/Library/Fonts/STHeiti Medium.ttc',  # font
            background_color="white",  # background color
            max_words=2000,  # maximum number of words shown in the cloud
            mask=back_pic,  # mask image
            max_font_size=200,  # largest font size
            random_state=42,
        ).generate(g)
        plt.figure(figsize=(64, 32))
        plt.imshow(wc)
        plt.axis('off')
        plt.savefig("ciyun.jpg")
        plt.show()

    def main(self, readdoc):
        # self.Zipin(readdoc, writedoc)
        df = self.Cipin(readdoc)
        # self.Ciyun(readdoc)
        return df


if __name__ == '__main__':
    hlm = Hlm()
    hlm.Zipin("紅樓夢.txt", "紅樓夢字頻.txt")
    df_hlm1 = hlm.main("紅樓夢.txt")
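The table returned by `main` is not sorted, so the most frequent words still have to be picked out. As a rough standard-library sketch of the same counting idea with the ranking included, the hypothetical function `top_words` below takes the tokenizer as a parameter. `str.split` stands in for it here so the example runs without jieba, and the sample lines and stop words are made up.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set, Tuple


def top_words(lines: Iterable[str],
              tokenize: Callable[[str], Iterable[str]],
              stop_words: Set[str],
              n: int = 20) -> List[Tuple[str, int]]:
    """Count tokens over all lines, skip stop words, return the n most common."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(
            tok for tok in tokenize(line)
            if tok.strip() and tok not in stop_words
        )
    return counts.most_common(n)


# Pre-segmented sample lines; str.split stands in for a real tokenizer.
lines = ["寶玉 笑 道 你 來 了", "黛玉 笑 道 我 來 了"]
result = top_words(lines, str.split, {"了", "你", "我"}, n=3)
print(result)
```

With jieba installed, passing `jieba.lcut` as `tokenize` should give the same kind of result on unsegmented text.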
