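How to use Python for word cloud analysis, character-frequency statistics, and character-appearance statistics on classics such as 《三国演义》 (Romance of the Three Kingdoms) and 《红楼梦》 (Dream of the Red Chamber)...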

Published: 2023/12/16

The walkthrough below uses 《紅樓夢》 (Dream of the Red Chamber) as the example.

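Before building the word cloud and running the statistics, a few Python libraries need to be installed: wordcloud, jieba, and imageio. My operating system is Windows 10 and my IDE is IDLE. To install them, search for cmd, open a Command Prompt window, and run pip install wordcloud, then do the same for the other libraries. A successful install looks like this: [screenshot of a successful installation]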
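As a quick sanity check after installing, the following minimal sketch reports which of the three libraries Python cannot find. The helper name `missing_packages` is made up for illustration; it relies only on the standard library's `importlib.util.find_spec`, which returns `None` when a top-level module cannot be located.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


required = ["wordcloud", "jieba", "imageio"]
missing = missing_packages(required)
if missing:
    # Suggest the same pip command described in the text
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```

If anything is listed as missing, rerun pip install for those names in the Command Prompt window.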

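Analyzing these classics requires their e-books, so once the libraries are installed the next step is to download the text. The txt version of 《紅樓夢》 can be downloaded from TXT下載站 (www.txtxzz.com). This is the background image I used: [background image]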
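The code in this post opens the e-book as UTF-8. A downloaded txt file may be saved in a different encoding, though (this is an assumption about such downloads, not something the post states), in which case the read fails with a UnicodeDecodeError. A minimal sketch of a reader that tries several encodings in turn; the function name `read_text` and the fallback list are hypothetical:

```python
def read_text(path, encodings=("utf-8", "gb18030")):
    """Return the file content decoded with the first encoding that works."""
    last_error = None
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError as err:
            # Remember the failure and try the next candidate encoding
            last_error = err
    if last_error is None:
        raise ValueError("no encodings were given")
    raise last_error


# Usage (assumes the file exists in the working directory):
# txt = read_text("紅樓夢.txt")
```

UTF-8 is tried first because decoding non-UTF-8 bytes as UTF-8 usually fails loudly, whereas the reverse order could silently produce garbled text.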

Below is the code I used, with detailed comments included:

import jieba
import wordcloud
from imageio import imread

# 1. Word cloud analysis: generate the word cloud image

def ciyun():
    mask = imread("林黛玉.png")  # load the background (mask) image for the word cloud
    # Open the 《紅樓夢》 txt file; the with-block closes it automatically
    with open('紅樓夢.txt', 'rt', encoding='utf-8') as tf:
        txt = ''
        for line in tf:
            for j in ",.“”?:《》--!":
                line = line.replace(j, '')  # replace returns a new string, so reassign it
            txt += line
    jieba_cut = jieba.lcut(txt)  # segment the full text into words with jieba
    c = wordcloud.WordCloud(width=1200,
                            font_path='msyh.ttc',
                            height=800,
                            background_color='white',
                            mask=mask)  # set canvas size, font, background colour and mask
    c.generate(' '.join(jieba_cut))
    c.to_file('紅樓夢.png')

ciyun()
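Python strings are immutable, so `str.replace` returns a new string and leaves the original untouched; the result has to be kept. When many punctuation marks need to be removed, a single `str.translate` pass is an alternative to a loop of `replace` calls. The sketch below is one possible way to do it; the punctuation set and the name `strip_punctuation` are assumptions for illustration, not taken from the post.

```python
# Punctuation to strip; this particular set is an assumption for illustration.
PUNCTUATION = ",。!?:;、“”《》()…"

# With three arguments, str.maketrans maps every character of the third
# argument to None, which makes translate() delete it.
_DELETE_TABLE = str.maketrans("", "", PUNCTUATION)


def strip_punctuation(text):
    """Return `text` with every character in PUNCTUATION removed."""
    return text.translate(_DELETE_TABLE)


sample = "黛玉笑道:“你瞧瞧,这是什么?”"
print(strip_punctuation(sample))  # prints 黛玉笑道你瞧瞧这是什么
```

The cleaned text can then be passed to `jieba.lcut` exactly as in `ciyun()`.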

# 2. Character-appearance statistics

# Words that would interfere with the ranking; they are removed from the counts so they do not skew the result
excludes = {"什么","一個","我們","那里","你們","如今","說道","知道","起來","姑娘","這里","出來","他們","眾人","自己",
            "一面","只見","怎么","奶奶","兩個","沒有","不是","不知","這個","聽見","這樣","進來","咱們","告訴","就是",
            "東西","襲人","回來","只是","大家","只得","老爺","丫頭","這些","不敢","出去","所以","不過","的話","不好",
            "姐姐","探春","鴛鴦","一時","不能","過來","心里","如此","今日","銀子","幾個","答應","二人","還有","只管",
            "這么","說話","一回","那邊","這話","外頭","打發","自然","今兒","罷了","屋里","那些","聽說","小丫頭","不用","如何"}

with open("紅樓夢.txt", "r", encoding='utf-8') as f:  # open the 《紅樓夢》 txt e-book
    txt = f.read()
words = jieba.lcut(txt)  # segment the full text with jieba
paixv = {}
for word in words:
    if len(word) == 1:  # single-character tokens are mostly particles, so skip them
        continue
    paixv[word] = paixv.get(word, 0) + 1
for word in excludes:
    paixv.pop(word, None)  # drop an interfering word only if it was actually counted
items = list(paixv.items())  # convert the dict to a list of (word, count) pairs
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending
for word, count in items[:20]:  # print the 20 most frequent remaining words (mostly character names)
    print("{0:<10}{1:>5}".format(word, count))
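The count, filter, and sort steps above can also be expressed with `collections.Counter` from the standard library. This is a minimal sketch rather than the author's method; the helper name `top_words` and the tiny token list are made up so the example runs without jieba or the e-book.

```python
from collections import Counter


def top_words(tokens, excludes, n=20):
    """Count tokens longer than one character, ignoring those in `excludes`."""
    counts = Counter(
        tok for tok in tokens
        if len(tok) > 1 and tok not in excludes
    )
    # most_common returns (word, count) pairs sorted by count, descending
    return counts.most_common(n)


# Stand-in for the output of jieba.lcut(txt)
tokens = ["宝玉", "说道", "黛玉", "宝玉", "的", "贾母", "宝玉", "黛玉", "说道"]
excludes = {"说道"}
for word, count in top_words(tokens, excludes, n=3):
    print("{0:<10}{1:>5}".format(word, count))
```

Filtering while counting avoids the separate deletion pass, and asking for more entries than exist simply returns all of them, so a short text cannot cause an index error.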

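# 3. Single-character frequency statistics

import os
import codecs
import jieba
import pandas as pd
from wordcloud import WordCloud
# scipy.misc.imread no longer exists in current SciPy; imageio provides imread
from imageio import imread
import matplotlib.pyplot as plt

os.chdir("/Users/Zhaohaibo/Desktop")  # working directory holding the text files; adjust to your machine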

class Hlm(object):

    def Zipin(self, readdoc, writedoc):  # readdoc: name of the file to read; writedoc: name of the file to write
        word_dict = {}
        exclude_str = ",。!?、()【】<>《》=:+-*—“”…"
        with open(readdoc, "r", encoding="utf-8") as fileIn, open(writedoc, 'w', encoding="utf-8") as fileOut:
            # Count how many times each character occurs
            for line in fileIn:
                for char in line:
                    # skip punctuation and all whitespace (spaces, tabs, newlines)
                    if char in exclude_str or char.isspace():
                        continue
                    word_dict[char] = word_dict.get(char, 0) + 1
            # Sorting on x[1] orders by frequency; x[0] would order by character
            lstWords = sorted(word_dict.items(), key=lambda x: x[1], reverse=True)
            # Output the results (top 100)
            print('字符\t字頻')
            print('=============')
            for e in lstWords[:100]:
                print('%s\t%d' % e)
                fileOut.write('%s, %d\n' % e)

    # Word-frequency table (as a DataFrame)
    def Cipin(self, doc):  # doc: name of the file to read
        wdict = {}
        with open(doc, "r", encoding="utf-8") as f:
            for line in f:
                for w in jieba.cut(line):
                    wdict[w] = wdict.get(w, 0) + 1
        # Load the stop-word list. sep must be a string that is certain not to occur
        # in the list, so that each line is read as a single column.
        stop = pd.read_csv('stoplist.txt', encoding='utf-8', sep='zhao', header=None, engine='python')
        stop.columns = ['word']
        # A bare space is not read in as an entry but still has to be removed, so add it
        # explicitly; stop.word is a Series, hence the conversion to a list
        stop = [' '] + list(stop.word)
        for s in stop:
            wdict.pop(s, None)  # remove the stop word if it was counted
        data = pd.DataFrame({'詞': list(wdict.keys()), '詞頻': list(wdict.values())})
        return data
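The per-character counting done in `Zipin` can be tried in isolation with `collections.Counter`. The following is a simplified sketch under stated assumptions: the name `char_frequencies` and the shortened punctuation set are illustrative, and it works on an in-memory string rather than on files.

```python
from collections import Counter

# Shortened punctuation set for illustration only
EXCLUDE = ",。!?、()《》:“”…"


def char_frequencies(text, exclude=EXCLUDE):
    """Count each character in `text`, skipping whitespace and `exclude`."""
    return Counter(
        ch for ch in text
        if not ch.isspace() and ch not in exclude
    )


freq = char_frequencies("宝玉 说,宝玉\n笑")
# Print characters in descending order of frequency
for ch, count in freq.most_common():
    print("%s\t%d" % (ch, count))
```

Using `str.isspace` skips spaces, tabs, and newlines in one check, so blank characters never enter the table.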

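Screenshots of the final results:

Word cloud: [image]

Appearance statistics: [image]

Character-frequency statistics (the output is long, so only part of it is shown): [image]

That covers the word cloud analysis, character-frequency statistics, and appearance statistics for 《紅樓夢》. I wrote this mainly to record the course design assignment I did yesterday; parts of the code are adapted from other sources.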
