Counting Chinese characters in Python: Chinese word frequency statistics

Published: 2023/12/10 python 27 豆豆

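Today I came across a tally of the most frequent words in Jin Yong's novels. Having spent a week learning Python, I wanted to see whether I could reproduce it.

The code comes from the web; I reordered the pieces and stitched them together. Word segmentation is done with the jieba library.

It also fixes the problem of Chinese text showing up garbled when a dict is printed in Python 2.7.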
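
The garbled output comes from Python 2.7 printing the `repr` of a container, which shows non-ASCII strings as escape sequences. Serializing the result with `json.dumps(..., ensure_ascii=False)` keeps the characters readable. A minimal sketch of the effect, written for Python 3 as an assumption (the script in this post targets Python 2.7, where `json.dumps` also accepts an `encoding` argument that Python 3 does not have); the sample pairs are made up for illustration:

```python
import json

# Hypothetical (word, count) pairs standing in for real results.
top_words = [("的", 3), ("说", 2)]

# Default: non-ASCII characters are escaped as \uXXXX.
escaped = json.dumps(top_words)

# ensure_ascii=False keeps the Chinese characters as they are.
readable = json.dumps(top_words, ensure_ascii=False)

print(escaped)   # [["\u7684", 3], ["\u8bf4", 2]]
print(readable)  # [["的", 3], ["说", 2]]
```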

Segmentation code: https://github.com/imwilsonxu/mao

Frequency counting: https://github.com/aolingwen/0006

jieba segmentation library: https://github.com/fxsjy/jieba
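
Put together, these pieces form a two-stage pipeline: segment the text into words while dropping stop words and punctuation, then count what remains. A minimal Python 3 sketch of the first stage follows. To stay runnable without third-party packages it takes the tokenizer as a parameter; `filter_tokens`, the shortened stop-word sets, the sample sentences, and the whitespace tokenizer are all hypothetical stand-ins rather than the author's code:

```python
from typing import Callable, Iterable, Iterator

# Illustrative subsets only; the script uses longer lists.
STOPWORDS = {"的", "了", "是"}
PUNCTUATIONS = {"。", "，", "！"}


def filter_tokens(lines: Iterable[str],
                  cut: Callable[[str], Iterable[str]]) -> Iterator[str]:
    """Yield segmented words, skipping stop words, punctuation and blanks."""
    for line in lines:
        for token in cut(line):
            token = token.strip()
            if token and token not in STOPWORDS and token not in PUNCTUATIONS:
                yield token


# Pre-spaced sample text so that str.split can stand in for a real segmenter.
sample = ["郭靖 是 蒙古 的 英雄 。", "黄蓉 笑 了 ！"]
words = list(filter_tokens(sample, str.split))
print(words)
```

With jieba installed, passing `jieba.cut` as `cut` would segment ordinary unspaced Chinese text; it is the same call the script uses.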

# -*- coding: utf-8 -*-
# Python 2.7 script: segment file_in.txt with jieba, then print the top-N words.
import json
import re
import jieba
from collections import Counter


class StatWords(object):
    def statTopN(self, path, n):
        # Read the segmented file (one word per line) and count each word.
        with open(path, 'r') as f:
            content = f.read()
        wordDict = {}
        wordlist = re.split(r'[\s,;.!]+', content)
        for word in wordlist:
            if not word:
                continue  # skip empty strings left by leading/trailing separators
            if word in wordDict:
                wordDict[word] = wordDict[word] + 1
            else:
                wordDict[word] = 1
        count = Counter(wordDict)
        # ensure_ascii=False prints Chinese characters instead of escape sequences
        print json.dumps(count.most_common()[:n], encoding="UTF-8", ensure_ascii=False)


# Blacklist: stop words and punctuation marks excluded from the count
STOPWORDS = [u'的', u'地', u'得', u'而', u'了', u'在', u'是', u'我', u'有', u'和',
             u'就', u'不', u'人', u'都', u'一', u'一個', u'上', u'也', u'很', u'到', u'說', u'要',
             u'去', u'你', u'會', u'著', u'沒有', u'看', u'好', u'自己', u'這']
PUNCTUATIONS = [u'。', u'，', u'“', u'”', u'…', u'？', u'！', u'、', u'；', u'（',
                u'）', u'?', u'：']

# f_in: the original document; f_out: the document after segmentation
f_in = open('file_in.txt')
f_out = open('file_out.txt', 'w')
try:
    for l in f_in:
        seg_list = jieba.cut(l)
        # print "/".join(seg_list)
        for seg in seg_list:
            if seg not in STOPWORDS and seg not in PUNCTUATIONS:
                # jieba yields unicode; write each word as UTF-8 on its own line
                f_out.write(seg.encode('utf-8', 'strict') + "\n")
finally:
    f_in.close()
    f_out.close()

if __name__ == '__main__':
    s = StatWords()
    s.statTopN("file_out.txt", 10)
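
The counting loop in `statTopN` builds a plain dict by hand and then wraps it in a `Counter`. `Counter` can do the tallying itself when handed an iterable of words. A minimal Python 3 sketch of that alternative, offered as one possible simplification rather than the author's version; `top_n` and the sample word list are hypothetical:

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def top_n(words: Iterable[str], n: int) -> List[Tuple[str, int]]:
    """Return the n most frequent non-empty words with their counts."""
    counts = Counter(word for word in words if word)
    return counts.most_common(n)


# Hypothetical segmented output, one word per element.
segmented = ["江湖", "武功", "江湖", "师父", "武功", "江湖", ""]
result = top_n(segmented, 2)
print(json.dumps(result, ensure_ascii=False))
```

To count a file such as `file_out.txt`, the stripped lines of the opened file could be passed as `words`.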
