Robot Poetry-Writing Project: Data Preprocessing
First, here is the full code:
    import collections

    start_token = 'G'
    end_token = 'E'


    def process_poems(file_name):
        # Collected poems
        poems = []
        with open(file_name, "r", encoding='utf-8') as f:
            for line in f.readlines():
                try:
                    title, content = line.strip().split(':')
                    content = content.replace(' ', '')
                    # Skip poems with unwanted symbols or the reserved start/end tokens
                    if '_' in content or '(' in content or '（' in content or '《' in content or '[' in content or \
                            start_token in content or end_token in content:
                        continue
                    if len(content) < 5 or len(content) > 79:
                        continue
                    content = start_token + content + end_token
                    poems.append(content)
                except ValueError:
                    # Lines that do not split into exactly a title and a content are skipped
                    pass
        # Sort the poems by length
        poems = sorted(poems, key=lambda l: len(l))
        # Count how often each character appears
        all_words = []
        for poem in poems:
            all_words += [word for word in poem]
        # counter maps each character to its frequency
        counter = collections.Counter(all_words)
        count_pairs = sorted(counter.items(), key=lambda x: -x[1])
        words, _ = zip(*count_pairs)
        # Keep the most common characters (here, all of them) and append a space
        words = words[:len(words)] + (' ',)
        # Map each character to a numeric ID
        word_int_map = dict(zip(words, range(len(words))))
        poems_vector = [list(map(lambda word: word_int_map.get(word, len(words)), poem)) for poem in poems]
        return poems_vector, word_int_map, words

Next, take a look at the dataset.
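Judging from how process_poems parses its input, each line of the dataset presumably holds a title and the poem's content separated by a single colon. The sketch below uses an illustrative sample line (an assumption, not a line copied from the actual file) to show what the split produces and why a malformed line ends up in the except branch.

```python
# Illustrative sample line in the assumed "title:content" format
sample_line = "静夜思:床前明月光，疑是地上霜。举头望明月，低头思故乡。\n"

# Same parsing steps as in process_poems
title, content = sample_line.strip().split(':')
content = content.replace(' ', '')

print(title)         # the poem's title
print(len(content))  # character count, punctuation included

# A line without a colon yields a single piece, so unpacking into
# two names raises ValueError and the line is skipped.
skipped = False
try:
    bad_title, bad_content = "a line without a separator".split(':')
except ValueError:
    skipped = True
print(skipped)
```

The length check in the function (5 to 79 characters) applies to the content after spaces are removed and before the start and end tokens are added.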
Finally, a brief analysis of the code.
Define a data preprocessing function:
    def process_poems(file_name):

First, create a list to hold the processed results:
    poems = []

Open the file: pass in the path, open it in read mode, and, because the poems are in Chinese, set the encoding to 'utf-8':
    with open(file_name, "r", encoding='utf-8') as f:

Read the file line by line:
    for line in f.readlines():

Split each line on the colon into the poem's title and content:
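    title, content = line.strip().split(':')

Poems that look problematic are discarded: any poem that contains unwanted symbols or the reserved start/end tokens, or whose content is shorter than 5 or longer than 79 characters: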
    if '_' in content or '(' in content or '（' in content or '《' in content or '[' in content or \
            start_token in content or end_token in content:
        continue
    if len(content) < 5 or len(content) > 79:
        continue

Wrap the content in the start and end tokens, and only then append it to the result list:
    content = start_token + content + end_token
    poems.append(content)

Sort the result list by poem length:
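    # The key must use the lambda's own argument l, not the loop variable line
    poems = sorted(poems, key=lambda l: len(l))

Count how often each character appears, using two loops: the outer loop iterates over the poems and the inner loop over the characters of each poem: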
    all_words = []
    for poem in poems:
        all_words += [word for word in poem]

Compute the character frequencies and order the characters from most to least frequent:
    counter = collections.Counter(all_words)
    count_pairs = sorted(counter.items(), key=lambda x: -x[1])
    words, _ = zip(*count_pairs)

Take the most common characters (the slice [:len(words)] keeps all of them) and append a space:
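    words = words[:len(words)] + (' ',)

Map each character to a numeric ID, then encode every poem as a list of IDs; a character missing from the map falls back to the ID len(words):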
    word_int_map = dict(zip(words, range(len(words))))
    poems_vector = [list(map(lambda word: word_int_map.get(word, len(words)), poem)) for poem in poems]

Return the required values:
    return poems_vector, word_int_map, words
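To make the counting and ID-mapping steps concrete, here is a minimal sketch that runs them on two toy strings instead of real poems. The toy strings are assumptions for illustration only; the variable names follow the function above.

```python
import collections

# Toy stand-ins for cleaned poems, already wrapped in the 'G'/'E' tokens
poems = ["GabbcE", "GabE"]

# Sort by length, shortest first
poems = sorted(poems, key=len)

# Flatten every poem into one list of characters and count them
all_words = [word for poem in poems for word in poem]
counter = collections.Counter(all_words)

# Most frequent first; sorted() is stable, so ties keep first-seen order
count_pairs = sorted(counter.items(), key=lambda x: -x[1])
words, _ = zip(*count_pairs)
words = words + (' ',)

# Character -> ID, then encode each poem as a list of IDs
word_int_map = dict(zip(words, range(len(words))))
poems_vector = [[word_int_map.get(word, len(words)) for word in poem]
                for poem in poems]

# Decoding with the words tuple recovers the original string
decoded = "".join(words[i] for i in poems_vector[0])

print(words)
print(poems_vector)
print(decoded)
```

Because the IDs are assigned in frequency order, the most common character always gets ID 0, and the words tuple doubles as the reverse lookup table from ID back to character.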