Robot Poetry-Writing Project: Data Preprocessing

Published: 2024/5/6

First, here is the complete code:

import collections

start_token = 'G'
end_token = 'E'


def process_poems(file_name):
    # Collected poems
    poems = []
    with open(file_name, "r", encoding='utf-8') as f:
        for line in f.readlines():
            try:
                title, content = line.strip().split(':')
                content = content.replace(' ', '')
                # Skip poems with special characters or with the token letters
                if '_' in content or '(' in content or '（' in content or '《' in content or '[' in content or \
                        start_token in content or end_token in content:
                    continue
                # Skip poems that are too short or too long
                if len(content) < 5 or len(content) > 79:
                    continue
                content = start_token + content + end_token
                poems.append(content)
            except ValueError:
                # The line did not split into exactly a title and a content
                pass
    # Sort the poems by length (the key must measure its own argument, l)
    poems = sorted(poems, key=lambda l: len(l))
    # Count how often each character occurs
    all_words = []
    for poem in poems:
        all_words += [word for word in poem]
    # The Counter maps each character to its frequency
    counter = collections.Counter(all_words)
    count_pairs = sorted(counter.items(), key=lambda x: -x[1])
    words, _ = zip(*count_pairs)
    # Take the most common characters (this slice keeps all of them), plus a space
    words = words[:len(words)] + (' ',)
    # Map each character to an integer ID
    word_int_map = dict(zip(words, range(len(words))))
    poems_vector = [list(map(lambda word: word_int_map.get(word, len(words)), poem)) for poem in poems]
    return poems_vector, word_int_map, words

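Next, a look at the dataset. As the parsing code implies, each line of the file holds one poem in the form title:content.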

Finally, a short walkthrough of the code.

Define a data preprocessing function:

def process_poems(file_name):

First, create an empty list to hold the processed poems:

poems = []

Open the file: pass in the path, open it in read mode, and set the encoding to 'utf-8' because the poems are in Chinese:

with open(file_name, "r", encoding='utf-8', ) as f:

Read the file line by line:

for line in f.readlines():

Split each line at the colon into the poem's title and its content:

title, content = line.strip().split(':')
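The sketch below illustrates why this split sits inside a try/except in the full function. It uses a hypothetical helper, split_title_content, and made-up sample lines, neither of which appears in the original code. Unpacking into two names raises ValueError whenever a line does not contain exactly one colon.

```python
def split_title_content(line):
    """Hypothetical helper: return (title, content), or None if the line is malformed."""
    try:
        title, content = line.strip().split(':')
    except ValueError:
        # Raised when the line has no colon or more than one colon
        return None
    return title, content


samples = [
    "静夜思:床前明月光，疑是地上霜。\n",  # exactly one colon: parses
    "no colon here\n",                    # zero colons: rejected
    "a:b:c\n",                            # two colons: rejected
]
results = [split_title_content(s) for s in samples]
print(results)
```

Only the first sample yields a (title, content) pair; the other two come back as None, which corresponds to the lines the full function silently skips.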

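If a poem in the training set is problematic, discard it. A poem is dropped when it contains special characters (underscore, parentheses, 《, or a square bracket), when it already contains the start or end token letter, or when it is shorter than 5 or longer than 79 characters:

if '_' in content or '(' in content or '（' in content or '《' in content or '[' in content or \
        start_token in content or end_token in content:
    continue
if len(content) < 5 or len(content) > 79:
    continue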
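As a rough illustration, the same checks can be gathered into one predicate. The helper name is_usable, the BAD_CHARS list, and the sample strings are assumptions made for this sketch; the original code performs the checks inline.

```python
start_token = 'G'
end_token = 'E'

# Characters whose presence marks a poem as unusable (same set as the inline condition)
BAD_CHARS = ['_', '(', '（', '《', '[', start_token, end_token]


def is_usable(content):
    """Hypothetical helper: True if the poem passes the character and length checks."""
    if any(ch in content for ch in BAD_CHARS):
        return False
    # Keep only poems of 5 to 79 characters, inclusive
    return 5 <= len(content) <= 79


print(is_usable("床前明月光，疑是地上霜。"))        # clean, 12 characters
print(is_usable("春眠（注）不觉晓，处处闻啼鸟。"))  # contains a full-width parenthesis
print(is_usable("短"))                              # fewer than 5 characters
```

One side effect of rejecting the token letters: any poem text that happens to contain an uppercase G or E is dropped, so the tokens never appear inside a poem body.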

Wrap the poem's content in the start and end tokens, and only then append it to the result list:

content = start_token + content + end_token
poems.append(content)

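Sort the resulting list by poem length. The key function has to measure its own argument; writing len(line) here would refer to the leftover loop variable and leave the order unchanged:

poems = sorted(poems, key=lambda l: len(l))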
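A small sketch of sorting by length on made-up strings. It also shows a pitfall worth checking for: when the key lambda measures some outer variable instead of its own argument, every item gets the same key and the stable sort returns the input order untouched. The variable names here are illustrative only.

```python
poems = ["GabcdefE", "GabE", "GabcdE"]  # made-up poems of different lengths

# The key function receives each poem in turn and measures that poem
by_length = sorted(poems, key=lambda poem: len(poem))
print(by_length)

# A key that ignores its argument gives every poem the same sort key,
# so the stable sort returns the input order unchanged
other = "some unrelated string"
unchanged = sorted(poems, key=lambda poem: len(other))
print(unchanged)
```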

Gather every character so that occurrences can be counted. This takes two nested loops: the outer one walks over the poems, and the inner one walks over the characters of each poem:

all_words = []
for poem in poems:
    all_words += [word for word in poem]

Compute the character frequencies and order the characters from most to least frequent:

counter = collections.Counter(all_words)
count_pairs = sorted(counter.items(), key=lambda x: -x[1])
words, _ = zip(*count_pairs)
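To make the counting and ordering concrete, here is the same sequence of steps run on two made-up strings. The toy input and the name counts are assumptions for the example; the calls themselves are the ones used in the function.

```python
import collections

poems = ["GabaE", "GbaE"]  # made-up poems, already wrapped in G/E tokens

all_words = []
for poem in poems:
    all_words += [word for word in poem]

counter = collections.Counter(all_words)
# Sort (character, count) pairs by descending count
count_pairs = sorted(counter.items(), key=lambda x: -x[1])
# zip(*pairs) transposes the pairs into a tuple of characters and a tuple of counts
words, counts = zip(*count_pairs)
print(words)
print(counts)
```

Because sorted is stable, characters with equal counts keep the order in which they were first seen.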

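Take the most common characters and append a space character. Note that the slice words[:len(words)] keeps every character, so nothing is actually cut off here: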

words = words[:len(words)] + (' ',)
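If the vocabulary should really be limited to the most frequent characters, the slice bound can be replaced by a size limit. The sketch below assumes a hypothetical vocab_size parameter that does not exist in the original code and compares it with the original slice on made-up data.

```python
words = ('a', 'G', 'b', 'E', 'c')  # made-up characters, most frequent first

vocab_size = 3  # hypothetical limit, not present in the original code
top_words = words[:vocab_size] + (' ',)
print(top_words)

# Slicing with len(words), as the original does, keeps every character
all_words_kept = words[:len(words)] + (' ',)
print(all_words_kept)
```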

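Map each character to an integer ID, then convert every poem into a list of IDs. Characters missing from the map fall back to the ID len(words):

word_int_map = dict(zip(words, range(len(words))))
poems_vector = [list(map(lambda word: word_int_map.get(word, len(words)), poem)) for poem in poems]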
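The mapping step can be traced on a toy vocabulary. The data below is made up, and the second poem deliberately contains a character ('z') that is absent from the vocabulary so that the fallback ID becomes visible. A list comprehension stands in for the map/lambda pair and produces the same result.

```python
words = ('a', 'G', 'b', 'E')          # made-up frequency-ordered characters
words = words[:len(words)] + (' ',)   # append a space, as in the original code

# IDs follow frequency order: the most common character gets 0
word_int_map = dict(zip(words, range(len(words))))

poems = ["GabE", "GbzE"]  # 'z' is deliberately missing from the vocabulary

poems_vector = [
    [word_int_map.get(word, len(words)) for word in poem]
    for poem in poems
]
print(word_int_map)
print(poems_vector)
```

In the full function the vocabulary is built from the very poems being encoded, so the fallback ID is never hit there; it only matters if the vocabulary is truncated or reused on other text.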

Return the three results: the poems as ID lists, the character-to-ID map, and the vocabulary tuple:

return poems_vector, word_int_map, words
