Chinese Text Classification with LSTM in TensorFlow (Part 1)

Published: 2024/1/8

Preface

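This post uses TensorFlow and an LSTM to classify Chinese text.
The dataset has the following format:
'''
體育	馬曉旭意外受傷讓國奧警惕無奈大雨格外青睞殷家軍記者傅亞雨沈陽報道來到沈陽，國奧隊依然沒有擺脫雨水的困擾。…
'''
Each line starts with the label (here 體育, "Sports"), followed by a tab, and then the text itself.
Goal: given a piece of text, the model predicts which category it belongs to.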
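
As a small illustration of the line format described above, the sketch below splits one line into its label and text. The helper name `parse_line` and the shortened sample string are assumptions made for this example; they are not part of the original code.

```python
def parse_line(line):
    """Split one dataset line into (label, content).

    Hypothetical helper for illustration: assumes the label comes first
    and is separated from the text by a single tab.
    """
    # Split on the first tab only, in case the text itself contains tabs.
    label, content = line.rstrip('\n').split('\t', 1)
    return label, content


# Shortened sample line in the format shown above.
sample = '體育\t馬曉旭意外受傷讓國奧警惕\n'
label, content = parse_line(sample)
print(label)
print(content)
```

Splitting with a maximum of one split keeps the text intact even if it happens to contain a tab of its own.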

Dataset download

The Chinese text classification dataset can be downloaded from: https://download.csdn.net/download/missyougoon/11221027

Text preprocessing

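Segment the Chinese text into words

Convert words to ids, which are later used for the embedding lookup
For example, word A is converted to id 5
Likewise, convert each label to an id

Count word frequencies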
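
The frequency counting and id assignment steps can be sketched in memory as follows. This is a minimal sketch, not the article's implementation: it assumes the text is already segmented into space-separated words, reserves id 0 for a `<UNK>` token, and assigns the remaining ids in order of descending frequency. The function name `build_word_ids` and the toy sentences are made up for illustration.

```python
from collections import Counter


def build_word_ids(segmented_texts, unk_token='<UNK>'):
    """Count word frequencies and assign ids by descending frequency.

    Hypothetical helper: expects each text to be space-separated words.
    """
    counts = Counter()
    for text in segmented_texts:
        # Skip empty strings produced by repeated spaces.
        counts.update(w for w in text.split(' ') if w)

    # Id 0 is reserved for unknown words; frequent words get small ids.
    word_to_id = {unk_token: 0}
    for word, _ in counts.most_common():
        word_to_id[word] = len(word_to_id)
    return counts, word_to_id


# Toy, already-segmented sentences (invented for this example).
texts = ['國奧 隊 來到 沈陽', '國奧 隊 訓練', '國奧 備戰']
counts, word_to_id = build_word_ids(texts)
print(counts['國奧'])      # 3
print(word_to_id['國奧'])  # 1
```

Ordering ids by frequency makes it easy to truncate the vocabulary later by keeping only the first N entries.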

Code demonstration

# -*- coding:utf-8 -*-
import jieba

# Input files
train_file = './news_data/cnews.train.txt'
val_file = './news_data/cnews.val.txt'
test_file = './news_data/cnews.test.txt'

# Segmentation results
seg_train_file = './news_data/cnews.train.seg.txt'
seg_val_file = './news_data/cnews.val.seg.txt'
seg_test_file = './news_data/cnews.test.seg.txt'

# Word-to-id and label-to-id mappings
vocab_file = './news_data/cnews.vocab.txt'
category_file = './news_data/cnews.category.txt'


def generate_seg_file(input_file, output_seg_file):
    '''
    Generate the word-segmented version of a data file.
    :param input_file: input file to be segmented
    :param output_seg_file: output file holding the segmented text
    :return:
    '''
    # encoding='utf-8' assumes Python 3; without it the platform default is used
    with open(input_file, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    with open(output_seg_file, 'w', encoding='utf-8') as f:
        for line in lines:
            # Split on the first tab only, in case the text contains tabs
            label, content = line.strip('\n').split('\t', 1)
            word_iter = jieba.cut(content)
            word_content = ''
            for word in word_iter:
                word = word.strip(' ')
                if word != '':
                    word_content += word + ' '
            # Drop the trailing space
            out_line = '%s\t%s\n' % (label, word_content.strip(' '))
            f.write(out_line)


# Segment the three files
#generate_seg_file(train_file, seg_train_file)
#generate_seg_file(val_file, seg_val_file)
#generate_seg_file(test_file, seg_test_file)


def generate_vocab_file(input_seg_file, output_vocab_file):
    '''
    :param input_seg_file: file that has already been segmented
    :param output_vocab_file: output vocabulary file
    :return:
    '''
    with open(input_seg_file, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    word_dict = {}  # word frequency counts; frequency is all we need here
    for line in lines:
        label, content = line.strip('\n').split('\t', 1)
        for word in content.split(' '):
            word_dict.setdefault(word, 0)  # start at 0 if the word is new
            word_dict[word] += 1
    # dict.items() yields the dictionary's (key, value) pairs
    # Details: http://www.runoob.com/python/att-dictionary-items.html
    sorted_word_dict = sorted(word_dict.items(), key=lambda d: d[1], reverse=True)
    # sorted_word_dict now looks like: [(word, frequency), ..., (word, frequency)]
    with open(output_vocab_file, 'w', encoding='utf-8') as f:
        # Words missing from the vocabulary are replaced by <UNK>
        f.write('<UNK>\t1000000\n')
        for item in sorted_word_dict:
            f.write('%s\t%d\n' % (item[0], item[1]))


#generate_vocab_file(seg_train_file, vocab_file)  # build the vocabulary from the training set


def generate_category_dict(input_file, category_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    category_dict = {}
    for line in lines:
        label, content = line.strip('\n').split('\t', 1)
        category_dict.setdefault(label, 0)
        category_dict[label] += 1
    with open(category_file, 'w', encoding='utf-8') as f:
        for category in category_dict:  # iterating over a dict yields its keys
            line = '%s\n' % category
            print('%s\t%d' % (category, category_dict[category]))
            f.write(line)


generate_category_dict(train_file, category_file)
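
The files written above store one word (with its frequency) or one label per line, so a natural way to obtain ids is to use the line index. The sketch below shows one way this could be done; it is an assumption about how the files might be consumed, not the loader used in Part 2. The helper names `load_vocab`, `load_categories`, and `encode` are hypothetical, and in-memory lists stand in for the real files so the example runs on its own.

```python
def load_vocab(lines):
    """Map each word to its line index in a 'word<TAB>frequency' listing."""
    word_to_id = {}
    for line in lines:
        word = line.rstrip('\n').split('\t')[0]
        if word not in word_to_id:
            word_to_id[word] = len(word_to_id)
    return word_to_id


def load_categories(lines):
    """Map each label to its line index in a one-label-per-line listing."""
    return {line.rstrip('\n'): idx for idx, line in enumerate(lines)}


def encode(segmented_text, word_to_id, unk_token='<UNK>'):
    """Turn space-separated words into ids, falling back to the <UNK> id."""
    unk_id = word_to_id[unk_token]
    return [word_to_id.get(w, unk_id) for w in segmented_text.split(' ') if w]


# In-memory stand-ins for the vocab and category files (invented contents).
vocab_lines = ['<UNK>\t1000000\n', '國奧\t3\n', '隊\t2\n']
category_lines = ['體育\n', '財經\n']

word_to_id = load_vocab(vocab_lines)
category_to_id = load_categories(category_lines)

print(encode('國奧 隊 訓練', word_to_id))  # [1, 2, 0]
print(category_to_id['體育'])              # 0
```

Because both helpers accept any iterable of lines, an open file object (for example one opened on `vocab_file`) can be passed in place of the lists.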

Data preprocessing is now complete. For model training and testing, see: Chinese Text Classification with LSTM in TensorFlow (Part 2)
