Using word2vec
1. Handling short sentences
from gensim.models import Word2Vec

sentences = [["Python", "深度學習", "機器學習"], ["NLP", "深度學習", "機器學習"]]
model = Word2Vec(sentences, min_count=1)
Note: passing a built-in Python list as input is convenient, but when the input is large it consumes a lot of memory.
2. When the corpus is a file
1. Gensim only needs an iterable of sentences, and an iterator is fine. There is no need to keep everything in memory: load one sentence, process it, forget it, then load the next.
2. Usually the corpus is stored in files. First make sure each line of the corpus file corresponds to one sentence (already segmented, tokens separated by spaces), in the format shown above.
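A minimal, self-contained sketch of this format (the file name segmented.txt is made up for illustration): write two pre-segmented sentences to disk, then read them back as token lists.

```python
import os
import tempfile

# Write a tiny corpus in the one-line-one-sentence format
# (tokens already segmented, separated by spaces).
corpus = "Python 深度學習 機器學習\nNLP 深度學習 機器學習\n"
path = os.path.join(tempfile.mkdtemp(), "segmented.txt")
with open(path, "w", encoding="utf8") as f:
    f.write(corpus)

# Reading it back line by line yields exactly the list-of-token-lists
# input that Word2Vec expects.
with open(path, encoding="utf8") as f:
    sentences = [line.split() for line in f]
```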
3. Iterating over every file in a directory
These files are already segmented. If the words still need further preprocessing (removing digits, extracting named entities, and so on), all of that can be done inside the MySentences iterator, so that word2vec receives a fully processed iterator.
import os
import gensim

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # Load one line at a time: process it, yield it, forget it.
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

sentences = MySentences('/some/directory')  # a memory-friendly iterator
model = gensim.models.Word2Vec(sentences)
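A quick way to sanity-check such an iterator is to point it at a throwaway directory; this sketch repeats the class so it runs standalone (the file names are made up):

```python
import os
import tempfile

class MySentences(object):
    # Same memory-friendly iterator as above, repeated so the sketch runs standalone.
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname), encoding="utf8") as f:
                for line in f:
                    yield line.split()

# Point it at a throwaway directory holding two pre-segmented files.
d = tempfile.mkdtemp()
with open(os.path.join(d, "a.txt"), "w", encoding="utf8") as f:
    f.write("Python 深度學習\n")
with open(os.path.join(d, "b.txt"), "w", encoding="utf8") as f:
    f.write("NLP 機器學習\n")

# listdir order is arbitrary, so sort for a deterministic result.
all_sentences = sorted(MySentences(d))
```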
4. For a single file
class: gensim.models.word2vec.LineSentence
Each line corresponds to one sentence (already segmented, tokens separated by spaces), so a plain txt file in this format can be converted directly with LineSentence.
From the LineSentence docstring: "Iterate over a file that contains sentences: one line = one sentence. Words must be already preprocessed and separated by whitespace."
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from gensim.test.utils import get_tmpfile

inp = 'wiki.zh.text.jian.seg.txt'  # input corpus, one segmented sentence per line
sentences = LineSentence(inp)
path = get_tmpfile("word2vec.model")  # create a temporary file for the model
model = Word2Vec(sentences, size=100, window=5, min_count=1)
# note: in gensim >= 4.0 this parameter is named vector_size, not size
model.save(path)
gensim.models.word2vec.LineSentence(source, max_sentence_length=10000, limit=None)
A preprocessing class that caps the length of each sentence (max_sentence_length) and the number of lines read from the file (limit).
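These two parameters are easy to picture with a simplified re-implementation (a sketch of the documented behaviour, not gensim's actual code): limit caps how many lines are read, and a line longer than max_sentence_length tokens is emitted in chunks.

```python
def iter_line_sentences(lines, max_sentence_length=10000, limit=None):
    # Simplified LineSentence: one line = one sentence; long lines are
    # split into chunks of at most max_sentence_length tokens.
    for lineno, line in enumerate(lines):
        if limit is not None and lineno >= limit:
            break
        tokens = line.split()
        for i in range(0, len(tokens), max_sentence_length):
            yield tokens[i:i + max_sentence_length]

chunks = list(iter_line_sentences(["a b c d e"], max_sentence_length=2))
# chunks == [['a', 'b'], ['c', 'd'], ['e']]
```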
With the segmented file in hand, a typical NLP pipeline would remove stopwords next. But the word2vec algorithm relies on context, and that context may itself consist of stopwords, so for word2vec stopword removal can be skipped.
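If stopword removal is wanted anyway, it is a one-line filter applied to each sentence; the stopword set below is a made-up placeholder (in practice load a real list from a file):

```python
# Placeholder stopword set -- in practice load a real list from a file.
STOPWORDS = {"的", "了", "是"}

def remove_stopwords(sentence):
    # Drop stopwords from one token list before feeding it to Word2Vec.
    return [tok for tok in sentence if tok not in STOPWORDS]

filtered = remove_stopwords(["北京", "是", "首都"])
# filtered == ['北京', '首都']
```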
5. Obtaining a corpus
1. https://files-cdn.cnblogs.com/files/pinard/in_the_name_of_people.zip
or
class gensim.models.word2vec.Text8Corpus(fname, max_sentence_length=10000), Bases: object. Iterate over sentences from the "text8" corpus, unzipped from http://mattmahoney.net/dc/text8.zip; the max_sentence_length parameter caps the length of each sentence.
2. Code
import jieba
import gensim
from gensim.models import Word2Vec

# Segment the raw text with jieba and write it out, tokens separated by spaces.
with open("in_the_name_of_people.txt", encoding="utf8") as f:
    document = f.read()
document_cut = jieba.cut(document)
result = " ".join(document_cut)
with open("segment.txt", "w", encoding="utf8") as fout:
    fout.write(result)

sentences = gensim.models.word2vec.LineSentence("segment.txt")
model = Word2Vec(sentences, hs=0, min_count=5, window=5, size=100)
# context window size: window=5
# ignore low-frequency terms: min_count=5
# CBOW or skip-gram? sg=0 means CBOW (the default)
# hierarchical softmax or negative sampling? hs=0 means negative sampling
# number of negative samples: negative=5 (usually set to 5-20)
# smoothing exponent of the negative-sampling distribution: ns_exponent=0.75
# downsampling threshold for high-frequency words: sample=0.001
model.save("word2vec.model")

model = Word2Vec.load("word2vec.model")
for key in model.wv.similar_by_word('檢察院', topn=10):
    print(key)
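similar_by_word ranks words by the cosine similarity of their vectors. A minimal sketch of that computation on toy 3-dimensional vectors (made-up numbers, not the trained model):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product over the product of the vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings standing in for trained word vectors.
vecs = {
    "檢察院": [1.0, 0.2, 0.0],
    "法院": [0.9, 0.3, 0.1],
    "蘋果": [0.0, 0.1, 1.0],
}
query = vecs["檢察院"]
ranked = sorted(
    ((w, cosine(query, v)) for w, v in vecs.items() if w != "檢察院"),
    key=lambda pair: pair[1],
    reverse=True,
)
# ranked[0][0] == '法院': the toy vector pointing the same way wins
```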
Loading a model from a .bin file:
# import torch
# model = gensim.models.KeyedVectors.load_word2vec_format('baike_26g_news_13g_novel_229g.bin', binary=True)
# sentence1 = "北京是中華人民共和國的首都"
# sentence2 = "人民民主"
# cut1 = jieba.cut(sentence1)
# cut2 = jieba.cut(sentence2)
#
# def getNumPyVec(list_cut):
#     # Look up each token's vector and stack them into one tensor.
#     vecList = [model[x] for x in list_cut]
#     torch_list = torch.tensor(vecList)
#     print(torch_list.shape)
#     return torch_list
#
# l1 = getNumPyVec(cut1)
# l2 = getNumPyVec(cut2)
# input feature size input_size, hidden feature size hidden_size, layer count num_layers
# import torch.nn as nn
# lstm = nn.LSTM(128, 20, 4)        # (input_size, hidden_size, num_layers)
# h0 = torch.randn(4, 3, 20)        # (num_layers * 1, batch_size, hidden_size)
# c0 = torch.randn(4, 3, 20)        # (num_layers * 1, batch_size, hidden_size)
# inputs = torch.randn(10, 3, 128)  # (seq_len, batch_size, input_size)
# output, (hn, cn) = lstm(inputs, (h0, c0))