【Reference: nlp-tutorial/Word2Vec-Skipgram.py at master · graykode/nlp-tutorial】
【Reference: PyTorch implementation of Word2Vec - Bilibili】
【Reference: PyTorch implementation of Word2Vec (bare-bones version) - mathor】
Summary:
Build word2idx (the word → index mapping)
Build the training data:
- the words inside the window around a center word C are [C-2, C-1, C, C+1, C+2]
- the (center, context) training pairs are [[C, C-2], [C, C-1], [C, C+1], [C, C+2]]
- np.eye(voc_size) turns each center word into a one-hot vector before it is fed to the model (a minimal sketch follows below)
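A minimal, self-contained sketch of the data construction summarized above, using a made-up toy sequence (the variable names here are illustrative, not the tutorial's):

```python
import numpy as np

toy_sequence = "jack like dog jack like cat".split()
vocab = list(set(toy_sequence))
word2idx = {w: i for i, w in enumerate(vocab)}
voc_size = len(vocab)
C = 2  # window size

pairs = []  # [center_index, context_index] pairs
for idx in range(C, len(toy_sequence) - C):
    center = word2idx[toy_sequence[idx]]
    window = list(range(idx - C, idx)) + list(range(idx + 1, idx + C + 1))
    for j in window:
        pairs.append([center, word2idx[toy_sequence[j]]])

one_hot_center = np.eye(voc_size)[pairs[0][0]]  # one-hot row for the first center word
print(pairs[:4])
print(one_hot_center)
```

The full tutorial code follows the same steps, just on a larger toy corpus.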
```python
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import torch.utils.data as Data

dtype = torch.FloatTensor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

sentences = ["jack like dog", "jack like cat", "jack like animal",
             "dog cat animal", "banana apple cat dog like", "dog fish milk like",
             "dog cat animal like", "jack like apple", "apple like",
             "jack like banana", "apple banana jack movie book music like",
             "cat dog hate", "cat dog like"]

word_sequence = " ".join(sentences).split()     # ['jack', 'like', 'dog', 'jack', 'like', 'cat', ...]
vocab = list(set(word_sequence))                # build the vocabulary
word2idx = {w: i for i, w in enumerate(vocab)}  # e.g. {'jack': 0, 'like': 1, 'dog': 2, ...}
```
Hyper-parameters and skip-gram pair construction:

```python
# Word2Vec parameters
batch_size = 8
embedding_size = 2   # each word is represented by a 2-dimensional vector
C = 2                # window size: the window around a center word C is [C-2, C-1, C, C+1, C+2]
voc_size = len(vocab)

# Build (center word, context word) pairs.
# Iteration must start at index C and stop C words before the end,
# because the window extends C words to each side.
skip_grams = []
for idx in range(C, len(word_sequence) - C):
    # e.g. idx = 2: word_sequence[idx] is 'dog' (word2idx index 2);
    # the two words before and the two words after are 'jack', 'like', 'jack', 'like'
    center = word2idx[word_sequence[idx]]   # center word index
    # positions of the C words before and the C words after the center word, e.g. [0, 1, 3, 4]
    context_idx = list(range(idx - C, idx)) + list(range(idx + 1, idx + C + 1))
    # word_sequence[i] are 'jack', 'like', 'jack', 'like', whose word2idx indices are 0, 1, 0, 1
    context = [word2idx[word_sequence[i]] for i in context_idx]
    for w in context:
        skip_grams.append([center, w])      # skip_grams: [[2, 0], [2, 1], [2, 0], [2, 1], ...]

def make_data(skip_grams):
    input_data = []
    output_data = []
    for i in range(len(skip_grams)):
        # np.eye(voc_size) is the identity matrix; row skip_grams[i][0] is the
        # one-hot vector of the center word (e.g. for the pair [2, 0], row 2 is selected)
        input_data.append(np.eye(voc_size)[skip_grams[i][0]])
        # the label is simply the context word index, e.g. 0
        output_data.append(skip_grams[i][1])
    return input_data, output_data

input_data, output_data = make_data(skip_grams)
```
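The tutorial builds the one-hot inputs with np.eye. As an alternative sketch (not part of the original code), the same inputs can be produced with torch.nn.functional.one_hot, which avoids selecting identity-matrix rows one pair at a time:

```python
import torch
import torch.nn.functional as F

pairs = torch.LongTensor(skip_grams)                 # [num_pairs, 2], built above
input_onehot = F.one_hot(pairs[:, 0], num_classes=voc_size).type(dtype)  # [num_pairs, voc_size]
labels = pairs[:, 1]                                 # context word indices, [num_pairs]
```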
Back to the tutorial's code: wrap the data in a DataLoader and define the model.

```python
input_data, output_data = torch.Tensor(input_data), torch.LongTensor(output_data)
dataset = Data.TensorDataset(input_data, output_data)
loader = Data.DataLoader(dataset, batch_size, True)

# Model
class Word2Vec(nn.Module):
    def __init__(self):
        super(Word2Vec, self).__init__()
        # W and V are NOT each other's transpose.
        # Trick: fix the input and output shapes first, then derive the parameter shapes.
        # X is [batch_size, voc_size] and the hidden layer must be [batch_size, embedding_size],
        # so W has to be [voc_size, embedding_size].
        self.W = nn.Parameter(torch.randn(voc_size, embedding_size).type(dtype))
        # V follows from the same reasoning: [embedding_size, voc_size]
        self.V = nn.Parameter(torch.randn(embedding_size, voc_size).type(dtype))

    def forward(self, X):
        # X : [batch_size, voc_size], one-hot rows; batch_size corresponds to the
        # mini-batch rows in the figure, voc_size to its columns
        # torch.mm only works on 2-D matrices, while torch.matmul works for any dimension
        hidden_layer = torch.matmul(X, self.W)             # [batch_size, embedding_size]
        # one logit per vocabulary word: every word in the dictionary is its own class
        output_layer = torch.matmul(hidden_layer, self.V)  # [batch_size, voc_size]
        return output_layer

model = Word2Vec().to(device)
```
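For comparison only (this is not the tutorial's code), the same two-matrix model can be written with bias-free nn.Linear layers; multiplying a one-hot row by a weight matrix just selects one row of it, so the two formulations are equivalent:

```python
class Word2VecLinear(nn.Module):
    """Sketch of an equivalent model using bias-free Linear layers."""
    def __init__(self):
        super().__init__()
        self.in_proj = nn.Linear(voc_size, embedding_size, bias=False)   # plays the role of W (stored transposed)
        self.out_proj = nn.Linear(embedding_size, voc_size, bias=False)  # plays the role of V (stored transposed)

    def forward(self, X):                  # X : [batch_size, voc_size], one-hot rows
        hidden = self.in_proj(X)           # [batch_size, embedding_size]
        return self.out_proj(hidden)       # [batch_size, voc_size]
```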
Training and visualization of the learned embeddings:

```python
criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Training
for epoch in range(2000):
    for i, (batch_x, batch_y) in enumerate(loader):
        batch_x = batch_x.to(device)
        batch_y = batch_y.to(device)
        pred = model(batch_x)
        loss = criterion(pred, batch_y)
        if (epoch + 1) % 1000 == 0:
            print(epoch + 1, i, loss.item())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Visualize the learned 2-D embeddings
for i, label in enumerate(vocab):
    W, WT = model.parameters()                 # WT is self.V
    x, y = float(W[i][0]), float(W[i][1])      # works because embedding_size = 2
    print(label)
    print(x, y)
    plt.scatter(x, y)
    plt.annotate(label, xy=(x, y), xytext=(5, 2),
                 textcoords='offset points', ha='right', va='bottom')
plt.show()
```

Training output (printed at epochs 1000 and 2000 as `epoch  batch_index  loss`), abbreviated:

```
1000 0 2.187922716140747
1000 1 2.1874611377716064
1000 2 2.1020612716674805
...
2000 18 2.036078453063965
2000 19 1.9239177703857422
2000 20 2.261594772338867
```
Printed word coordinates (label, x, y):

```
animal  -0.5263756513595581   3.4223508834838867
apple   -0.3384515941143036   1.3274422883987427
milk    -1.2358342409133911   0.3438951075077057
hate    -1.556404709815979    9.134812355041504
music    0.31392836570739746  0.2262829840183258
movie    2.375382661819458    1.1577153205871582
dog     -0.9016568064689636   0.2671743929386139
jack    -0.5878503322601318   0.6020950078964233
cat     -0.9074932932853699   0.2849980890750885
banana   0.47850462794303894  1.1545497179031372
book     0.4761728048324585   0.21939511597156525
like    -0.1496874839067459   0.6957748532295227
fish    -2.37762188911438     0.04009028896689415
```
Because the dataset contains many sentences of the form "jack like <animal name>", these words end up close to each other in the embedding space; a quick numerical check is sketched below.
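A small sketch (assuming the trained model above) that ranks words by cosine similarity to 'dog', using the rows of W as the word vectors:

```python
import torch.nn.functional as F

with torch.no_grad():
    emb = F.normalize(model.W.detach().cpu(), dim=1)  # [voc_size, embedding_size], unit-length rows
    sims = emb @ emb[word2idx["dog"]]                 # cosine similarity of every word to 'dog'
    nearest = sims.topk(4).indices.tolist()           # the top hit is 'dog' itself
    print([vocab[i] for i in nearest])                # nearby words such as 'cat' should rank high
```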