
A Simple, Fun NLP Tutorial: Identifying Natural Languages with PyTorch, Step by Step (Code Included)

Published: 2025/3/21


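Leiphone (雷鋒網; WeChat official account: 雷鋒網) editor's note: This article was written by Zhen Ranran (甄冉冉) and first appeared on the author's personal blog. Leiphone republishes it with permission.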

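I have recently been working through practical PyTorch examples. This time, a simple one: given a sentence, decide which language it is written in. For example, given the sentence:

Give it to me

the answer is ENGLISH, and given

me gusta comer en la cafeteria

the answer is SPANISH.

That is all there is to the task.

Let's see how to implement it.

Prepare the data in the format [(sentence, label), ...]

data holds the sentences used for training, and test_data holds the sentences used for testing.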

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch import autograd  # autograd.Variable is deprecated since PyTorch 0.4 but still works

data = [("me gusta comer en la cafeteria".split(), "SPANISH"),
        ("Give it to me".split(), "ENGLISH"),
        ("No creo que sea una buena idea".split(), "SPANISH"),
        ("No it is not a good idea to get lost at sea".split(), "ENGLISH")]

test_data = [("Yo creo que si".split(), "SPANISH"),
             ("it is lost on me".split(), "ENGLISH")]

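Computers cannot work with raw text directly; they only understand strings of 0s and 1s, that is, numbers. So we have to map the text to numbers.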

word_to_ix = {}
for sent, _ in data + test_data:
    for word in sent:
        if word not in word_to_ix:
            # give each new word the next free index
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)

The output, word_to_ix (short for "word to index"), is:

{'me': 0, 'gusta': 1, 'comer': 2, 'en': 3, 'la': 4, 'cafeteria': 5, 'Give': 6, 'it': 7, 'to': 8, 'No': 9, 'creo': 10, 'que': 11, 'sea': 12, 'una': 13, 'buena': 14, 'idea': 15, 'is': 16, 'not': 17, 'a': 18, 'good': 19, 'get': 20, 'lost': 21, 'at': 22, 'Yo': 23, 'si': 24, 'on': 25}

Next we set up, in advance, the parameters that will be used below.

VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 2  # only two classes: ENGLISH and SPANISH

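The standard template

def __init__(self, num_labels, vocab_size): initialization, which fixes the input and output sizes. The input is a sentence. At most, a sentence contains every word in the dictionary, so the input size is vocab_size (how a sentence is turned into a numeric vector using the dictionary is explained below). The output is the classification, here one of 2 classes, so its size is num_labels. We use a linear classifier, nn.Linear().

def forward(self, bow_vec): bow_vec is the numeric vector for one sentence. Passing it through self.linear() gives a linear result (the raw prediction scores), and softmax is then applied to that result. log_softmax is used here because the loss function below is NLLLoss(), the negative log-likelihood loss, which expects log-probabilities as input.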

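class BoWClassifier(nn.Module):  # nn.Module is the base class for PyTorch neural network modules
    def __init__(self, num_labels, vocab_size):
        super(BoWClassifier, self).__init__()
        # linear map from a vocab_size-dimensional vector to num_labels scores
        self.linear = nn.Linear(vocab_size, num_labels)

    def forward(self, bow_vec):
        # dim=1 normalizes over the label dimension of the 1 x num_labels output
        return F.log_softmax(self.linear(bow_vec), dim=1)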
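To make the remark about log_softmax and NLLLoss() concrete, here is a minimal sketch. The scores and the target index are made-up values, not output from the tutorial's model. It checks that log_softmax followed by nn.NLLLoss() gives the same loss as nn.CrossEntropyLoss() applied to the raw scores, which is why the model returns log-probabilities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(1, 2)   # hypothetical raw scores: one sentence, two labels
target = torch.tensor([1])   # hypothetical gold label index

log_probs = F.log_softmax(logits, dim=1)
nll = nn.NLLLoss()(log_probs, target)        # expects log-probabilities
ce = nn.CrossEntropyLoss()(logits, target)   # expects raw scores

print("sum of probabilities:", log_probs.exp().sum().item())
print("NLLLoss:", nll.item(), "CrossEntropyLoss:", ce.item())
```

The two losses agree up to floating-point error, so using nn.Linear with nn.CrossEntropyLoss() directly would be an equivalent design.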

def make_bow_vector(sentence, word_to_ix)

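You can probably guess what this does: it converts a sentence into a numeric vector using word_to_ix. For example, with sentence = 我是一只小小鳥 ("I am a little bird") and word_to_ix = {你:0, 我:1, 他:2, 不:3, 是:4, 大:5, 小:6, 豬:7, 鳥:8}, the result of make_bow_vector is [0,1,0,0,1,0,2,0,1] (ignoring the characters 一 and 只, which are not in this toy dictionary). view() simply changes the shape of the tensor.

Here it reshapes the vector from len(word_to_ix)×1 to 1×len(word_to_ix).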

def make_bow_vector(sentence, word_to_ix):
    vec = torch.zeros(len(word_to_ix))
    for word in sentence:
        vec[word_to_ix[word]] += 1  # count each occurrence of the word
    return vec.view(1, -1)  # reshape to 1 x len(word_to_ix)
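The toy example in the explanation can be reproduced with a few lines of plain Python. This is only a sketch: bow_counts and char_to_ix are hypothetical names introduced for illustration, and unlike make_bow_vector this helper skips tokens that are missing from the dictionary.

```python
# Toy character-level dictionary from the explanation (not the tutorial's word_to_ix)
char_to_ix = {"你": 0, "我": 1, "他": 2, "不": 3, "是": 4,
              "大": 5, "小": 6, "豬": 7, "鳥": 8}


def bow_counts(tokens, token_to_ix):
    """Count occurrences of each dictionary entry; unknown tokens are skipped."""
    counts = [0] * len(token_to_ix)
    for tok in tokens:
        if tok in token_to_ix:
            counts[token_to_ix[tok]] += 1
    return counts


counts = bow_counts(list("我是一只小小鳥"), char_to_ix)
print(counts)  # [0, 1, 0, 0, 1, 0, 2, 0, 1]
```

make_bow_vector itself would raise a KeyError for a word that is not in word_to_ix. The tutorial avoids this by building the dictionary from data + test_data.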

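This one needs little explanation; it works the same way. (In case you are wondering what torch.LongTensor means: the tensor types include ByteTensor (unsigned char), CharTensor (signed char), ShortTensor (short), IntTensor (int), LongTensor (long), FloatTensor (float) and DoubleTensor (double). The statement that values are stored as double by default, and the torch.setdefaulttensortype() call, apply to the original Lua Torch. In PyTorch the default floating-point type is 32-bit float, and it can be changed with torch.set_default_dtype(), for example torch.set_default_dtype(torch.float64).)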

def make_target(label, label_to_ix):
    return torch.LongTensor([label_to_ix[label]])
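As a quick check of the tensor types mentioned above, the sketch below prints the dtype of a LongTensor target and of a freshly created float tensor. The values are arbitrary, and it assumes the default dtype has not been changed elsewhere in the program.

```python
import torch

# Class index 1, e.g. ENGLISH in a {"SPANISH": 0, "ENGLISH": 1} mapping
target = torch.LongTensor([1])
counts = torch.zeros(3)

print(target.dtype)               # 64-bit integer type
print(counts.dtype)               # default floating-point type
print(torch.get_default_dtype())  # what torch.zeros and friends use by default
```

The target is an integer tensor because nn.NLLLoss() expects class indices, not one-hot vectors.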

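One more function worth introducing is model.parameters(). It returns all the parameters of the model. Since we use a linear function here, these are the A and b in f(x) = Ax + b (x being the input data). They are needed later for backpropagation and for the parameter update.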

model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)
for param in model.parameters():
    print("param:", param)

As the output shows, A has shape 2×VOCAB_SIZE (here 2×26) and b has 2 entries (2×1).

param: Parameter containing:
Columns 0 to 9
 0.0786  0.1596  0.1259  0.0054  0.0558 -0.0911 -0.1804 -0.1526 -0.0287 -0.1086
-0.0651 -0.1096 -0.1807 -0.1907 -0.0727 -0.0179  0.1530 -0.0910  0.1943 -0.1148
Columns 10 to 19
 0.0452 -0.0786  0.1776  0.0425  0.1194 -0.1330 -0.1877 -0.0412 -0.0269 -0.1572
-0.0361  0.1909  0.1558  0.1309  0.1461 -0.0822  0.1078 -0.1354 -0.1877  0.0184
Columns 20 to 25
 0.1818 -0.1401  0.1118  0.1002  0.1438  0.0790
 0.1812 -0.1414 -0.1876  0.1569  0.0804 -0.1897
[torch.FloatTensor of size 2x26]

param: Parameter containing:
 0.1859
 0.1245
[torch.FloatTensor of size 2]
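The shapes can also be read off programmatically rather than from the printed values. The sketch below builds a standalone nn.Linear with the tutorial's sizes (26 inputs, 2 outputs) and lists its parameters by name. The variable names layer, shapes and num_params are chosen for illustration.

```python
import torch.nn as nn

layer = nn.Linear(26, 2)  # 26 dictionary entries in, 2 labels out

# named_parameters() yields (name, parameter) pairs: "weight" is A, "bias" is b
shapes = {name: tuple(p.shape) for name, p in layer.named_parameters()}
print(shapes)

num_params = sum(p.numel() for p in layer.parameters())
print(num_params)  # 2 * 26 weights + 2 biases
```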

Now let's see how the model's def forward(self, bow_vec): is used. As in the code below, you simply pass an argument to model(), and that calls the forward function.

sample = data[0]

bow_vector = make_bow_vector(sample[0], word_to_ix)

log_probs = model(autograd.Variable(bow_vector))

print("log_probs", log_probs)

The output is (these are the values after log_softmax):

log_probs Variable containing:

-0.6160 -0.7768

[torch.FloatTensor of size 1x2]
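Because these are log-probabilities, exponentiating them should give two probabilities that add up to about 1. A small sketch using only the standard library and the two values printed above:

```python
import math

log_probs = [-0.6160, -0.7768]  # the two values printed above

probs = [math.exp(lp) for lp in log_probs]
total = sum(probs)

print([round(p, 2) for p in probs])
print(round(total, 3))
```

The probabilities come out at roughly 0.54 and 0.46. With randomly initialized weights the model has no real preference yet, so values near 0.5 are what one would expect.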

Let's look at the predictions on the test set before any training:

label_to_ix = {"SPANISH": 0, "ENGLISH": 1}
for instance, label in test_data:
    bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))
    log_probs = model(bow_vec)
    print(log_probs)
# column of the weight matrix A that corresponds to the word "creo"
print(next(model.parameters())[:, word_to_ix["creo"]])

The result is:

Variable containing:

-0.5431 -0.8698

[torch.FloatTensor of size 1x2]

Variable containing:

-0.7405 -0.6480

[torch.FloatTensor of size 1x2]

Variable containing:
-0.0467
 0.1065
[torch.FloatTensor of size 2]

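Now for the important part.

The training loop and parameter updates

The loss function here is nn.NLLLoss(), the negative log-likelihood loss, and the optimizer is the familiar optim.SGD(), gradient descent. Typically, after 5 to 30 passes over the training data the result basically stops changing. The code below runs 100 epochs.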

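Each step does the following:

a. First call model.zero_grad(). We are about to compute gradients, and PyTorch accumulates them across calls to backward(), so they must be cleared first.

b. Vectorize the data (turn it into a numeric form the computer can work with).

c. Compute the prediction.

d. Compute the loss with loss_function.

e. Compute the gradients with loss.backward().

f. Update the parameters with optimizer.step().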

loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
for epoch in range(100):
    for instance, label in data:
        model.zero_grad()  # a. clear gradients from the previous step
        bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))  # b. vectorize
        target = autograd.Variable(make_target(label, label_to_ix))
        log_probs = model(bow_vec)  # c. predict
        loss = loss_function(log_probs, target)  # d. compute the loss
        loss.backward()  # e. compute gradients
        optimizer.step()  # f. update parameters
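Step a matters because gradients add up if they are not cleared. The following sketch shows this on a tiny tensor that has nothing to do with the classifier. The names w, first, second and third are illustrative.

```python
import torch

w = torch.ones(2, requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()    # gradient of sum(3 * w) with respect to w

(w * 3).sum().backward()  # no reset in between: the new gradient is added
second = w.grad.clone()

w.grad.zero_()            # what model.zero_grad() does for every parameter
(w * 3).sum().backward()
third = w.grad.clone()

print(first, second, third)
```

Without the reset the second gradient is twice the first, so a training loop that skipped zero_grad() would take steps based on stale gradients from earlier sentences.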

Testing on the test set

for instance, label in test_data:
    bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))
    log_probs = model(bow_vec)
    print(log_probs)

In the results it is easy to see that SPANISH has the largest value for the first example and ENGLISH for the second. It worked.

Variable containing:

-0.0842 -2.5161

[torch.FloatTensor of size 1x2]

Variable containing:

-2.4886 -0.0867

[torch.FloatTensor of size 1x2]
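Reading the winner off by eye works for two sentences. One way to do it in code is sketched below, assuming the log-probabilities printed above and the tutorial's label_to_ix. The names ix_to_label, outputs, gold, predictions and accuracy are introduced here for illustration.

```python
import torch

label_to_ix = {"SPANISH": 0, "ENGLISH": 1}
ix_to_label = {ix: label for label, ix in label_to_ix.items()}

# Log-probabilities printed above for the two test sentences
outputs = [torch.tensor([[-0.0842, -2.5161]]),
           torch.tensor([[-2.4886, -0.0867]])]
gold = ["SPANISH", "ENGLISH"]

# argmax over the label dimension picks the most likely class index
predictions = [ix_to_label[out.argmax(dim=1).item()] for out in outputs]
accuracy = sum(p == g for p, g in zip(predictions, gold)) / len(gold)

print(predictions, accuracy)
```

In a full script, outputs would be collected from model(bow_vec) inside the test loop instead of being typed in.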

Author of this article: AI研習社

2017-06-20 15:41
