[NLP-CNN] Convolutional Neural Networks for Sentence Classification (EMNLP 2014)
1. Overview
This paper applies a CNN to sentence classification. Its main findings:
(1) A CNN on top of static pre-trained word vectors already performs very well, which suggests that the pre-trained vectors act as universal feature extractors that can be reused across many classification tasks.
(2) Fine-tuning the vectors for the specific task yields further gains.
(3) A small change to the architecture allows using both task-specific (fine-tuned) and static vectors.
(4) The model achieves state of the art on 4 of the 7 tasks.
Discussion: the core idea of a CNN is to capture local features. Images are locally correlated, which makes CNNs a natural feature extractor for vision. In NLP, an n-gram of text can be viewed as a window, a span whose words share related features, so a CNN can capture the local features inside this sliding window. The strength of the CNN is that it can combine and select such n-gram features, extracting semantic information at different levels of abstraction.
2. Model
Three points deserve attention in this model:
1. How the CNN is applied, i.e., how convolution is used on text
2. How static and fine-tuned vectors are combined in one architecture
3. The regularization strategy
The overall idea of the paper is fairly simple.
2.1 Applying the CNN
<1> Obtaining the feature map
Each word vector is k-dimensional and the sentence length is n (after padding); the sentence is represented by simply concatenating its word vectors, forming the leftmost matrix in Figure 1.
A convolution filter operates on a window of h words. The window of h words is an h × k region, so a filter w with the same h × k shape is convolved with it.
Adding a bias b and applying a nonlinearity f yields the extracted feature $c_i = f(w \cdot x_{i:i+h-1} + b)$.
Each $c_i$ is the feature produced by one filter over one window. For a sentence of length n, the window slides n - h + 1 times, giving a feature map $c = [c_1, c_2, \ldots, c_{n-h+1}]$.
Clearly, $c$ has dimension n - h + 1. This is the feature map of a single filter; to extract multiple kinds of features, several filters are used, and their window sizes h may differ.
This corresponds to the two leftmost parts of Figure 1, and is sketched in code below.
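The shapes here are easy to get wrong, so a minimal PyTorch sketch of this step may help (tensor names and sizes are illustrative, not from the paper's code):

```python
import torch
import torch.nn as nn

n, k, h = 7, 5, 3             # sentence length, embedding dim, window size
x = torch.randn(1, 1, n, k)   # (batch, channels, n, k): the sentence matrix

# One h x k filter slides along the word axis only, so it produces
# n - h + 1 values: the feature map c for this filter.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=(h, k))
c = torch.relu(conv(x)).squeeze(3)  # -> (1, 1, n - h + 1)
print(c.shape)                      # torch.Size([1, 1, 5])
```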
<2> Max pooling
The max pooling used here is called max-over-time pooling. The "over time" part means the following: as in the figure, the largest value in each feature map is selected into the pooled vector. Each feature map was produced by sliding the window along the text sequence, so for each filter, max pooling keeps the value of the one window (one subsequence of the sentence) whose convolved value is largest. Since both the sentence and the sliding windows are ordered in time, the pooling is "over time".
This is written as $\hat{c} = \max\{c\}$, the maximum value for that filter.
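A corresponding sketch of max-over-time pooling (illustrative values; each row stands for one filter's feature map):

```python
import torch

# Feature maps of 3 filters over a sentence with n - h + 1 = 5 windows.
c = torch.randn(3, 5)          # (num_filters, n - h + 1)
c_hat = c.max(dim=1).values    # one maximum per filter, taken "over time"
print(c_hat.shape)             # torch.Size([3])
```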
<3> Fully connected layer
A fully connected layer then maps the extracted features to the output classes, producing a probability distribution over them.
2.2 Combining static and fine-tuned vectors
In the paper, the sentence is represented as two channels, one built from static vectors and one from fine-tuned vectors. As shown on the far left of Figure 1, there are two matrices: the front matrix is the sentence assembled from static vectors, the back one from fine-tuned vectors. For example, the first row of the front matrix is the static vector of the word "wait", while the first row of the back matrix is its fine-tuned vector.
How is the information of the two channels combined?
The paper's strategy is simple: the same filter extracts features from both channels, and the two resulting values are added together into the feature map. The feature map in Figure 1 is therefore a combination of the features extracted from both kinds of vectors, as sketched below.
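A minimal sketch of this two-channel scheme, assuming the same filter weights are applied to both channels and the pre-activation results are summed (bias handling is simplified here; this is not the paper's released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n, k, h = 7, 5, 3
static_ch = torch.randn(1, 1, n, k)   # sentence built from static vectors
tuned_ch = torch.randn(1, 1, n, k)    # same sentence, fine-tuned vectors

# The same filter is applied to each channel; the two results are added
# before the nonlinearity, so each feature map mixes both representations.
conv = nn.Conv2d(1, 1, (h, k), bias=False)
bias = torch.zeros(1)
c = F.relu(conv(static_ch) + conv(tuned_ch) + bias)  # (1, 1, n - h + 1, 1)
```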
2.3 Regularization strategy
To counter the co-adaptation problem, Hinton et al. proposed dropout. In this paper, the same regularization is applied to the penultimate layer, i.e., the vector obtained after max pooling.
Suppose there are m filters; write the penultimate layer as $z = [\hat{c}_1, \ldots, \hat{c}_m]$.
Without dropout, the linear mapping to an output unit is $y = w \cdot z + b$.
With dropout, it becomes $y = w \cdot (z \circ r) + b$,
where $r$ is an m-dimensional masking vector whose entries are 0 or 1, each drawn from a Bernoulli distribution with probability p of being 1.
During backpropagation, gradients flow only through the units whose mask value is 1.
At test time, the learned weight matrix is scaled by p, i.e., $\hat{w} = pw$, and $\hat{w}$ is used without dropout to score sentences unseen during training.
Additionally, an $l_2$ max-norm constraint can be placed on $w$: after a gradient descent step, if $\|w\|_2 > s$, $w$ is rescaled so that $\|w\|_2 = s$. Both regularizers are sketched below.
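A minimal sketch of both regularizers (shapes and values are illustrative):

```python
import torch

m, p, s = 4, 0.5, 3.0                  # feature maps, keep prob., max-norm s
z = torch.randn(m)                     # penultimate layer after max pooling
w, b = torch.randn(m), torch.zeros(1)

# Training: y = w · (z ∘ r) + b with a Bernoulli(p) mask r.
r = torch.bernoulli(torch.full((m,), p))
y_train = w @ (z * r) + b

# Test: no mask; the weights are scaled instead, w_hat = p * w.
y_test = (p * w) @ z + b

# l2 max-norm: after a gradient step, rescale w if its norm exceeds s.
if w.norm(2) > s:
    w = w * (s / w.norm(2))
```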
3. Datasets and Experimental Setup
3.1 Datasets:
1. MR: Movie reviews with one sentence per review; the task is to classify a review as positive or negative.
2. SST-1: Stanford Sentiment Treebank, an extension of MR with train/dev/test splits provided and fine-grained labels (very positive, positive, neutral, negative, very negative), re-labeled by Socher et al. (2013).
3. SST-2: Same as SST-1 but with neutral reviews removed and binary labels.
4. Subj: Subjectivity dataset where the task is to classify a sentence as subjective or objective (Pang and Lee, 2004).
5. TREC: TREC question dataset; the task is to classify a question into 6 question types (whether the question is about a person, location, numeric information, etc.) (Li and Roth, 2002).
6. CR: Customer reviews of various products (cameras, MP3 players, etc.); the task is to predict positive/negative reviews (Hu and Liu, 2004).
7. MPQA: Opinion polarity detection subtask of the MPQA dataset (Wiebe et al., 2005).
3.2 Hyperparameters and Training
Activation function: ReLU
Window sizes (h): 3, 4, 5, with 100 feature maps each
Dropout p = 0.5
l2 max-norm constraint s = 3
Mini-batch size = 50
These hyperparameters were chosen via grid search on the SST-2 dev set.
Training is done by stochastic gradient descent over shuffled mini-batches
with the Adadelta update rule.
For datasets without a standard dev set, 10% of the training data is randomly held out as the dev set. These settings are collected in the sketch below.
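For reference, the reported settings gathered into one place (a hypothetical config dict, not from any released code):

```python
# Hyperparameters reported in the paper, collected into a (hypothetical) config.
config = {
    "activation": "relu",
    "window_sizes": [3, 4, 5],  # filter heights h
    "feature_maps": 100,        # per window size
    "dropout_p": 0.5,
    "l2_max_norm_s": 3.0,
    "batch_size": 50,
    "optimizer": "adadelta",
}
```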
3.3 Pre-trained Word Vectors
The pre-trained vectors are word2vec vectors trained on 100 billion words of Google News.
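These can be loaded, for example, with gensim (a sketch; the file name assumes the standard 300-dimensional GoogleNews release):

```python
from gensim.models import KeyedVectors

# Load the GoogleNews word2vec binary (file name from the public release).
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)
print(w2v["wait"].shape)  # (300,)
```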
3.4 Model Variations
The model variants in the paper test how the initialization of the word vectors affects performance (a code sketch of the differences follows the list).
CNN-rand: all word vectors randomly initialized
CNN-static: initialized with pre-trained word2vec vectors and kept fixed
CNN-non-static: initialized with word2vec and fine-tuned for the specific task
CNN-multichannel: static and fine-tuned vectors combined, one set per channel
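In PyTorch terms, the variants differ mainly in how the embedding table is initialized and whether it receives gradients; a minimal sketch (illustrative sizes, with random initialization standing in for word2vec):

```python
import torch.nn as nn

vocab_size, dim = 10000, 300  # illustrative sizes

# CNN-static: freeze the table so gradients never update the vectors.
static_embed = nn.Embedding(vocab_size, dim)
static_embed.weight.requires_grad = False

# CNN-non-static: the same table, left trainable (fine-tuned with the task).
non_static_embed = nn.Embedding(vocab_size, dim)

# CNN-multichannel: two tables initialized identically; only one is fine-tuned.
frozen = nn.Embedding(vocab_size, dim)
frozen.weight.requires_grad = False
tuned = nn.Embedding(vocab_size, dim)
tuned.weight.data.copy_(frozen.weight.data)
```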
Results: the three variants with pre-trained vectors outperform the fully random one across the 7 datasets.
Moreover, this simple CNN comes very close to more complex models that rely on parse trees and the like, and it achieves SOTA on SST-2 and CR.
The multichannel variant was proposed in the hope of improving results by preventing overfitting, but the experiments show no clear overall advantage: on some datasets it falls short of the other variants.
4. Code
1. Theano (the paper's original implementation): yoonkim/CNN_sentence: https://github.com/yoonkim/CNN_sentence
2. TensorFlow: dennybritz/cnn-text-classification-tf: https://github.com/dennybritz/cnn-text-classification-tf
3. Keras: alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras: https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras
4. PyTorch: Shawn1993/cnn-text-classification-pytorch: https://github.com/Shawn1993/cnn-text-classification-pytorch
I ran the PyTorch implementation (4) on MR; the best eval accuracy was 73%, below the 77.5% reported on GitHub and the 76.1% in the paper.
On SST, the best eval accuracy was 37%, below the 37.2% reported on GitHub and the 45.0% in the paper.
Here is model.py from that repository:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable


class CNN_Text(nn.Module):

    def __init__(self, args):
        super(CNN_Text, self).__init__()
        self.args = args

        V = args.embed_num    # vocabulary size
        D = args.embed_dim    # embedding dimension
        C = args.class_num    # number of classes
        Ci = 1                # input channels
        Co = args.kernel_num  # feature maps per kernel size
        Ks = args.kernel_sizes

        self.embed = nn.Embedding(V, D)
        # self.convs1 = [nn.Conv2d(Ci, Co, (K, D)) for K in Ks]
        self.convs1 = nn.ModuleList([nn.Conv2d(Ci, Co, (K, D)) for K in Ks])
        '''
        self.conv13 = nn.Conv2d(Ci, Co, (3, D))
        self.conv14 = nn.Conv2d(Ci, Co, (4, D))
        self.conv15 = nn.Conv2d(Ci, Co, (5, D))
        '''
        self.dropout = nn.Dropout(args.dropout)
        self.fc1 = nn.Linear(len(Ks) * Co, C)

    def conv_and_pool(self, x, conv):
        x = F.relu(conv(x)).squeeze(3)  # (N, Co, W)
        x = F.max_pool1d(x, x.size(2)).squeeze(2)
        return x

    def forward(self, x):
        x = self.embed(x)  # (N, W, D)

        if self.args.static:
            x = Variable(x)

        x = x.unsqueeze(1)  # (N, Ci, W, D)

        x = [F.relu(conv(x)).squeeze(3) for conv in self.convs1]  # [(N, Co, W), ...] * len(Ks)

        x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x]    # [(N, Co), ...] * len(Ks)

        x = torch.cat(x, 1)

        '''
        x1 = self.conv_and_pool(x, self.conv13)  # (N, Co)
        x2 = self.conv_and_pool(x, self.conv14)  # (N, Co)
        x3 = self.conv_and_pool(x, self.conv15)  # (N, Co)
        x = torch.cat((x1, x2, x3), 1)           # (N, len(Ks)*Co)
        '''
        x = self.dropout(x)   # (N, len(Ks)*Co)
        logit = self.fc1(x)   # (N, C)
        return logit
```

5. PyTorch: prakashpandey9/Text-Classification-Pytorch: https://github.com/prakashpandey9/Text-Classification-Pytorch
Note that the CNN model under models/ in this repository is a straightforward implementation of the paper, but its main.py needs modification.
The repository uses the IMDB dataset, whose labels are 1 and 2, while PyTorch's loss computation requires targets in the range 0 <= t < n_classes. The labels (1, 2) therefore need to be mapped to (0, 1); otherwise it fails with: "Assertion `t >= 0 && t < n_classes` failed."
This can be done by mapping label 2 to 0:

```python
target = (target != 2)
target = target.long()
```

Because the loss function used in this code is cross_entropy, the target must be converted to long.
For convenience, the complete modified main.py is shown below; the hyperparameters can be changed as needed.
```python
import os
import time
import load_data
import torch
import torch.nn.functional as F
from torch.autograd import Variable
import torch.optim as optim
import numpy as np
from models.LSTM import LSTMClassifier
from models.CNN import CNN

TEXT, vocab_size, word_embeddings, train_iter, valid_iter, test_iter = load_data.load_dataset()

def clip_gradient(model, clip_value):
    params = list(filter(lambda p: p.grad is not None, model.parameters()))
    for p in params:
        p.grad.data.clamp_(-clip_value, clip_value)

def train_model(model, train_iter, epoch):
    total_epoch_loss = 0
    total_epoch_acc = 0

    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    # model.cuda()
    # model.to(device)

    optim = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()))
    steps = 0
    model.train()
    for idx, batch in enumerate(train_iter):
        text = batch.text[0]
        target = batch.label
        ########## Fix for: Assertion `t >= 0 && t < n_classes` failed. ##########
        target = (target != 2)
        target = target.long()
        ###########################################################################
        # target = torch.autograd.Variable(target).long()

        if torch.cuda.is_available():
            text = text.cuda()
            target = target.cuda()

        if text.size()[0] != 32:  # One of the batches returned by BucketIterator has a length different from 32.
            continue
        optim.zero_grad()
        prediction = model(text)

        prediction.to(device)

        loss = loss_fn(prediction, target)
        loss.to(device)

        num_corrects = (torch.max(prediction, 1)[1].view(target.size()).data == target.data).float().sum()
        acc = 100.0 * num_corrects / len(batch)

        loss.backward()
        clip_gradient(model, 1e-1)
        optim.step()
        steps += 1

        if steps % 100 == 0:
            print(f'Epoch: {epoch+1}, Idx: {idx+1}, Training Loss: {loss.item():.4f}, Training Accuracy: {acc.item():.2f}%')

        total_epoch_loss += loss.item()
        total_epoch_acc += acc.item()

    return total_epoch_loss / len(train_iter), total_epoch_acc / len(train_iter)

def eval_model(model, val_iter):
    total_epoch_loss = 0
    total_epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for idx, batch in enumerate(val_iter):
            text = batch.text[0]
            if text.size()[0] != 32:
                continue
            target = batch.label
            # target = torch.autograd.Variable(target).long()

            target = (target != 2)
            target = target.long()

            if torch.cuda.is_available():
                text = text.cuda()
                target = target.cuda()

            prediction = model(text)
            loss = loss_fn(prediction, target)
            num_corrects = (torch.max(prediction, 1)[1].view(target.size()).data == target.data).sum()
            acc = 100.0 * num_corrects / len(batch)
            total_epoch_loss += loss.item()
            total_epoch_acc += acc.item()

    return total_epoch_loss / len(val_iter), total_epoch_acc / len(val_iter)


# learning_rate = 2e-5
# batch_size = 32
# output_size = 2
# hidden_size = 256
# embedding_length = 300

learning_rate = 1e-3
batch_size = 32
output_size = 1
# hidden_size = 256
embedding_length = 300

# model = LSTMClassifier(batch_size, output_size, hidden_size, vocab_size, embedding_length, word_embeddings)

model = CNN(batch_size=batch_size, output_size=2, in_channels=1, out_channels=100,
            kernel_heights=[3, 4, 5], stride=1, padding=0, keep_probab=0.5,
            vocab_size=vocab_size, embedding_length=300, weights=word_embeddings)

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model.to(device)

loss_fn = F.cross_entropy

for epoch in range(1):
    train_loss, train_acc = train_model(model, train_iter, epoch)
    val_loss, val_acc = eval_model(model, valid_iter)

    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc:.2f}%, Val. Loss: {val_loss:.3f}, Val. Acc: {val_acc:.2f}%')

test_loss, test_acc = eval_model(model, test_iter)
print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc:.2f}%')

''' Let us now predict the sentiment on a single sentence just for the testing purpose. '''
test_sen1 = "This is one of the best creation of Nolan. I can say, it's his magnum opus. Loved the soundtrack and especially those creative dialogues."
test_sen2 = "Ohh, such a ridiculous movie. Not gonna recommend it to anyone. Complete waste of time and money."

test_sen1 = TEXT.preprocess(test_sen1)
test_sen1 = [[TEXT.vocab.stoi[x] for x in test_sen1]]

test_sen2 = TEXT.preprocess(test_sen2)
test_sen2 = [[TEXT.vocab.stoi[x] for x in test_sen2]]

test_sen = np.asarray(test_sen2)
test_sen = torch.LongTensor(test_sen)

# test_tensor = Variable(test_sen, volatile=True)

# test_tensor = torch.tensor(test_sen, dtype=torch.long)
# test_tensor.new_tensor(test_sen, requires_grad=False)
test_tensor = test_sen.clone().detach().requires_grad_(False)

test_tensor = test_tensor.cuda()

model.eval()
output = model(test_tensor, 1)
output = output.cuda()
out = F.softmax(output, 1)

if torch.argmax(out[0]) == 0:
    print("Sentiment: Positive")
else:
    print("Sentiment: Negative")
```
Reposted from: https://www.cnblogs.com/shiyublog/p/11210504.html