
Understanding BERT

Published: 2023/12/14 30 豆豆



- Sentence sentiment classification
    - How is each prediction computed?
    - Code implementation
    - Key steps
    - Model #1
    - Model #2 Train/Test Split

I recently read the walkthrough of the BERT source code published by 機器之心 (Synced) and took the following notes:

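Sentence sentiment classification

First, the overall pipeline, which has two main parts:

1. Process the sentences with BERT. As I understand it, this plays a role similar to an embedding step such as Word2Vec.
2. Attach a downstream model, for example logistic regression or an LSTM.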

First, the model turns each sentence into an embedding.

Then scikit-learn is used to split the data into a training set and a test set.

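How is each prediction computed?

Suppose we have the sentence [ a visually stunning rumination on love ] and want to classify its sentiment. The first step is to use the BERT tokenizer to split the words into tokens (this applies to English too: the WordPiece tokenizer can split a single word into several subword tokens), and then to add [CLS] at the start and [SEP] at the end.

This step takes a single line of code: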

tokenizer.encode("a visually stunning rumination on love", add_special_tokens=True)
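To make the effect of add_special_tokens=True concrete without downloading a model, here is a minimal sketch. The function toy_encode and the vocabulary toy_vocab are hypothetical and made up for illustration; the real BERT tokenizer uses WordPiece with its own vocabulary. Only the IDs 100, 101 and 102 for [UNK], [CLS] and [SEP] follow the bert-base-uncased vocabulary.

```python
# Toy illustration only: a hypothetical whitespace tokenizer with a made-up vocabulary.
# The real BERT tokenizer uses WordPiece and a vocabulary of roughly 30k entries.
UNK_ID, CLS_ID, SEP_ID = 100, 101, 102  # IDs of [UNK], [CLS], [SEP] in bert-base-uncased

toy_vocab = {"a": 1, "visually": 2, "stunning": 3, "rumination": 4, "on": 5, "love": 6}


def toy_encode(sentence, add_special_tokens=True):
    """Map each whitespace-separated word to an ID, optionally wrapped in [CLS] ... [SEP]."""
    ids = [toy_vocab.get(word, UNK_ID) for word in sentence.lower().split()]
    if add_special_tokens:
        ids = [CLS_ID] + ids + [SEP_ID]
    return ids


encoded = toy_encode("a visually stunning rumination on love")
print(encoded)  # [101, 1, 2, 3, 4, 5, 6, 102]
```

The list returned by the real tokenizer has the same layout: the [CLS] ID first, the token IDs in the middle, and the [SEP] ID last.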

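So where does this tokenizer come from? We will come back to that below.

The resulting data can then be passed to BERT.

Putting it together, the whole model can be pictured as a pipeline: the tokenizer, then BERT, then the downstream classifier.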

Code implementation:

import numpy as np
import pandas as pd
import torch
import transformers as ppb # pytorch transformers
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

1. Load the data
The dataset is available at https://github.com/clairett/pytorch-sentiment-classification/. It can be loaded directly into a pandas DataFrame.

df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)

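Looking at the format of the data, each row holds a sentence in column 0 and its label in column 1.

We then check the distribution of the labels.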

2. Load the pre-trained BERT model

# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

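This answers the earlier question: the tokenizer is created directly from tokenizer_class, which is a class provided by the transformers library, and the model likewise comes from a class in transformers.
Because compute on Colab is limited, we only use the first 2000 samples, stored as batch_1.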
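The post does not show how batch_1 is built, so the following is only a sketch under the assumption that it is a plain row slice of the DataFrame. The toy frame below stands in for the SST2 data (its sentences and labels are made up), and n_samples is set to 3 instead of 2000 to keep the output readable.

```python
import pandas as pd

# Toy stand-in for the SST2 frame: column 0 is the sentence, column 1 is the label.
df = pd.DataFrame({
    0: ["great movie", "terrible plot", "loved it", "boring", "a delight"],
    1: [1, 0, 1, 0, 1],
})

n_samples = 3  # the post uses 2000; assumed to be a simple slice of the first rows
batch_1 = df[:n_samples]

# Label distribution of the subset: how many positive (1) and negative (0) sentences.
label_counts = batch_1[1].value_counts()
print(batch_1.shape)
print(label_counts.to_dict())
```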

Key steps

Tokenization
Convert each sentence into BERT's input format: tokens plus [CLS] and [SEP] (there is also a mask mechanism, which deserves a closer look).

tokenized = batch_1[0].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

Padding

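After tokenization we have a list of token ID lists of different lengths. They need to be padded to the same length by filling the shorter ones with 0.
The code: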

max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
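The padding step can be checked on a few short lists. This is a minimal sketch with made-up token IDs; it uses max() instead of the explicit loop, which gives the same max_len.

```python
import numpy as np

# Toy token-ID lists of different lengths (values are made up for illustration).
tokenized = [[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 102]]

# Length of the longest sequence; equivalent to the explicit loop in the post.
max_len = max(len(ids) for ids in tokenized)

# Right-pad every sequence with 0 so that all rows have max_len entries.
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized])

print(padded.shape)  # (3, 5)
print(padded)
```

Without padding, np.array would not produce a regular 2-D matrix here, which is why this step has to come before the conversion to a tensor.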

3. Mask
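The data also needs a mask; otherwise the padding would confuse the model. Concretely, we create a matrix attention_mask with values 0 or 1: it is 1 wherever padded is non-zero and 0 wherever padded is 0.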

attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape
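A small example shows what the mask looks like. The padded matrix below is made up; the row sums of the mask recover the number of real tokens in each sentence.

```python
import numpy as np

# Toy padded matrix: 0 marks padding positions (token IDs are made up).
padded = np.array([
    [101, 7, 8, 102, 0],
    [101, 9, 102, 0, 0],
])

# 1 for real tokens, 0 for padding; same shape as padded.
attention_mask = np.where(padded != 0, 1, 0)

# Number of real (non-padding) tokens per sentence.
real_lengths = attention_mask.sum(axis=1)

print(attention_mask)
print(real_lengths)  # [4 3]
```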

Model #1

We now have the padded matrix, which is the original token matrix padded to equal length, and the attention_mask matrix. Both are converted to torch.tensor.

input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

They are passed directly to the model to obtain the output of BERT:

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

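The first element of the output holds one hidden-state vector per token position for every sentence.

We extract the vector at position 0, which corresponds to [CLS]: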

features = last_hidden_states[0][:,0,:].numpy()
labels = batch_1[1]
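The slice [:,0,:] can be illustrated on a random array with the same layout, (batch size, sequence length, hidden size). This is a sketch only: the array is random NumPy data rather than real model output, and the batch size and sequence length are made up. The hidden size 768 matches the base DistilBERT and BERT models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for last_hidden_states[0]: (batch size, sequence length, hidden size).
batch_size, seq_len, hidden_size = 4, 6, 768
hidden = rng.normal(size=(batch_size, seq_len, hidden_size))

# Keep every sentence (:), only position 0 (the [CLS] slot), and every hidden unit (:).
features = hidden[:, 0, :]

print(features.shape)  # (4, 768)
```

Each sentence is thus reduced to a single fixed-length vector, which is what the downstream classifier takes as input.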

Model #2 Train/Test Split

train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)
lr_clf.score(test_features, test_labels)

The final score is 0.856.
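The post imports cross_val_score but never uses it. The sketch below runs the same split and logistic regression on synthetic features and adds a cross-validated score for comparison. The synthetic data, random_state=0, max_iter=1000 and cv=5 are all assumptions made for this example, so its scores say nothing about the 0.856 reported above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the [CLS] features: noise plus a label-dependent shift.
n_samples, n_features = 200, 16
labels = rng.integers(0, 2, size=n_samples)
features = rng.normal(size=(n_samples, n_features)) + labels[:, None]

# Default split: 75% train, 25% test.
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, random_state=0
)

lr_clf = LogisticRegression(max_iter=1000)
lr_clf.fit(train_features, train_labels)
test_score = lr_clf.score(test_features, test_labels)  # accuracy on the held-out split

# 5-fold cross-validation gives a less split-dependent estimate than a single split.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), features, labels, cv=5)

print(train_features.shape, test_features.shape)
print(test_score, cv_scores.mean())
```

A single train/test split can vary noticeably from run to run on 2000 samples, so averaging over folds is one way to judge whether a score such as 0.856 is stable.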