Multi-Class Text Classification with Deep Learning Using BERT
Most researchers submit their research papers to academic conferences because it is a faster way of making results available. Finding and selecting a suitable conference has always been challenging, especially for young researchers.
However, based on data from previous conference proceedings, researchers can increase their chances of paper acceptance and publication. We will try to solve this text classification problem with deep learning using BERT.
Almost all of the code was taken from this tutorial; the only difference is the data.
The Data
The dataset contains 2,507 research paper titles that have been manually classified into 5 categories (i.e. conferences); it can be downloaded from here.
Explore and Preprocess
See conf_explore.py and Table 1.

df['Conference'].value_counts() (see Figure 1)

You may have noticed that our classes are imbalanced, and we will address this later on.
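A minimal sketch of this step; the CSV file name is an assumption, since the post only references the conf_explore.py gist:

```python
import pandas as pd

# Load the research paper titles; the file name here is an assumption.
df = pd.read_csv('title_conference.csv')
print(df.head())

# Count titles per conference -- this is where the class imbalance shows up.
print(df['Conference'].value_counts())
```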
Encoding the Labels
See label_encoding.py.

df['label'] = df.Conference.replace(label_dict)

Train and Validation Split
Because the labels are imbalanced, we split the data set in a stratified fashion, using the labels as the stratification key.
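A sketch of the stratified split; the validation fraction and random seed below are assumptions, not values taken from the post:

```python
from sklearn.model_selection import train_test_split

# df['label'] was created in the label-encoding step (conference name -> integer id).
X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.15,            # assumption: hold out 15% of titles for validation
    random_state=17,           # assumption: fixed seed for reproducibility
    stratify=df.label.values,  # stratify on the imbalanced class labels
)

# Tag each row so later steps can pick out train vs. validation titles.
df['data_type'] = 'not_set'
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'
```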
Our label distribution will look like this after the split.
See train_test_split.py and Figure 2.

BertTokenizer and Encoding the Data
Tokenization is the process of taking raw text and splitting it into tokens, which are numeric representations of words.
BertTokenizer constructs a BERT tokenizer, based on WordPiece.
- Instantiate a pre-trained BERT model configuration to encode our data.
To convert all the titles from text into encoded form, we use a function called batch_encode_plus, and we will process the train and validation data separately.
- The first parameter inside the above function is the title text.
add_special_tokens=True means the sequences will be encoded with the special tokens relative to their model.
When batching sequences together, we set return_attention_mask=True so the tokenizer also returns the attention mask for each encoded sequence.
- We also want to pad all the titles to a certain maximum length.
We do not strictly need max_length=256; it is set just to play it safe.
return_tensors='pt' returns PyTorch tensors.
And then we need to split the data into input_ids, attention_masks and labels.
- Finally, after we get the encoded data sets, we can create the training data and validation data, as shown in the sketch below.
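Putting those pieces together, a condensed sketch of the encoding step could look like this; the Title column and the data_type marker come from the sketches above, and the padding/truncation flags follow the current transformers API rather than the exact gist:

```python
import torch
from torch.utils.data import TensorDataset
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

def encode_titles(titles):
    # batch_encode_plus turns raw titles into input ids and attention masks.
    return tokenizer.batch_encode_plus(
        list(titles),
        add_special_tokens=True,     # add [CLS] and [SEP] for the model
        return_attention_mask=True,  # mask distinguishes real tokens from padding
        padding='max_length',
        truncation=True,
        max_length=256,              # generous upper bound for paper titles
        return_tensors='pt',         # return PyTorch tensors
    )

encoded_train = encode_titles(df[df.data_type == 'train'].Title.values)
encoded_val = encode_titles(df[df.data_type == 'val'].Title.values)

# Split the encodings into input_ids, attention_masks and labels.
labels_train = torch.tensor(df[df.data_type == 'train'].label.values)
labels_val = torch.tensor(df[df.data_type == 'val'].label.values)

dataset_train = TensorDataset(encoded_train['input_ids'],
                              encoded_train['attention_mask'],
                              labels_train)
dataset_val = TensorDataset(encoded_val['input_ids'],
                            encoded_val['attention_mask'],
                            labels_val)
```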
BERT Pre-trained Model
We are treating each title as its own unique sequence, so each sequence will be classified into one of the five labels (i.e. conferences).
bert-base-uncased is a smaller pre-trained model.
We use num_labels to indicate the number of output labels.
We don’t really care about output_attentions.

We also don’t need output_hidden_states.
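A short sketch of loading the model with these settings; label_dict is the conference-to-integer mapping from the label-encoding step:

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=len(label_dict),   # five conferences
    output_attentions=False,      # we don't need the attention weights
    output_hidden_states=False,   # we don't need the hidden states either
)
```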
Data Loaders
DataLoader combines a dataset and a sampler, and provides an iterable over the given dataset.
We use RandomSampler for training and SequentialSampler for validation.
Given the limited memory in my environment, I set batch_size=3.
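A sketch of the two loaders, built on the TensorDatasets from the encoding step:

```python
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 3  # small batch size because of limited memory

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),    # shuffle training examples
    batch_size=batch_size,
)
dataloader_val = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),  # keep validation order fixed
    batch_size=batch_size,
)
```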
Optimizer & Scheduler
- To construct an optimizer, we have to give it an iterable containing the parameters to optimize. Then, we can specify optimizer-specific options such as the learning rate, epsilon, etc.
I found epochs=5 works well for this data set.
- Create a schedule with a learning rate that decreases linearly from the initial learning rate set in the optimizer to 0, after a warmup period during which it increases linearly from 0 to the initial learning rate set in the optimizer, as in the sketch below.
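A sketch of the optimizer and scheduler described above; the learning rate and epsilon values are assumptions, not values stated in the post:

```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

epochs = 5

# AdamW over all model parameters; lr and eps here are assumptions.
optimizer = AdamW(model.parameters(), lr=1e-5, eps=1e-8)

# Warm up linearly from 0 to the initial lr, then decay linearly back to 0.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=len(dataloader_train) * epochs,
)
```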
Performance Metrics
We will use the F1 score and accuracy per class as performance metrics.
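A sketch of the two metric helpers; the weighted F1 average and the helper names are assumptions about what performance_metrics.py contains:

```python
import numpy as np
from sklearn.metrics import f1_score

def f1_score_func(preds, labels):
    # Weighted F1 over all classes, computed from raw logits.
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

def accuracy_per_class(preds, labels):
    # Per-conference accuracy, so the effect of the class imbalance stays visible.
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat == label]
        y_true = labels_flat[labels_flat == label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds == label])}/{len(y_true)}')
```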
The full implementation is in performance_metrics.py.

Training Loop
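A condensed sketch of one training epoch, assuming the model, loaders, optimizer and scheduler defined above; the checkpoint naming and gradient clipping are assumptions, not details taken from the post:

```python
import torch
from tqdm import tqdm

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

for epoch in range(1, epochs + 1):
    model.train()
    loss_train_total = 0

    for batch in tqdm(dataloader_train, desc=f'Epoch {epoch}'):
        model.zero_grad()
        batch = tuple(t.to(device) for t in batch)
        inputs = {'input_ids': batch[0],
                  'attention_mask': batch[1],
                  'labels': batch[2]}

        outputs = model(**inputs)
        loss = outputs[0]                 # cross-entropy loss from the model head
        loss_train_total += loss.item()

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # keep updates stable
        optimizer.step()
        scheduler.step()

    # Save a checkpoint per epoch so the best one can be reloaded later.
    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')
    print(f'Training loss: {loss_train_total / len(dataloader_train):.3f}')
```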
See training_loop.py and Figure 3.

Loading and Evaluating the Model
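A sketch of reloading a saved checkpoint and scoring it on the validation set; the evaluate helper and the chosen epoch number are assumptions about what loading_evaluating.py does:

```python
import numpy as np
import torch

def evaluate(dataloader):
    model.eval()
    loss_total = 0
    predictions, true_vals = [], []

    for batch in dataloader:
        batch = tuple(t.to(device) for t in batch)
        inputs = {'input_ids': batch[0],
                  'attention_mask': batch[1],
                  'labels': batch[2]}
        with torch.no_grad():
            outputs = model(**inputs)

        loss_total += outputs[0].item()
        predictions.append(outputs[1].detach().cpu().numpy())  # logits
        true_vals.append(inputs['labels'].cpu().numpy())

    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
    return loss_total / len(dataloader), predictions, true_vals

# Reload the checkpoint of the chosen epoch (the epoch number is an assumption).
model.load_state_dict(
    torch.load('finetuned_BERT_epoch_5.model', map_location=device))

_, predictions, true_vals = evaluate(dataloader_val)
print('Weighted F1:', f1_score_func(predictions, true_vals))
accuracy_per_class(predictions, true_vals)
```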
See loading_evaluating.py and Figure 4.

The Jupyter notebook can be found on GitHub. Enjoy the rest of the weekend!
Translated from: https://towardsdatascience.com/multi-class-text-classification-with-deep-learning-using-bert-b59ca2f5c613