AI NLP Fundamentals, Text Classification: Implementing BERT Text Classification with pytorch-transformers
1. Dataset Introduction
A Chinese text classification dataset.
Data source: the Toutiao (今日頭條) news client.
Data format: one record per line, with five fields separated by _!_; from left to right they are: news ID, category code (see below), category name (see below), news string (title only), and news keywords.
Category codes and names (code, category, alias, English identifier):

100  民生  故事  news_story
101  文化  文化  news_culture
102  娛樂  娛樂  news_entertainment
103  體育  體育  news_sports
104  財經  財經  news_finance
106  房產  房產  news_house
107  汽車  汽車  news_car
108  教育  教育  news_edu
109  科技  科技  news_tech
110  軍事  軍事  news_military
112  旅游  旅游  news_travel
113  國際  國際  news_world
114  證券  股票  stock
115  農業  三農  news_agriculture
116  電競  游戲  news_game
Dataset size: 382,688 entries in total, spread across the 15 categories above.
Experimental split: the data is divided 80% / 10% / 10% into training, validation, and test sets.
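Before training, the raw Toutiao records have to be flattened into the title/label pairs that the training script reads from new_text.txt. A minimal preprocessing sketch; the raw file name toutiao_cat_data.txt is an assumption, and integer labels are assigned in order of first appearance of each category code:

# Minimal preprocessing sketch. The raw file name is an assumption; the
# output name "new_text.txt" matches the training script below. Integer
# labels are assigned in order of first appearance of each category code.
code2label = {}
with open("toutiao_cat_data.txt", "r", encoding="utf-8") as fin, \
        open("new_text.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fields = line.rstrip("\n").split("_!_")  # ID, code, name, title, keywords
        if len(fields) < 4:
            continue                             # skip malformed lines
        code, title = fields[1], fields[3]
        label = code2label.setdefault(code, len(code2label))
        # the loader below splits on a single space, so titles are assumed
        # to contain no spaces
        fout.write("{} {}\n".format(title, label))
print(code2label)                                # e.g. {'102': 0, '103': 1, ...}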
Sample of the processed data (one "title label" pair per line):
有哪些偏冷門的歌曲推薦? 0
“整容狂人”的審美,恕欣賞不來 0
吳卓林:你父母固然有責任,但最大的責任還是在于你自己! 0
《天乩戰之白蛇傳說》趙雅芝和楊紫飾演母女會擦出什么樣的火花? 0
他是最帥反派專業戶,演《古惑仔》大火,今病魔纏身可憐無人識! 0
如果今年勇士奪冠,下賽季詹姆斯何去何從? 1
超級替補!科斯塔本賽季替補出場貢獻7次助攻 1
騎士6天里發生了啥?從首輪搶七到次輪3-0猛龍 1
如果朗多進入轉會市場,哪些球隊適合他? 1
詹姆斯G3決殺,你怎么看? 1
Import the required packages:

import time
import warnings

import numpy as np
import torch
import torch.nn as nn
from tqdm import tqdm
from sklearn.metrics import accuracy_score, classification_report
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
# note: the `transformers` package is the successor of `pytorch-transformers`
from transformers import BertModel, BertConfig, BertTokenizer, AdamW, get_cosine_schedule_with_warmup

warnings.filterwarnings('ignore')
Hyperparameter configuration:

bert_path = "bert_model/s"  # folder containing 'vocab.txt', 'pytorch_model.bin', 'config.json'
tokenizer = BertTokenizer.from_pretrained(bert_path)  # initialize the tokenizer

input_ids, input_masks, input_types = [], [], []  # char ids, attention masks, segment type ids
labels = []       # labels
maxlen = 128      # maximum sequence length
EPOCHS = 300      # note: 300 is unusually high for BERT fine-tuning; 2-4 epochs often suffice
BATCH_SIZE = 128  # reduce this if you run into OOM errors
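To see exactly what encode_plus produces for one headline before building the full arrays, a quick check (the concrete ids depend on the vocab.txt shipped with the checkpoint in bert_path):

# Inspect encode_plus output for a single title from the processed sample.
demo = tokenizer.encode_plus(text="如果今年勇士奪冠,下賽季詹姆斯何去何從?",
                             max_length=maxlen, padding='max_length',
                             truncation=True)
print(list(demo.keys()))       # ['input_ids', 'token_type_ids', 'attention_mask']
print(len(demo['input_ids']))  # 128 -- padded/truncated to maxlen
print(demo['input_ids'][:8])   # [CLS] id followed by per-character ids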
Data processing:

with open("new_text.txt", 'r', encoding='utf-8') as f:
    for i in f:
        # each line is "title<space>label"
        parts = i.rstrip('\n').split(' ')
        title, y = parts[0], parts[1]
        # encode_plus returns a dict with 'input_ids', 'token_type_ids' and
        # 'attention_mask'; sequences are padded to maxlen if shorter and
        # truncated if longer
        encode_dict = tokenizer.encode_plus(text=title, max_length=maxlen,
                                            padding='max_length', truncation=True)
        input_ids.append(encode_dict['input_ids'])
        input_types.append(encode_dict['token_type_ids'])
        input_masks.append(encode_dict['attention_mask'])
        labels.append(int(y))

input_ids, input_types, input_masks = np.array(input_ids), np.array(input_types), np.array(input_masks)
labels = np.array(labels)
print(input_ids.shape, input_types.shape, input_masks.shape, labels.shape)

# shuffle the indices
idxes = np.arange(input_ids.shape[0])
np.random.seed(2019)  # fix the random seed
np.random.shuffle(idxes)
print(idxes.shape, idxes[:10])

# 8:1:1 split into training, validation and test sets
input_ids_train, input_ids_valid, input_ids_test = input_ids[idxes[:186959]], input_ids[idxes[186959:210329]], input_ids[idxes[210329:]]
input_masks_train, input_masks_valid, input_masks_test = input_masks[idxes[:186959]], input_masks[idxes[186959:210329]], input_masks[idxes[210329:]]
input_types_train, input_types_valid, input_types_test = input_types[idxes[:186959]], input_types[idxes[186959:210329]], input_types[idxes[210329:]]
y_train, y_valid, y_test = labels[idxes[:186959]], labels[idxes[186959:210329]], labels[idxes[210329:]]
print(input_ids_train.shape, y_train.shape, input_ids_valid.shape, y_valid.shape, input_ids_test.shape, y_test.shape)

# training set
train_data = TensorDataset(torch.LongTensor(input_ids_train),
                           torch.LongTensor(input_masks_train),
                           torch.LongTensor(input_types_train),
                           torch.LongTensor(y_train))
train_sampler = RandomSampler(train_data)
train_loader = DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)

# validation set
valid_data = TensorDataset(torch.LongTensor(input_ids_valid),
                           torch.LongTensor(input_masks_valid),
                           torch.LongTensor(input_types_valid),
                           torch.LongTensor(y_valid))
valid_sampler = SequentialSampler(valid_data)
valid_loader = DataLoader(valid_data, sampler=valid_sampler, batch_size=BATCH_SIZE)

# test set (labels kept aside in y_test)
test_data = TensorDataset(torch.LongTensor(input_ids_test),
                          torch.LongTensor(input_masks_test),
                          torch.LongTensor(input_types_test))
test_sampler = SequentialSampler(test_data)
test_loader = DataLoader(test_data, sampler=test_sampler, batch_size=BATCH_SIZE)
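The boundaries 186959 and 210329 are hard-coded: they are the 80% and 90% marks of a roughly 233,700-line file, which suggests the author trained on a subset of the full 382,688-entry dump. Deriving the cut points from the actual array size keeps the 8:1:1 split correct for any input file; a sketch:

# Derive the 8:1:1 boundaries from the dataset size instead of hard-coding
# 186959/210329 (those are the 80%/90% marks of the author's file).
n = input_ids.shape[0]
cut1, cut2 = int(0.8 * n), int(0.9 * n)
input_ids_train, input_ids_valid, input_ids_test = input_ids[idxes[:cut1]], input_ids[idxes[cut1:cut2]], input_ids[idxes[cut2:]]
# ... repeat for input_masks, input_types and labels as above
print(cut1, cut2 - cut1, n - cut2)  # sizes of train / valid / test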
Define the model:

class Bert_Model(nn.Module):
    def __init__(self, bert_path, classes=15):
        super(Bert_Model, self).__init__()
        self.config = BertConfig.from_pretrained(bert_path)    # load the model hyperparameters
        self.bert = BertModel.from_pretrained(bert_path)       # load the pretrained weights
        # classification head; `classes` must match the number of categories
        # (15 for this dataset)
        self.fc = nn.Linear(self.config.hidden_size, classes)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids, attention_mask, token_type_ids)
        out_pool = outputs[1]  # pooled [CLS] output
        logit = self.fc(out_pool)
        return logit
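A quick sanity check before training is to push a dummy batch through the model and confirm the logit shape; a sketch (it loads the checkpoint from bert_path, so the pretrained files must be in place):

# Push a dummy batch through the model and check the output shape.
check_model = Bert_Model(bert_path, classes=15)
dummy_ids = torch.randint(0, check_model.config.vocab_size, (2, maxlen))
dummy_att = torch.ones(2, maxlen, dtype=torch.long)
dummy_tpe = torch.zeros(2, maxlen, dtype=torch.long)
with torch.no_grad():
    logits = check_model(dummy_ids, dummy_att, dummy_tpe)
print(logits.shape)  # torch.Size([2, 15]) -- batch_size x num_classes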
Optimizer:

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(DEVICE)
model = Bert_Model(bert_path).to(DEVICE)

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=1e-4)  # AdamW optimizer
# warm up over the first epoch, then cosine-decay for the rest of training
scheduler = get_cosine_schedule_with_warmup(optimizer,
                                            num_warmup_steps=len(train_loader),
                                            num_training_steps=EPOCHS * len(train_loader))
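With num_warmup_steps=len(train_loader), the learning rate climbs linearly over the first epoch and then follows a cosine decay toward zero for the remaining steps. The schedule can be inspected in isolation with a throwaway optimizer; a sketch (step counts scaled down for illustration):

# Inspect the warmup + cosine schedule with a throwaway parameter/optimizer
# (100 steps with 10 warmup steps).
probe_opt = AdamW([nn.Parameter(torch.zeros(1))], lr=2e-5)
probe_sched = get_cosine_schedule_with_warmup(probe_opt, num_warmup_steps=10,
                                              num_training_steps=100)
lrs = []
for _ in range(100):
    probe_opt.step()
    probe_sched.step()
    lrs.append(probe_sched.get_last_lr()[0])
print(lrs[0], lrs[9], lrs[-1])  # ~2e-6 (ramping), 2e-5 (peak), ~0 (decayed)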
Training and evaluating the model:

def train_and_eval(model, train_loader, valid_loader, optimizer, scheduler, device, epoch):
    best_acc = 0.0
    criterion = nn.CrossEntropyLoss()
    for i in range(epoch):
        """Train the model"""
        start = time.time()
        model.train()
        print("****** Running training epoch {} ******".format(i + 1))
        train_loss_sum = 0.0
        for idx, (ids, att, tpe, y) in enumerate(train_loader):
            ids, att, tpe, y = ids.to(device), att.to(device), tpe.to(device), y.to(device)
            y_pred = model(ids, att, tpe)
            loss = criterion(y_pred, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()  # advance the learning-rate schedule
            train_loss_sum += loss.item()
            if (idx + 1) % (len(train_loader) // 5) == 0:  # log five times per epoch
                print("Epoch {:04d} | Step {:04d}/{:04d} | Loss {:.4f} | Time {:.4f}".format(
                    i + 1, idx + 1, len(train_loader), train_loss_sum / (idx + 1), time.time() - start))

        """Validate the model"""
        model.eval()
        acc = evaluate(model, valid_loader, device)  # measure validation accuracy
        # keep the best checkpoint
        if acc > best_acc:
            best_acc = acc
            torch.save(model.state_dict(), "best_model.pth")
        print("current acc is {:.4f}, best acc is {:.4f}".format(acc, best_acc))
        print("time costed = {}s \n".format(round(time.time() - start, 5)))
Evaluating model performance:

def evaluate(model, data_loader, device):
    model.eval()
    val_true, val_pred = [], []
    with torch.no_grad():
        for idx, (ids, att, tpe, y) in enumerate(data_loader):
            y_pred = model(ids.to(device), att.to(device), tpe.to(device))
            y_pred = torch.argmax(y_pred, dim=1).detach().cpu().numpy().tolist()
            val_pred.extend(y_pred)
            val_true.extend(y.squeeze().cpu().numpy().tolist())
    return accuracy_score(val_true, val_pred)  # return accuracy
Prediction:

def predict(model, data_loader, device):
    model.eval()
    val_pred = []
    with torch.no_grad():
        for idx, (ids, att, tpe) in tqdm(enumerate(data_loader)):
            y_pred = model(ids.to(device), att.to(device), tpe.to(device))
            y_pred = torch.argmax(y_pred, dim=1).detach().cpu().numpy().tolist()
            val_pred.extend(y_pred)
    return val_pred

# train and evaluate
train_and_eval(model, train_loader, valid_loader, optimizer, scheduler, DEVICE, EPOCHS)

# load the best checkpoint
model.load_state_dict(torch.load("best_model.pth"))
pred_test = predict(model, test_loader, DEVICE)
print("\n Test Accuracy = {} \n".format(accuracy_score(y_test, pred_test)))
print(classification_report(y_test, pred_test, digits=4))
Training log: (screenshot from the original post not preserved in this copy)

Prediction results: (screenshot from the original post not preserved in this copy)
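Once best_model.pth is loaded, single headlines can be classified directly. A minimal sketch; the id-to-name mapping depends entirely on how new_text.txt was generated (the processed sample above shows 0 = entertainment and 1 = sports), so the dictionary below is a placeholder to extend with your own mapping:

# Classify one headline with the fine-tuned model. id2name is a placeholder:
# fill it in according to the label assignment used in preprocessing.
id2name = {0: 'news_entertainment', 1: 'news_sports'}

def predict_one(title):
    enc = tokenizer.encode_plus(text=title, max_length=maxlen,
                                padding='max_length', truncation=True)
    ids = torch.LongTensor([enc['input_ids']]).to(DEVICE)
    att = torch.LongTensor([enc['attention_mask']]).to(DEVICE)
    tpe = torch.LongTensor([enc['token_type_ids']]).to(DEVICE)
    model.eval()
    with torch.no_grad():
        logit = model(ids, att, tpe)
    label = int(torch.argmax(logit, dim=1))
    return id2name.get(label, str(label))

print(predict_one("超級替補!科斯塔本賽季替補出場貢獻7次助攻"))  # expected: news_sports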
Dataset download:
Link: https://pan.baidu.com/s/1JrYI6mEp0DFtDyYxDgHrow
Extraction code: p9yh