
【NLP_Named Entity Recognition】Albert+BiLSTM+CRF Model Training, Evaluation, and Usage

發(fā)布時(shí)間:2023/12/31 编程问答 39 豆豆
生活随笔 收集整理的這篇文章主要介紹了 【NLP_命名实体识别】Albert+BiLSTM+CRF模型训练、评估与使用 小編覺得挺不錯(cuò)的,現(xiàn)在分享給大家,幫大家做個(gè)參考.

Model Training

2021/3/10: Starting from the already-trained Bert/Albert-CRF model, I added a BiLSTM layer on top of it to obtain the modified Albert-BiLSTM-CRF model (see the next article), and started training.

Modification approach: take the existing Albert+CRF model code as the base, consult Albert+BiLSTM+CRF implementations online, and make small changes. The main thing to watch is the type of data passed between the three components, e.g. the embeddings produced by the Albert model are fed into the BiLSTM (reference: ALBERT+BiLSTM+CRF实现序列标注 - 光彩照人 - 博客园).
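A minimal sketch of that hand-off, using the same bert4keras/keras names and layer sizes as the full script below (illustrative only):

# sketch of the three-stage hand-off: Albert embeddings -> BiLSTM -> CRF
albert = build_transformer_model(config_path, checkpoint_path, model='albert')    # 1) Albert encodes tokens
lstm_out = Bidirectional(LSTM(units=128, return_sequences=True))(albert.output)   # 2) BiLSTM reads the embedding sequence
logits = Dense(num_labels)(lstm_out)                          # per-token tag scores
CRF = ConditionalRandomField(lr_multiplier=crf_lr_multiplier)
model = Model(albert.input, CRF(logits))                      # 3) CRF models tag transitions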

調(diào)試過程:其間,多次用到命令行,安裝需要的庫、工具包,按部就班去做即可。

import numpy as np
from bert4keras.backend import keras, K
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer
from bert4keras.optimizers import Adam
from bert4keras.snippets import sequence_padding, DataGenerator
from bert4keras.snippets import open, ViterbiDecoder
from bert4keras.layers import ConditionalRandomField
from keras.layers import Dense, LSTM, Bidirectional, Dropout, TimeDistributed
from keras.models import Model
from tqdm import tqdm
from tensorflow import ConfigProto
from tensorflow import InteractiveSession

config = ConfigProto()
# config.gpu_options.per_process_gpu_memory_fraction = 0.2
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

maxlen = 256
epochs = 1  # 10
batch_size = 16
bert_layers = 12
learing_rate = 1e-5  # the smaller bert_layers is, the larger the learning rate should be
crf_lr_multiplier = 10  # enlarge the CRF layer's learning rate when necessary  # 1000

# # bert config
# config_path = './bert_model/chinese_L-12_H-768_A-12/bert_config.json'
# checkpoint_path = './bert_model/chinese_L-12_H-768_A-12/bert_model.ckpt'
# dict_path = './bert_model/chinese_L-12_H-768_A-12/vocab.txt'

# albert config
config_path = './bert_model/albert_large/albert_config.json'
checkpoint_path = './bert_model/albert_large/model.ckpt-best'
dict_path = './bert_model/albert_large/vocab_chinese.txt'


def load_data(filename):
    D = []
    with open(filename, encoding='utf-8') as f:
        f = f.read()
    for l in f.split('\n\n'):
        if not l:
            continue
        d, last_flag = [], ''
        for c in l.split('\n'):
            char, this_flag = c.split(' ')
            if this_flag == 'O' and last_flag == 'O':
                d[-1][0] += char
            elif this_flag == 'O' and last_flag != 'O':
                d.append([char, 'O'])
            elif this_flag[:1] == 'B':
                d.append([char, this_flag[2:]])
            else:
                d[-1][0] += char
            last_flag = this_flag
        D.append(d)
    return D


# labeled data
train_data = load_data('./data/example.train')
valid_data = load_data('./data/example.dev')
test_data = load_data('./data/example.test')

# build the tokenizer
tokenizer = Tokenizer(dict_path, do_lower_case=True)

# label mapping
labels = ['PER', 'LOC', 'ORG']
id2label = dict(enumerate(labels))
label2id = {j: i for i, j in id2label.items()}
num_labels = len(labels) * 2 + 1


class data_generator(DataGenerator):
    """Data generator"""
    def __iter__(self, random=False):
        batch_token_ids, batch_segment_ids, batch_labels = [], [], []
        for is_end, item in self.sample(random):
            token_ids, labels = [tokenizer._token_start_id], [0]
            for w, l in item:
                w_token_ids = tokenizer.encode(w)[0][1:-1]
                if len(token_ids) + len(w_token_ids) < maxlen:
                    token_ids += w_token_ids
                    if l == 'O':
                        labels += [0] * len(w_token_ids)
                    else:
                        B = label2id[l] * 2 + 1
                        I = label2id[l] * 2 + 2
                        labels += ([B] + [I] * (len(w_token_ids) - 1))
                else:
                    break
            token_ids += [tokenizer._token_end_id]
            labels += [0]
            segment_ids = [0] * len(token_ids)
            batch_token_ids.append(token_ids)
            batch_segment_ids.append(segment_ids)
            batch_labels.append(labels)
            if len(batch_token_ids) == self.batch_size or is_end:
                batch_token_ids = sequence_padding(batch_token_ids)
                batch_segment_ids = sequence_padding(batch_segment_ids)
                batch_labels = sequence_padding(batch_labels)
                yield [batch_token_ids, batch_segment_ids], batch_labels
                batch_token_ids, batch_segment_ids, batch_labels = [], [], []


# The original example builds a bert-type model here; since we use albert,
# the first few lines are changed as follows:
model = build_transformer_model(
    config_path,
    checkpoint_path,
    model='albert',
)
output_layer = 'Transformer-FeedForward-Norm'
albert_output = model.get_layer(output_layer).get_output_at(bert_layers - 1)

lstm = Bidirectional(LSTM(units=128, return_sequences=True), name="bi_lstm")(albert_output)
drop = Dropout(0.1, name="dropout")(lstm)
dense = TimeDistributed(Dense(num_labels, activation="softmax"), name="time_distributed")(drop)

output = Dense(num_labels)(dense)
CRF = ConditionalRandomField(lr_multiplier=crf_lr_multiplier)
output = CRF(output)

model = Model(model.input, output)
model.summary()

model.compile(
    loss=CRF.sparse_loss,
    optimizer=Adam(learing_rate),
    metrics=[CRF.sparse_accuracy]
)


class NamedEntityRecognizer(ViterbiDecoder):
    """Named entity recognizer"""
    def recognize(self, text):
        tokens = tokenizer.tokenize(text)
        while len(tokens) > 512:
            tokens.pop(-2)
        mapping = tokenizer.rematch(text, tokens)
        token_ids = tokenizer.tokens_to_ids(tokens)
        segment_ids = [0] * len(token_ids)
        nodes = model.predict([[token_ids], [segment_ids]])[0]
        labels = self.decode(nodes)
        entities, starting = [], False
        for i, label in enumerate(labels):
            if label > 0:
                if label % 2 == 1:  # odd id: start of an entity
                    starting = True
                    entities.append([[i], id2label[(label - 1) // 2]])
                elif starting:      # even id: continuation of the current entity
                    entities[-1][0].append(i)
                else:
                    starting = False
            else:
                starting = False
        return [(text[mapping[w[0]][0]:mapping[w[-1]][-1] + 1], l)
                for w, l in entities]


NER = NamedEntityRecognizer(trans=K.eval(CRF.trans), starts=[0], ends=[0])


def evaluate(data):
    """Evaluation function"""
    X, Y, Z = 1e-10, 1e-10, 1e-10
    for d in tqdm(data):
        text = ''.join([i[0] for i in d])
        R = set(NER.recognize(text))
        T = set([tuple(i) for i in d if i[1] != 'O'])
        X += len(R & T)
        Y += len(R)
        Z += len(T)
    f1, precision, recall = 2 * X / (Y + Z), X / Y, X / Z
    return f1, precision, recall


class Evaluate(keras.callbacks.Callback):
    def __init__(self):
        self.best_val_f1 = 0

    def on_epoch_end(self, epoch, logs=None):
        trans = K.eval(CRF.trans)
        NER.trans = trans
        print(NER.trans)
        f1, precision, recall = evaluate(valid_data)
        # save the best weights
        if f1 >= self.best_val_f1:
            self.best_val_f1 = f1
            model.save_weights('best_model.weights')
        print(
            'valid:  f1: %.5f, precision: %.5f, recall: %.5f, best f1: %.5f\n' %
            (f1, precision, recall, self.best_val_f1)
        )
        f1, precision, recall = evaluate(test_data)
        print(
            'test:  f1: %.5f, precision: %.5f, recall: %.5f\n' %
            (f1, precision, recall)
        )


if __name__ == '__main__':
    evaluator = Evaluate()
    train_generator = data_generator(train_data, batch_size)
    model.fit_generator(
        train_generator.forfit(),
        steps_per_epoch=len(train_generator),
        epochs=epochs,
        callbacks=[evaluator]
    )
else:
    model.load_weights('best_model.weights')
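For reference, the BIO-tag-to-id scheme implied by num_labels = len(labels) * 2 + 1 and the B/I arithmetic in data_generator works out as follows (a worked example, not part of the script):

# labels = ['PER', 'LOC', 'ORG']  ->  label2id = {'PER': 0, 'LOC': 1, 'ORG': 2}
# id 0 is 'O'; each label l gets B = label2id[l] * 2 + 1 and I = label2id[l] * 2 + 2:
#   B-PER = 1, I-PER = 2, B-LOC = 3, I-LOC = 4, B-ORG = 5, I-ORG = 6   (7 ids in total)
# odd ids open an entity and even ids continue one, which is why
# NamedEntityRecognizer.recognize checks "label % 2 == 1" and maps an id
# back to its label with id2label[(label - 1) // 2]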

Model Evaluation

2021/3/11: This morning I checked the Albert+BiLSTM+CRF run and found its precision was low, only around 0.8, whereas the Albert+CRF model stays above 0.95 on the same data. → Thinking about the cause, I tried adjusting the code: ① tuned the LSTM-related parameters (dropout), and even removed dropout entirely, with no improvement; ② tried removing the dropout and dense layers together. What is dropout for? Preventing overfitting, though in my view its use depends on the scenario (reference: 为什么模型加入dropout层后变得更差了?). What is the final dense layer for? I think of it as a classification output layer, and since the model already has a CRF layer handling the output transformation, the dense layer may be unnecessary (reference: LSTM模型后增加Dense(全连接)层的目的是什么?). → After removing the last two lines of the code below, the Albert+BiLSTM+CRF model's precision was back above 0.95. As for the underlying theory, it remains to be studied.

lstm = Bidirectional(LSTM(units=128, return_sequences=True), name="bi_lstm")(albert_output)
# drop = Dropout(0.2, name="dropout")(lstm)
# dense = TimeDistributed(Dense(num_labels, activation="softmax"), name="time_distributed")(drop)
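With those two lines gone, the downstream output line must take lstm directly; the resulting head, assembled from the full script above, looks like this (a sketch):

lstm = Bidirectional(LSTM(units=128, return_sequences=True), name="bi_lstm")(albert_output)
output = Dense(num_labels)(lstm)   # project straight to per-token tag scores
CRF = ConditionalRandomField(lr_multiplier=crf_lr_multiplier)
output = CRF(output)               # the CRF handles the output transitions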

讀寫文件

2021/3/12: I spent a full two hours this morning on something as simple as reading and writing files in Python. Cause: the error 'open' object has no attribute 'readlines' kept coming up. Approach: I created a fresh .py file and did the read/write there, and it worked. Yet the very same statements failed inside the Albert+BiLSTM+CRF model's .py file. → So the statements themselves were fine; some variable or function name in the model file had to be clashing with them. → Indeed: at the top of that file there is "from bert4keras.snippets import open, ViterbiDecoder", and that "open" is not the built-in "open".

model.load_weights('best_model.weights')
NER = NamedEntityRecognizer(trans=K.eval(CRF.trans), starts=[0], ends=[0])

# note: "open" below must be the built-in open, not bert4keras.snippets.open
r = open(r"D:\Asian elephant\gao A_geography_NER\A_geography_NER\data\result.txt", 'w')
with open(r"D:\Asian elephant\gao A_geography_NER\A_geography_NER\data\t.txt", 'r', encoding='utf-8') as tt:
    content = tt.readlines()
for line in content:
    ner = NER.recognize(line)
    print(ner, file=r)
r.close()
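One way to sidestep the shadowing without removing the bert4keras import is to call the built-in open explicitly through the builtins module (a sketch, using the same path as above):

import builtins

# builtins.open is Python's own open, untouched by the bert4keras import
with builtins.open(r"D:\Asian elephant\gao A_geography_NER\A_geography_NER\data\t.txt",
                   'r', encoding='utf-8') as tt:
    content = tt.readlines()  # the built-in file object does have readlines()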

Model Training

2021/3/14: Trained the model (3 epochs, CRF-layer learning-rate multiplier set to 1000, other parameters set as below).

訓(xùn)練數(shù)據(jù):現(xiàn)有標(biāo)注數(shù)據(jù)集+自己標(biāo)注的數(shù)據(jù);測(cè)試數(shù)據(jù):自己標(biāo)注的數(shù)據(jù);驗(yàn)證數(shù)據(jù):自己標(biāo)注的數(shù)據(jù)。

Time cost: on CPU only, one epoch takes roughly 7 hours.

結(jié)果LOW:epoch 1 →1304/1304:loss: 3.9929 - sparse_accuracy: 0.9648,test: ?f1: 0.13333, precision: 0.41176, recall: 0.07955,valid: ?f1: 0.15493, precision: 0.64706, recall: 0.08800, best f1: 0.15493

epoch 2→1304/1304:loss: 0.5454 - sparse_accuracy: 0.9849,test: ?f1: 0.25455, precision: 0.63636, recall: 0.15909,valid: ?f1: 0.18919, precision: 0.60870, recall: 0.11200, best f1: 0.18919

epoch 3→test與valid的precision達(dá)0.7以上

maxlen = 256              # maximum text length to keep
epochs = 3                # number of epochs
batch_size = 16           # number of samples fed to the model each time during training
bert_layers = 12
learing_rate = 1e-5       # the smaller bert_layers is, the larger the learning rate should be
crf_lr_multiplier = 1000  # enlarge the CRF layer's learning rate when necessary

Various Bugs and Their Fixes

ValueError: substring not found


2021/3/5: Problem: the three-epoch model had finished training, but feeding all of the data into it raised the error above. Fix: solving it didn't really require understanding the root cause, just comparison: after several more runs, a pattern emerged in which data lines triggered the error. I first assumed punctuation was the problem, but after checking, it turned out to be letters (alphabetic characters).
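A quick way to "try a few more times and find the pattern" is to run the recognizer line by line and record where it throws (a sketch; the file name is hypothetical, and the built-in-open caveat from the file-reading section above applies):

# locate the offending lines instead of failing on the whole file
import builtins

with builtins.open('./data/all_text.txt', encoding='utf-8') as f:  # hypothetical input file
    for lineno, line in enumerate(f, start=1):
        try:
            NER.recognize(line.strip())
        except ValueError as e:            # "substring not found" surfaces here
            print(lineno, repr(line), e)   # compare the failing lines for a pattern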

ValueError: not enough values to unpack (expected 2, got 1)

2021/3/13: Problem: with every setting identical and only the data changed, the precision results were worlds apart. Training on the labeled data that ships with the Albert+BiLSTM+CRF code, with my own small labeled set used for test and validation, gave good results; but mixing my own small labeled set into the training data gave very poor results. Fix: again, find the difference. How did my labeled data differ from the original data? Answer: whether a '\n' is present; this "inconspicuous" '\n' matters a great deal (see below).

2021/3/15: Added more data I labeled myself, and the program errored again. Cause: similar to the 2021/3/13 error, still a formatting problem in the data (an extra or missing character, space, or newline), but subtler this time: the two newline characters at the end of the file were missing, and those two newlines are essential (see for l in f.split('\n\n'): in the code below, which splits sentences on double newlines). Why it took so long: still comparing correct data against my failing data, ① I assumed it was a whitespace problem (the cause last time) and fixated on spaces; ② the supposed "correct data" I compared against was not the original, genuinely correct data, which delayed the fix.
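Both of these format errors can be caught up front with a small check that mirrors the assumptions of load_data, shown below: every non-empty line must split into exactly two space-separated fields, and the file must end with a double newline (a sketch; the built-in-open caveat applies):

# pre-flight check for the "char SPACE tag" format that load_data expects
import builtins

def check_ner_file(filename):
    with builtins.open(filename, encoding='utf-8') as f:
        text = f.read()
    for lineno, line in enumerate(text.split('\n'), start=1):
        if line and len(line.split(' ')) != 2:   # would raise "not enough values to unpack"
            print('bad line %d: %r' % (lineno, line))
    if not text.endswith('\n\n'):                # load_data splits sentences on '\n\n'
        print('warning: file does not end with two newline characters')

check_ner_file('./data/example.train')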

def load_data(filename):  # load labeled data: training, test, and validation sets
    D = []
    with open(filename, encoding='utf-8') as f:  # open the file
        f = f.read()                             # read its full contents
    for l in f.split('\n\n'):  # split sentences on double newlines
        if not l:              # empty block
            continue           # skip to the next iteration (break would exit the whole loop)
        d, last_flag = [], ''
        for c in l.split('\n'):        # one "char SPACE tag" pair per line
            char, this_flag = c.split(' ')
            if this_flag == 'O' and last_flag == 'O':
                d[-1][0] += char       # extend the current non-entity span
            elif this_flag == 'O' and last_flag != 'O':
                d.append([char, 'O'])  # start a new non-entity span
            elif this_flag[:1] == 'B':           # tag starts with 'B': a new entity begins
                d.append([char, this_flag[2:]])  # label is everything from index 2 on; e.g. every char of 梁子老寨 is non-O, giving ('梁子老寨', 'LOC')
            else:
                d[-1][0] += char       # 'I' tag: extend the current entity
            last_flag = this_flag
        D.append(d)
    return D  # result format: [('良子', 'LOC'), ('勐乃通达', 'LOC'), ('梁子老寨', 'LOC'), ('黑山', 'LOC'), ('黑山', 'LOC'), ('勐乃通达', 'LOC')]
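As a worked example of the format (hypothetical input; the entity names follow the comment above):

# a tiny sentence block in the expected "char SPACE tag" format:
#   梁 B-LOC
#   子 I-LOC
#   老 I-LOC
#   寨 I-LOC
#   在 O
#   黑 B-LOC
#   山 I-LOC
# running load_data over it yields one sentence entry:
#   [['梁子老寨', 'LOC'], ['在', 'O'], ['黑山', 'LOC']]
# consecutive B-/I- characters merge into one entity span, and runs of O
# characters merge into one non-entity span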
