[Machine Learning + NER] A Step-by-Step Guide to Building an NER System with a CRF Model (CCL2021)

Published: 2023/12/31

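Dataset source: the intelligent dialogue diagnosis and treatment evaluation task of the 2021 China National Conference on Computational Linguistics (CCL2021).

Task: build an NER system with a machine-learning CRF model and obtain the evaluation metrics (precision, recall, and F1-score) analyzed in Section 4.

For preprocessing the raw data, see: Data Preprocessing for Medical Named Entity Recognition (handling .json files).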

Contents

      • [Machine Learning + NER] A Step-by-Step Guide to Building an NER System with a CRF Model (CCL2021)
        • 1. Environment Setup
        • 2. Data Processing
        • 3. Training the Model
        • 4. Results Analysis
        • 5. Complete Source Code

1. Environment Setup

The CRF model in this post is built with the sklearn_crfsuite library, which provides the CRF implementation.

Install the library with:

pip install sklearn_crfsuite
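Before running the rest of the tutorial, it can help to confirm that the installation succeeded. The sketch below uses only the Python standard library; the helper name `is_installed` is an illustrative assumption and is not part of sklearn_crfsuite.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("sklearn_crfsuite"):
        print("sklearn_crfsuite is available")
    else:
        print("sklearn_crfsuite is missing; run: pip install sklearn_crfsuite")
```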

2. Data Processing

  • The dataset has 11 labels: "O", "B-Symptom", "I-Symptom", "B-Drug_Category", "I-Drug_Category", "B-Drug", "I-Drug", "B-Medical_Examination", "I-Medical_Examination", "B-Operation", and "I-Operation". Assign each label an index:

      # CCL2021 labels
      label2idx = {'O': 0,
                   'B-Symptom': 1, 'I-Symptom': 2,
                   'B-Drug_Category': 3, 'I-Drug_Category': 4,
                   'B-Drug': 5, 'I-Drug': 6,
                   'B-Medical_Examination': 7, 'I-Medical_Examination': 8,
                   'B-Operation': 9, 'I-Operation': 10}
  • Invert the mapping so that each index maps back to its label, and store the result in idx2label:

      idx2label = {idx: label for label, idx in label2idx.items()}
  • Read the character vocabulary file and prepend the special tokens:

      with open(char_vocab_path, "r", encoding="utf8") as fo:
          char_vocabs = [line.strip() for line in fo]
      char_vocabs = special_words + char_vocabs
  • Map characters to index numbers (and back) so characters can be looked up later:

      idx2vocab = {idx: char for idx, char in enumerate(char_vocabs)}
      vocab2idx = {char: idx for idx, char in idx2vocab.items()}
  • Read the training corpus, using the first 70% of the raw data as the training set and the remaining 30% as the test set. The function returns the data and the labels:

      # Read the training corpus
      def read_corpus(corpus_path, vocab2idx, label2idx, flag):
          datas, labels = [], []
          with open(corpus_path, encoding='utf-8') as fr:
              lines = fr.readlines()
          sent_, tag_ = [], []
          # NOTE: the 70/30 cut is made on raw lines, so a sentence that
          # straddles the cut is broken up (its head is dropped, its tail
          # lands in the test set).
          if flag == "train":
              lines = lines[:int(len(lines) * 0.7)]
          else:
              lines = lines[int(len(lines) * 0.7):]
          for line in lines:
              if line != '\n':
                  [char, label] = line.strip().split()
                  sent_.append(char)
                  tag_.append(label)
              else:
                  # A blank line ends the sentence; a final sentence is only
                  # kept if the file ends with a blank line.
                  sent_ids = [vocab2idx[char] if char in vocab2idx else vocab2idx['<UNK>'] for char in sent_]
                  tag_ids = [label2idx[label] if label in label2idx else 0 for label in tag_]
                  datas.append(sent_ids)
                  labels.append(tag_ids)
                  sent_, tag_ = [], []
          return datas, labels

      # Load the training set (70%)
      train_datas, train_labels = read_corpus(train_data_path, vocab2idx, label2idx, flag="train")
      # Load the test set (30%)
      test_datas, test_labels = read_corpus(train_data_path, vocab2idx, label2idx, flag="test")
  • Run a quick check that the data and labels line up:

      print(train_datas[8])
      print([idx2vocab[idx] for idx in train_datas[8]])
      print(train_labels[8])
      print([idx2label[idx] for idx in train_labels[8]])

      # Output:
      # [1578, 5558, 2641, 5795, 2644, 3078, 2644, 939, 893, 3844, 3575, 946, 6821]
      # ['宝', '贝', '最', '近', '有', '没', '有', '呕', '吐', '症', '状', '呢', '?']
      # [0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0]
      # ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Symptom', 'I-Symptom', 'O', 'O', 'O', 'O']
      # Line 1: the character indices
      # Line 2: the characters ("Has the baby had any vomiting recently?")
      # Line 3: the label indices
      # Line 4: the labels
  • Convert the data and the labels back from indices into the format passed to the CRF model (lists of characters and lists of label strings):

      # Training data and training labels
      labels = []
      datas = []
      for i in range(len(train_labels)):
          datas.append([idx2vocab[idx] for idx in train_datas[i]])
      train_datas = datas
      # print(train_datas)

      for i in range(len(train_labels)):
          labels.append([idx2label[idx] for idx in train_labels[i]])
      train_labels = labels
      # print(train_labels)

      # Test data and test labels
      labels = []
      datas = []
      for i in range(len(test_labels)):
          datas.append([idx2vocab[idx] for idx in test_datas[i]])
      test_datas = datas

      for i in range(len(test_labels)):
          labels.append([idx2label[idx] for idx in test_labels[i]])
      test_labels = labels
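Because the split above is made on raw lines, the 70% cut can fall in the middle of a sentence. One alternative is to group the lines into sentences first and then cut at a sentence boundary. The following is a minimal sketch of that idea, not the author's method: the function names `read_sentences` and `split_sentences` are hypothetical, and the sketch works directly on characters and label strings, skipping the index round trip (and therefore the `<UNK>` mapping) used in the post.

```python
def read_sentences(lines):
    """Group 'char label' lines into (chars, tags) sentences.

    Sentences are separated by blank lines; a final sentence without a
    trailing blank line is still kept.
    """
    sentences, chars, tags = [], [], []
    for line in lines:
        line = line.strip()
        if line:
            char, tag = line.split()
            chars.append(char)
            tags.append(tag)
        elif chars:
            sentences.append((chars, tags))
            chars, tags = [], []
    if chars:  # flush the last sentence
        sentences.append((chars, tags))
    return sentences


def split_sentences(sentences, train_ratio=0.7):
    """Split at a sentence boundary: the first part trains, the rest tests."""
    cut = int(len(sentences) * train_ratio)
    return sentences[:cut], sentences[cut:]


if __name__ == "__main__":
    # Tiny in-memory sample in the same "char label" format as the corpus.
    sample = ["呕 B-Symptom\n", "吐 I-Symptom\n", "\n", "好 O\n", "的 O\n"]
    train, test = split_sentences(read_sentences(sample), train_ratio=0.5)
    print(train)
    print(test)
```

With a real corpus, the `lines` argument would be the result of `fr.readlines()` on the training file.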
3. Training the Model

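  • Create the CRF model with sklearn_crfsuite; L-BFGS is the default training algorithm:

      crf = sklearn_crfsuite.CRF(
          algorithm='lbfgs',
          c1=0.1,                         # L1 regularization coefficient
          c2=0.1,                         # L2 regularization coefficient
          max_iterations=100,
          all_possible_transitions=True
      )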
  • Train the model on the training data and training labels:

      crf.fit(train_datas, train_labels)
  • Predict labels for the test data:

      test_pred = crf.predict(test_datas)
  • Collect the labels the model has seen and sort them so that the B- and I- tags of the same entity type sit next to each other:

      labels = list(crf.classes_)
      # labels.remove('O')
      sorted_labels = sorted(
          labels,
          key=lambda name: (name[1:], name[0])
      )
  • classification_report expects flat sequences, so flatten the test labels and the predicted labels into one-dimensional lists:

      # Flatten into one-dimensional lists
      label = []
      pred = []
      for i in range(len(test_labels)):
          for j in range(len(test_labels[i])):
              label.append(test_labels[i][j])
      test_labels = label

      for i in range(len(test_pred)):
          for j in range(len(test_pred[i])):
              pred.append(test_pred[i][j])
      test_pred = pred
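  • Print the classification report. sorted_labels is passed as labels so that the rows appear in that order with the correct names; passed as target_names alone, it would only rename the rows of the default alphabetical order and mislabel them:

      print(classification_report(
          test_labels, test_pred, labels=sorted_labels
      ))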
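The steps above pass each sentence to the CRF as a plain list of characters, so the information available at each position is presumably little more than the character itself. A common pattern with sklearn_crfsuite is to describe every token with a dict of features, including its neighbours. Below is a minimal sketch of that pattern for character-level input. The function names `char2features` and `sent2features` and the particular features are assumptions chosen for illustration, not the author's method, and their effect on this dataset is untested.

```python
def char2features(sent, i):
    """Build a feature dict for the i-th character of a sentence."""
    char = sent[i]
    features = {
        "bias": 1.0,
        "char": char,
        "char.isdigit": char.isdigit(),
    }
    if i > 0:
        # Context to the left: previous character and the bigram it forms.
        features["-1:char"] = sent[i - 1]
        features["-1:bigram"] = sent[i - 1] + char
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        # Context to the right: next character and the bigram it forms.
        features["+1:char"] = sent[i + 1]
        features["+1:bigram"] = char + sent[i + 1]
    else:
        features["EOS"] = True  # end of sentence
    return features


def sent2features(sent):
    """Convert a list of characters into a list of feature dicts."""
    return [char2features(sent, i) for i in range(len(sent))]


if __name__ == "__main__":
    for feats in sent2features(["呕", "吐", "吗"]):
        print(feats)
```

With this in place, training would become `crf.fit([sent2features(s) for s in train_datas], train_labels)`, and the same transformation would be applied to `test_datas` before calling `crf.predict`.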
4. Results Analysis

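The evaluation gives a macro-average precision of 0.90 and a weighted-average precision of 0.97. Averaged equally over the labels, about 90% of the tokens the model assigns to a label really carry that label.

For recall, the macro average is 0.89 and the weighted average is 0.97. Averaged equally over the labels, the model recovers about 89% of the tokens of each label; weighting each label by its frequency raises the figure to 97%, which means the most frequent labels are recognized best.

For F1-score, the macro average is 0.89 and the weighted average is 0.97, so precision and recall are both high and well balanced. Note that the O label is included in the report (the labels.remove('O') line is commented out), so these are token-level figures that also count non-entity characters.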
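The gap between the macro averages (0.89 to 0.90) and the weighted averages (0.97) is easier to interpret with the two definitions side by side: the macro average treats every label equally, while the weighted average weights each label by its support (the number of true tokens with that label). A very frequent label such as O can therefore pull the weighted average up. The sketch below illustrates this with made-up numbers; the scores and supports are hypothetical and are not taken from this post's results.

```python
def macro_average(scores):
    """Unweighted mean of the per-label scores."""
    return sum(scores.values()) / len(scores)


def weighted_average(scores, support):
    """Mean of the per-label scores, weighted by each label's support."""
    total = sum(support.values())
    return sum(scores[label] * support[label] for label in scores) / total


if __name__ == "__main__":
    # Hypothetical per-label F1 scores and supports (NOT the post's results).
    f1 = {"O": 0.99, "B-Symptom": 0.90, "I-Symptom": 0.88, "B-Drug": 0.80}
    support = {"O": 9000, "B-Symptom": 400, "I-Symptom": 500, "B-Drug": 100}
    print(round(macro_average(f1), 4))
    print(round(weighted_average(f1, support), 4))
```

In this toy example the macro average is about 0.89 while the weighted average is about 0.98, because 90% of the tokens belong to the high-scoring O label.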

5. Complete Source Code

    import sklearn_crfsuite
    from sklearn.metrics import classification_report

    char_vocab_path = "./data/char_vocabs.txt"  # vocabulary file
    train_data_path = "./data/train_data.txt"   # training and test data
    special_words = ['<PAD>', '<UNK>']          # special tokens

    # CCL2021 labels
    label2idx = {'O': 0,
                 'B-Symptom': 1, 'I-Symptom': 2,
                 'B-Drug_Category': 3, 'I-Drug_Category': 4,
                 'B-Drug': 5, 'I-Drug': 6,
                 'B-Medical_Examination': 7, 'I-Medical_Examination': 8,
                 'B-Operation': 9, 'I-Operation': 10}

    # Map indices back to BIO labels
    idx2label = {idx: label for label, idx in label2idx.items()}
    # print(idx2label)

    # Read the character vocabulary file
    with open(char_vocab_path, "r", encoding="utf8") as fo:
        char_vocabs = [line.strip() for line in fo]
    char_vocabs = special_words + char_vocabs
    # print(char_vocabs)

    # Map characters to index numbers and back
    idx2vocab = {idx: char for idx, char in enumerate(char_vocabs)}
    vocab2idx = {char: idx for idx, char in idx2vocab.items()}
    # print(idx2vocab)


    # Read the training corpus
    def read_corpus(corpus_path, vocab2idx, label2idx, flag):
        datas, labels = [], []
        with open(corpus_path, encoding='utf-8') as fr:
            lines = fr.readlines()
        sent_, tag_ = [], []
        # NOTE: the 70/30 cut is made on raw lines, not on sentences.
        if flag == "train":
            lines = lines[:int(len(lines) * 0.7)]
        else:
            lines = lines[int(len(lines) * 0.7):]
        for line in lines:
            if line != '\n':
                [char, label] = line.strip().split()
                sent_.append(char)
                tag_.append(label)
            else:
                # A blank line ends the current sentence
                sent_ids = [vocab2idx[char] if char in vocab2idx else vocab2idx['<UNK>'] for char in sent_]
                tag_ids = [label2idx[label] if label in label2idx else 0 for label in tag_]
                datas.append(sent_ids)
                labels.append(tag_ids)
                sent_, tag_ = [], []
        return datas, labels


    # Load the training set (70%)
    train_datas, train_labels = read_corpus(train_data_path, vocab2idx, label2idx, flag="train")
    # Load the test set (30%)
    test_datas, test_labels = read_corpus(train_data_path, vocab2idx, label2idx, flag="test")

    # Check that data and labels line up (sentence 8 contains "呕吐", vomiting)
    # print(train_datas[8])
    # print([idx2vocab[idx] for idx in train_datas[8]])
    # print(train_labels[8])
    # print([idx2label[idx] for idx in train_labels[8]])

    # Training data and training labels
    labels = []
    datas = []
    for i in range(len(train_labels)):
        datas.append([idx2vocab[idx] for idx in train_datas[i]])
    train_datas = datas
    # print(train_datas)

    for i in range(len(train_labels)):
        labels.append([idx2label[idx] for idx in train_labels[i]])
    train_labels = labels
    # print(train_labels)

    # Test data and test labels
    labels = []
    datas = []
    for i in range(len(test_labels)):
        datas.append([idx2vocab[idx] for idx in test_datas[i]])
    test_datas = datas

    for i in range(len(test_labels)):
        labels.append([idx2label[idx] for idx in test_labels[i]])
    test_labels = labels

    # Train
    crf = sklearn_crfsuite.CRF(
        algorithm='lbfgs',
        c1=0.1,
        c2=0.1,
        max_iterations=100,
        all_possible_transitions=True
    )
    crf.fit(train_datas, train_labels)

    labels = list(crf.classes_)
    # labels.remove('O')

    # Predict
    test_pred = crf.predict(test_datas)

    sorted_labels = sorted(
        labels,
        key=lambda name: (name[1:], name[0])
    )

    # Flatten into one-dimensional lists
    label = []
    pred = []
    for i in range(len(test_labels)):
        for j in range(len(test_labels[i])):
            label.append(test_labels[i][j])
    test_labels = label

    for i in range(len(test_pred)):
        for j in range(len(test_pred[i])):
            pred.append(test_pred[i][j])
    test_pred = pred

    # Pass sorted_labels as `labels` (not `target_names`) so that row names
    # and row order stay aligned.
    print(classification_report(
        test_labels, test_pred, labels=sorted_labels
    ))
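One detail of the reporting step is worth checking by hand. In scikit-learn's `classification_report`, `target_names` only supplies display names; when `labels` is not given, the rows follow the sorted order of the labels found in the data. The entity-first sort used for `sorted_labels` produces a different order, which is why `sorted_labels` belongs in the `labels` argument rather than in `target_names`. The sketch below uses only the standard library to compare the two orders for this post's label set; the helper name `entity_order` is an illustrative assumption.

```python
LABELS = [
    "O",
    "B-Symptom", "I-Symptom",
    "B-Drug_Category", "I-Drug_Category",
    "B-Drug", "I-Drug",
    "B-Medical_Examination", "I-Medical_Examination",
    "B-Operation", "I-Operation",
]


def entity_order(labels):
    """Sort by entity type first, then by B/I prefix (the post's sort key)."""
    return sorted(labels, key=lambda name: (name[1:], name[0]))


if __name__ == "__main__":
    default_order = sorted(LABELS)       # row order when `labels` is omitted
    custom_order = entity_order(LABELS)  # order of sorted_labels in the post
    for row, name in zip(default_order, custom_order):
        marker = "" if row == name else "  <- mismatch"
        print(f"{row:24s} would be shown as {name}{marker}")
```

Every line marked as a mismatch is a row whose metrics would be printed under another label's name if the custom order were supplied through `target_names`.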
