CRF for Named Entity Recognition (A Quick Hands-On Implementation)

Published: 2023/12/8

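Preface

I have recently been looking at models for named entity recognition (NER). Our lab happens to have a dataset of classical traditional Chinese medicine (TCM) texts annotated for NER, so I used it to practice building a simple CRF model and wrote the process down along the way. The code can serve as a reference: if you have an annotated dataset on hand, you can use it to train your own CRF model. The experiment uses the sklearn_crfsuite library, a lightweight CRF library that provides evaluation methods in addition to training and prediction. The dataset format is roughly as shown in the figure below:

[Figure: sample of the annotated dataset]

Each line contains one character and its tag, and a blank line separates one sentence from the next. Four symbols are used (B, I, O, S), denoting the first character of an entity, the remaining characters of an entity, a non-entity character, and a single-character entity, respectively. The corpus also contains an E symbol that marks the end of an entity. It did not seem to add much, so to reduce complexity every E symbol is converted to "I".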
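As a rough illustration of this layout and of the E-to-I conversion, here is a minimal sketch. The sample characters, the tag names such as `B-SYM`, and the helpers `normalize_tag` and `parse_corpus` are all hypothetical and are not taken from the real corpus or from the code in this article. The sketch assumes that the position marker (B/I/O/S/E) is the first character of each tag.

```python
def normalize_tag(tag):
    """Map an end-of-entity tag (E...) to the inside tag (I...)."""
    # Only the leading position marker is rewritten, so an "E" that appears
    # later in the tag (for example in a type suffix) is left untouched.
    return "I" + tag[1:] if tag.startswith("E") else tag


def parse_corpus(text):
    """Split 'char tag' lines into sentences; blank lines separate sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        char, tag = line.split()
        current.append((char, normalize_tag(tag)))
    if current:  # the last sentence may not be followed by a blank line
        sentences.append(current)
    return sentences


# Hypothetical sample in the described layout (assumed, not real corpus data).
sample = "神 B-SYM\n昏 E-SYM\n， O\n\n汗 S-SYM\n"
parsed = parse_corpus(sample)
print(parsed)
```

Rewriting only the leading marker is a design choice for this sketch: it stays correct even if an entity type name happens to contain the letter E.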

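Data preprocessing

This part prepares the data for the model that follows, namely the features and the tag sequences. The following functions need to be implemented:

(1) Read all of the corpus data and return it.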

def __init__(self):
    self.file_path = "中醫語料.txt"  # path to the annotated corpus

def read_file(self):
    # Read every line of the corpus file.
    with open(self.file_path, encoding="utf-8") as f:
        return f.readlines()

(2) Put each sentence into its own list, forming a two-dimensional list.

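def pre_process(self):
    lines = self.read_file()
    sentences = []
    sentence = []
    for line in lines:
        line = line.strip()
        if len(line) == 0:
            # A blank line ends the current sentence.
            if sentence:
                sentences.append(sentence)
            sentence = []
        else:
            sentence.append(line)
    # Keep the last sentence if the file does not end with a blank line.
    if sentence:
        sentences.append(sentence)
    return sentences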

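(3) Store each sentence's character sequence and tag sequence in two separate two-dimensional lists. Here word_list is the two-dimensional list returned by pre_process above. The markers "<BOS>" and "<EOS>" are added before and after each sentence.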

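def init(self, word_list):
    # Character sequences, padded with sentence-boundary markers.
    self.word_seq = [
        ['<BOS>'] + [word.split(" ")[0] for word in words] + ['<EOS>']
        for words in word_list
    ]
    # Tag sequences; every "E" is rewritten to "I" (this assumes the tags
    # contain no other "E", for example inside an entity type name).
    self.tag_seq = [
        [word.split(" ")[1].replace("E", "I") for word in words]
        for words in word_list
    ]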

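In this experiment the features come from a simple N-gram model, so a sliding-window function and a feature extraction function are needed.

(4) Implement the sliding-window function, in which every three consecutive characters form one gram. Here word_list is a single sentence taken from the character sequences produced by init above.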

def segment_by_window(self, word_list):
    # Slide a window of width 3 over the sequence, one position at a time.
    return [word_list[begin:begin + 3] for begin in range(len(word_list) - 2)]
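To see why the boundary markers matter, the sketch below pads a short phrase from the example sentence used later in this article and lists its windows. The helper names `pad` and `trigram_windows` are made up for this illustration. After padding, a sentence of n characters yields exactly n windows, each centered on one original character, so the windows line up one-to-one with the tags.

```python
def pad(chars):
    """Add the sentence-boundary markers around a character sequence."""
    return ["<BOS>"] + list(chars) + ["<EOS>"]


def trigram_windows(tokens):
    """Return every run of three consecutive tokens."""
    return [tokens[i:i + 3] for i in range(len(tokens) - 2)]


tokens = pad("小便清長")
windows = trigram_windows(tokens)
for window in windows:
    print(window)

# The middle element of each window is the character being tagged.
centers = [window[1] for window in windows]
```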

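(5) The feature extraction function: take the grams generated from each sentence's character sequence and extract features with a tri-gram model.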

def extract_features(self, word_grams):
    features = []
    for sentence_grams in word_grams:
        features_list = []
        for word_gram in sentence_grams:
            feature = {
                'w-1': word_gram[0],                   # previous character
                'w': word_gram[1],                     # current character
                'w+1': word_gram[2],                   # next character
                'w-1:w': word_gram[0] + word_gram[1],  # left bigram
                'w:w+1': word_gram[1] + word_gram[2],  # right bigram
                'bias': 1.0,
            }
            features_list.append(feature)
        features.append(features_list)
    return features
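The following standalone sketch shows what one sentence's feature list looks like in practice. The function names `window_to_features` and `sentence_to_features` are hypothetical. They mirror the feature keys used above but are not the article's own methods.

```python
def window_to_features(window):
    """Turn one three-character window into a feature dict."""
    prev_char, char, next_char = window
    return {
        "w-1": prev_char,
        "w": char,
        "w+1": next_char,
        "w-1:w": prev_char + char,
        "w:w+1": char + next_char,
        "bias": 1.0,
    }


def sentence_to_features(chars):
    """Pad a sentence and build one feature dict per original character."""
    tokens = ["<BOS>"] + list(chars) + ["<EOS>"]
    return [window_to_features(tokens[i:i + 3]) for i in range(len(chars))]


features = sentence_to_features("舌干")
for feature in features:
    print(feature)
```

Each character gets its own dict, and the boundary markers show up as the left context of the first character and the right context of the last one.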

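(6) Assemble the input data for the CRF model. This function combines the sliding-window function and the feature extraction function to produce the data that is finally fed to the CRF model.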

def generator(self):
    # word_grams is three-dimensional here: sentence -> gram -> character
    word_grams = [self.segment_by_window(word_list) for word_list in self.word_seq]
    features = self.extract_features(word_grams)
    return features, self.tag_seq

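Building the model

Set the initialization parameters of the CRF model: algorithm is the optimization algorithm, c1 is the L1 regularization coefficient, c2 is the L2 regularization coefficient, max_iterations is the maximum number of iterations, and model_path is the path where the model is saved. Then initialize the corpus.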

def __init__(self):
    self.algorithm = "lbfgs"
    self.c1 = 0.1
    self.c2 = 0.2
    self.max_iterations = 100
    self.model_path = "TCM_model.pkl"
    # Load and preprocess the corpus.
    self.corpus = init_process()
    self.corpus_text = self.corpus.pre_process()
    self.corpus.init(self.corpus_text)
    self.model = None

def init_model(self):
    self.model = sklearn_crfsuite.CRF(
        algorithm=self.algorithm,
        c1=float(self.c1),
        c2=float(self.c2),
        max_iterations=self.max_iterations,
        all_possible_transitions=True,
    )
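For intuition about c1 and c2: training with L-BFGS is usually described as minimizing the negative conditional log-likelihood of the training sequences plus L1 and L2 penalties on the feature weights. The formula below is a sketch of that objective. The symbols w (weight vector), N (number of training sentences), and x^{(n)}, y^{(n)} (the n-th feature sequence and tag sequence) are notation introduced here, and the exact scaling of the penalty terms is an assumption that may differ between implementations.

```latex
\min_{w} \; -\sum_{n=1}^{N} \log p\left(y^{(n)} \mid x^{(n)}; w\right)
  + c_1 \lVert w \rVert_1 + c_2 \lVert w \rVert_2^2
```

Under this reading, a larger c1 drives more weights to exactly zero and gives a sparser model, while a larger c2 shrinks all weights toward zero.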

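Training the model

After the model and corpus are initialized, split the data into a training set and a test set (the first 1000 sentences are held out for testing), generate the input data, and train the model. The metrics module is used to evaluate the model.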

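def train(self):
    self.init_model()
    x, y = self.corpus.generator()
    # Hold out the first 1000 sentences for testing; train on the rest.
    x_train, y_train = x[1000:], y[1000:]
    x_test, y_test = x[:1000], y[:1000]
    self.model.fit(x_train, y_train)
    # Evaluate on entity labels only, leaving out the "O" label.
    labels = list(self.model.classes_)
    labels.remove("O")
    y_predict = self.model.predict(x_test)
    print(metrics.flat_f1_score(y_test, y_predict, average="weighted", labels=labels))
    # Sort labels so that tags of the same entity type are listed together.
    sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))
    print(metrics.flat_classification_report(y_test, y_predict, labels=sorted_labels, digits=3))
    self.save_model()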
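The "flat" metrics work at the token level: the per-sentence tag sequences are flattened into one long list before scores are computed. The sketch below is a simplified pure-Python stand-in, written only to show what precision, recall, F1, and support mean per label. It is not the library's implementation, and `flat_label_scores` and the toy tag sequences are made up for this example.

```python
from collections import Counter


def flat_label_scores(y_true, y_pred):
    """Per-label (precision, recall, f1, support) over flattened sequences."""
    true_flat = [tag for seq in y_true for tag in seq]
    pred_flat = [tag for seq in y_pred for tag in seq]
    correct, predicted, support = Counter(), Counter(), Counter()
    for true_tag, pred_tag in zip(true_flat, pred_flat):
        support[true_tag] += 1
        predicted[pred_tag] += 1
        if true_tag == pred_tag:
            correct[true_tag] += 1
    scores = {}
    for label in sorted(support):
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        recall = correct[label] / support[label]
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1, support[label])
    return scores


# Toy data (assumed): one "I" tag is mispredicted as "O".
y_true = [["B", "I", "O"], ["S", "O"]]
y_pred = [["B", "O", "O"], ["S", "O"]]
scores = flat_label_scores(y_true, y_pred)
for label, values in scores.items():
    print(label, values)
```

Because "O" usually dominates the corpus, leaving it out of the label list, as train does, keeps it from inflating the averaged score.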

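Prediction

Once the model is trained it can be used for prediction, but the raw output of the prediction function is not very readable, so some post-processing is needed.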

def predict(self, sentence):
    self.load_model()
    # Use the same boundary markers as in training ('<BOS>' / '<EOS>').
    word_lists = [['<BOS>'] + [word for word in sentence] + ['<EOS>']]
    word_gram = [self.corpus.segment_by_window(word_list) for word_list in word_lists]
    print(word_lists)
    features = self.corpus.extract_features(word_gram)
    y_predict = self.model.predict(features)
    print(y_predict)
    # Collect the characters tagged as entities; insert a space where the
    # last two characters of neighbouring tags differ.
    entity = ""
    for index in range(len(y_predict[0])):
        if y_predict[0][index] != 'O':
            entity += sentence[index]
            if index < len(y_predict[0]) - 1 and y_predict[0][index][-2:] != y_predict[0][index + 1][-2:]:
                entity += " "
    return entity
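An alternative way to post-process the output is to decode the tag sequence into a list of entity strings instead of one space-separated string. The sketch below assumes that every tag starts with its position marker (B, I, O, or S) and ignores any entity type suffix. `decode_entities` is a hypothetical helper, and the tags in the example are made up rather than real model output.

```python
def decode_entities(chars, tags):
    """Group characters into entity strings based on B/I/O/S markers."""
    entities, current = [], ""
    for char, tag in zip(chars, tags):
        marker = tag[0]
        if marker == "B":
            # A new entity starts; close any entity that is still open.
            if current:
                entities.append(current)
            current = char
        elif marker == "I":
            # Continue the open entity (an "I" without a "B" starts one).
            current += char
        else:
            # "O" or "S" closes any open entity.
            if current:
                entities.append(current)
                current = ""
            if marker == "S":
                entities.append(char)  # single-character entity
    if current:
        entities.append(current)
    return entities


# Made-up tags for a fragment of the example sentence.
example = decode_entities("胸腹滿痛，舌干", ["B", "I", "I", "I", "O", "B", "I"])
print(example)
```

Returning a list keeps entity boundaries explicit, which is easier to consume downstream than splitting a string on spaces.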

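Results

The training results are shown in the figure below. For most labels, both precision and recall are reasonably good; support is the number of times each label occurs.

[Figure: classification report from training]

The model is then used to run named entity recognition on a new sentence: “服藥五日，漸變神昏譫語，胸腹滿痛，舌干不飲水，小便清長。” The recognition result is as follows:

[Figure: named entity recognition output for the sentence]

Full source code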

import sklearn_crfsuite
from sklearn_crfsuite import metrics
import joblib


class init_process(object):
    def __init__(self):
        self.file_path = "中醫語料.txt"  # path to the annotated corpus

    def read_file(self):
        with open(self.file_path, encoding="utf-8") as f:
            return f.readlines()

    def pre_process(self):
        # Group the "char tag" lines into sentences; blank lines separate them.
        lines = self.read_file()
        sentences = []
        sentence = []
        for line in lines:
            line = line.strip()
            if len(line) == 0:
                if sentence:
                    sentences.append(sentence)
                sentence = []
            else:
                sentence.append(line)
        # Keep the last sentence if the file does not end with a blank line.
        if sentence:
            sentences.append(sentence)
        return sentences

    def extract_features(self, word_grams):
        # One feature dict per gram, one list of dicts per sentence.
        features = []
        for sentence_grams in word_grams:
            features_list = []
            for word_gram in sentence_grams:
                feature = {
                    'w-1': word_gram[0],
                    'w': word_gram[1],
                    'w+1': word_gram[2],
                    'w-1:w': word_gram[0] + word_gram[1],
                    'w:w+1': word_gram[1] + word_gram[2],
                    'bias': 1.0,
                }
                features_list.append(feature)
            features.append(features_list)
        return features

    def segment_by_window(self, word_list):
        # Slide a window of width 3 over the sequence, one position at a time.
        return [word_list[begin:begin + 3] for begin in range(len(word_list) - 2)]

    def init(self, word_list):
        self.word_seq = [
            ['<BOS>'] + [word.split(" ")[0] for word in words] + ['<EOS>']
            for words in word_list
        ]
        # Every "E" is rewritten to "I" (assumes the tags contain no other "E").
        self.tag_seq = [
            [word.split(" ")[1].replace("E", "I") for word in words]
            for words in word_list
        ]

    def generator(self):
        word_grams = [self.segment_by_window(word) for word in self.word_seq]
        features = self.extract_features(word_grams)
        return features, self.tag_seq


class ner(object):
    def __init__(self):
        self.algorithm = "lbfgs"
        self.c1 = 0.1
        self.c2 = 0.2
        self.max_iterations = 100
        self.model_path = "TCM_model.pkl"
        self.corpus = init_process()
        self.corpus_text = self.corpus.pre_process()
        self.corpus.init(self.corpus_text)
        self.model = None

    def init_model(self):
        self.model = sklearn_crfsuite.CRF(
            algorithm=self.algorithm,
            c1=float(self.c1),
            c2=float(self.c2),
            max_iterations=self.max_iterations,
            all_possible_transitions=True,
        )

    def train(self):
        self.init_model()
        x, y = self.corpus.generator()
        # Hold out the first 1000 sentences for testing; train on the rest.
        x_train, y_train = x[1000:], y[1000:]
        x_test, y_test = x[:1000], y[:1000]
        self.model.fit(x_train, y_train)
        labels = list(self.model.classes_)
        labels.remove("O")
        y_predict = self.model.predict(x_test)
        print(metrics.flat_f1_score(y_test, y_predict, average="weighted", labels=labels))
        sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))
        print(metrics.flat_classification_report(y_test, y_predict, labels=sorted_labels, digits=3))
        self.save_model()

    def predict(self, sentence):
        self.load_model()
        # Use the same boundary markers as in training ('<BOS>' / '<EOS>').
        word_lists = [['<BOS>'] + [word for word in sentence] + ['<EOS>']]
        word_gram = [self.corpus.segment_by_window(word_list) for word_list in word_lists]
        print(word_lists)
        features = self.corpus.extract_features(word_gram)
        y_predict = self.model.predict(features)
        print(y_predict)
        entity = ""
        for index in range(len(y_predict[0])):
            if y_predict[0][index] != 'O':
                entity += sentence[index]
                if index < len(y_predict[0]) - 1 and y_predict[0][index][-2:] != y_predict[0][index + 1][-2:]:
                    entity += " "
        return entity

    def save_model(self):
        joblib.dump(self.model, self.model_path)

    def load_model(self):
        self.model = joblib.load(self.model_path)


if __name__ == "__main__":
    NER = ner()
    NER.train()
    print(NER.predict("服藥五日，漸變神昏譫語，胸腹滿痛，舌干不飲水，小便清長。"))
