Knowledge Graph Study Notes: Named Entity Recognition

Published: 2025/4/5 66 豆豆

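I. A Brief Overview of Named Entity Recognition

Named entity recognition (NER) labels each noun in a sentence with a type; in other words, it classifies the nouns in the sentence.

I am going to Beijing (location) today (time) for an interview.

Zhang San (person) was born in Shanghai (location) and, after graduating from Tsinghua University (organization), went to work at Baidu (organization).

Uses of named entity recognition: 1. building knowledge graphs; 2. chatbots.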

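Example: a chatbot

Bot: Hello sir, how can I help you?

Customer: Could you look up flights from Beijing (location) to Harbin (location) for tomorrow at noon (time)?

        | named entity recognition
        v

{
time: tomorrow at noon (11:00 - 13:00),
departure: Beijing,
arrival: Harbin
}

----> searchForticket(from, to, time) (slot filling) --> database --> reply (fill the slots of a response template)

Bot: There is currently one flight that meets your requirements: XXX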
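The slot-filling flow above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the notes' actual implementation: the `FLIGHTS` list, the `search_for_ticket` helper, the reply template, and the flight records are all hypothetical, and the slots are written by hand where a real system would obtain them from the NER step.

```python
# Hypothetical in-memory "database" of flights, for illustration only.
FLIGHTS = [
    {"from": "Beijing", "to": "Harbin", "time": "12:10", "flight": "XXX"},
    {"from": "Beijing", "to": "Shanghai", "time": "12:30", "flight": "YYY"},
]


def search_for_ticket(origin, destination, time_range):
    """Return the flights that match the filled slots (hypothetical helper)."""
    start, end = time_range
    # Zero-padded "HH:MM" strings compare correctly as plain strings.
    return [
        f for f in FLIGHTS
        if f["from"] == origin and f["to"] == destination and start <= f["time"] <= end
    ]


def reply(slots):
    """Query the database with the slots and fill a response template."""
    matches = search_for_ticket(slots["from"], slots["to"], slots["time"])
    if not matches:
        return "Sorry, no flight meets your requirements."
    names = ", ".join(f["flight"] for f in matches)
    return f"There are {len(matches)} flight(s) that meet your requirements: {names}"


# Slots as an NER step might produce them for the customer's request.
slots = {"time": ("11:00", "13:00"), "from": "Beijing", "to": "Harbin"}
print(reply(slots))
```

Keeping the entity extraction, the database query, and the template separate means each stage can be replaced independently, for example by swapping the hand-written slots for the output of a trained tagger.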

II. Evaluation Metrics for Named Entity Recognition

Contingency Table

                correct    not correct
selected        tp         fp
not selected    fn         tn

Precision: precision = tp / (tp + fp)

Recall: recall = tp / (tp + fn)

F1: F1 = 2PR / (P + R)
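As a small worked example of the three formulas, the sketch below computes them from raw counts. The function name `precision_recall_f1` and the counts (8 true positives, 2 false positives, 4 false negatives) are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from contingency-table counts."""
    # Guard against empty denominators so the function never divides by zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Illustrative counts: 8 entities found correctly, 2 spurious, 4 missed.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Note that tn does not appear in any of the three formulas, which is why these metrics suit NER, where the vast majority of tokens are not entities.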

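III. Rule-Based Named Entity Recognition

1. Regular expressions. 2. A known dictionary: build the dictionary, then match against it, either exactly or fuzzily (using rules, similarity measures, and context information).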
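A minimal sketch of both rule-based approaches, assuming a toy dictionary and a single regular expression for clock times. The `ENTITY_DICT` entries, the label names, and the `rule_based_ner` function are illustrative assumptions; only exact dictionary matching is shown, not fuzzy matching.

```python
import re

# Toy dictionary of known entities; the entries are illustrative assumptions.
ENTITY_DICT = {
    "Beijing": "LOC",
    "Harbin": "LOC",
    "Tsinghua University": "ORG",
    "Baidu": "ORG",
}

# A simple pattern for clock times such as 11:00.
TIME_PATTERN = re.compile(r"\b\d{1,2}:\d{2}\b")


def rule_based_ner(text):
    """Return (surface form, label, start offset) triples, ordered by position."""
    found = []
    # 1. Regular-expression rule.
    for match in TIME_PATTERN.finditer(text):
        found.append((match.group(), "TIME", match.start()))
    # 2. Exact dictionary matching.
    for name, label in ENTITY_DICT.items():
        for match in re.finditer(re.escape(name), text):
            found.append((name, label, match.start()))
    return sorted(found, key=lambda item: item[2])


print(rule_based_ner("Fly from Beijing to Harbin at 11:00"))
```

Rules like these are precise on what they cover but miss anything outside the dictionary or pattern, which is the motivation for the classifier-based approach in the next section.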

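IV. Named Entity Recognition as a Classification Problem

Corpus

-> Word segmentation (each noun is then to be assigned the correct entity type)

-> Feature extraction (feature 1, feature 2, ..., feature D)

-> Classifier f(feature 1, feature 2, ..., feature D)

Example sentence: 張三 出生于北京，北京大學畢業后去華為 任職 (Zhang San was born in Beijing and, after graduating from Peking University, went to work at Huawei).

Each feature tuple X lists, in order: the current word, the previous word, the next word, the previous bigram, the next bigram, the tag of the current word, the tag of the previous word, and the tag of the next word. "--" marks a missing neighbor at the start of the sentence. NA means the word is not a named entity.

X (features)                                                              Y (label)

(張三, --, 出生, --, 張三出生, person, --, verb)                            (person)

(于, 出生, 北京, 出生于, 于北京, preposition, verb, location)               (NA)

(北京, 于, 北京大學, 于北京, 北京北京大學, location, preposition, organization)   (location)

(北京大學, 北京, 畢業, 北京北京大學, 北京大學畢業, organization, location, verb)  (organization)

(畢業, 北京大學, 后, 北京大學畢業, 畢業后, verb, organization, localizer)     (NA)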
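The feature tuples above can be generated mechanically from a segmented, tagged sentence. The sketch below is one way to do it; the function name `window_features`, the `--` padding symbol, and the English tag names are assumptions made for this example.

```python
def window_features(tokens, tags, i, pad="--"):
    """Build the 8-tuple for position i: the word, its two neighbors,
    the two bigrams around it, and the three corresponding tags."""
    has_prev = i > 0
    has_next = i < len(tokens) - 1
    word = tokens[i]
    prev_word = tokens[i - 1] if has_prev else pad
    next_word = tokens[i + 1] if has_next else pad
    prev_bigram = prev_word + word if has_prev else pad
    next_bigram = word + next_word if has_next else pad
    prev_tag = tags[i - 1] if has_prev else pad
    next_tag = tags[i + 1] if has_next else pad
    return (word, prev_word, next_word, prev_bigram, next_bigram,
            tags[i], prev_tag, next_tag)


# The first part of the example sentence, already segmented and tagged.
tokens = ["張三", "出生", "于", "北京", "北京大學", "畢業", "后"]
tags = ["person", "verb", "preposition", "location", "organization", "verb", "localizer"]

for i in range(len(tokens)):
    print(window_features(tokens, tags, i))
```

Each tuple becomes one training row for the classifier, with the entity label of the current word as the target.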

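Classifiers for named entity recognition:

1. Sequence-independent models:

Logistic regression, support vector machines, maximum entropy models, ...

2. Sequence-aware models:

Conditional random fields, ...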

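V. Feature Extraction for NER Classifiers

The professor Colin proposed a model for NER in 1990

1) Word-based features: unigram, bigram, trigram, ...

- Current word: Colin

- Previous/next word: professor, proposed

- Words two positions before/after: The, a

2) Stemming-based features

proposed -> propose

3) Part of speech

Current word: noun

Previous/next word: noun, verb

Two positions before/after: definite article, indefinite article

4) Prefix/suffix

Prefix/suffix (of "professor"): pro/ssor

5) Properties of the current word

Length of the word

Whether it contains uppercase letters

How many digits it contains

How many uppercase characters it contains

Whether it contains digits

Whether it contains symbols

Whether it has a specific suffix

6) Syntactic parsing / dependency parsing

Text -> feature extraction -> 1. design the features; 2. convert them to vector form; 3. feature selection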
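Several of the features listed above need nothing beyond the standard library. The sketch below computes them for a word of the example sentence; the function name `word_features`, the feature keys, and the `<BOS>`/`<EOS>` padding symbols are assumptions for illustration. Stemming, part-of-speech, and parse features are left out because they require an external NLP library.

```python
import string


def word_features(tokens, i):
    """Return a dictionary of surface features for the word at position i."""
    word = tokens[i]
    return {
        "word": word,
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
        "prefix3": word[:3],
        "suffix4": word[-4:],
        "length": len(word),
        "has_upper": any(c.isupper() for c in word),
        "n_upper": sum(c.isupper() for c in word),
        "has_digit": any(c.isdigit() for c in word),
        "n_digits": sum(c.isdigit() for c in word),
        "has_symbol": any(c in string.punctuation for c in word),
        "ends_with_ed": word.endswith("ed"),
    }


tokens = "The professor Colin proposed a model for NER in 1990".split()
print(word_features(tokens, 2))  # features for "Colin"
```

A dictionary of named features like this still has to be converted to vector form before most classifiers can use it, which is step 2 of the pipeline above.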

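VI. Variable Types

1) Categorical variable (the values have no numeric magnitude): use one-hot encoding

- Word-related features: Apple, professor

- Primary school, high school, university

- Male, female

2) Real variable (numeric): normalize it, or leave it as-is

- Height, weight

- Temperature

3) Ordinal variable: treat it like a numeric value, or as a categorical variable

- Ratings: 1 star, 2 stars, ...

- 10, 9, 8, 7, ...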
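The three treatments above can be sketched in a few lines of plain Python. The helper names `one_hot` and `min_max`, the vocabulary, and the height values are made up for illustration; a real pipeline would more likely use a library encoder.

```python
def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector with a single 1."""
    return [1 if value == c else 0 for c in categories]


def min_max(values):
    """Rescale numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all values are equal
    return [(v - lo) / span for v in values]


# 1) Categorical: a word feature has no numeric magnitude, so one-hot encode it.
vocabulary = ["Apple", "professor", "model"]
print(one_hot("professor", vocabulary))

# 2) Real-valued: normalize (or leave unchanged).
heights_cm = [150.0, 175.0, 200.0]
print(min_max(heights_cm))

# 3) Ordinal: either keep the order as a number ...
stars = {"1 star": 1, "2 stars": 2, "3 stars": 3}
print(stars["2 stars"])
# ... or ignore the order and treat the value as categorical.
print(one_hot("2 stars", list(stars)))
```

One-hot encoding avoids implying an order between categories, at the cost of one dimension per distinct value.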

VII. Building the Classifier:

Method 1: statistics from historical data

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import classification_report


    class MajorityVotingTagger(BaseEstimator, TransformerMixin):

        def fit(self, X, y):
            """
            X: list of words
            y: list of tags
            """
            word2cnt = {}
            self.tags = []
            for x, t in zip(X, y):
                if t not in self.tags:
                    self.tags.append(t)
                if x in word2cnt:
                    if t in word2cnt[x]:
                        word2cnt[x][t] += 1
                    else:
                        word2cnt[x][t] = 1
                else:
                    word2cnt[x] = {t: 1}
            # For each word, remember the tag it was labeled with most often.
            self.mjvote = {}
            for k, d in word2cnt.items():
                self.mjvote[k] = max(d, key=d.get)
            return self

        def predict(self, X, y=None):
            """X is the user input: a list of words. Unseen words get the tag 'O'."""
            return [self.mjvote.get(x, 'O') for x in X]


    # `data` is assumed to be a pandas DataFrame loaded beforehand,
    # with one row per word and the columns 'Word', 'POS' and 'Tag'.
    words = data['Word'].values.tolist()
    tags = data['Tag'].values.tolist()

    pred = cross_val_predict(estimator=MajorityVotingTagger(), X=words, y=tags, cv=5)
    report = classification_report(y_pred=pred, y_true=tags)
    print(report)

Method 2: a random forest

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import classification_report


    def get_feature(word):
        # Simple surface features of a single word.
        # (The original list ends with "....", so further features were elided.)
        return np.array([word.istitle(), word.islower(), word.isupper(),
                         len(word), word.isdigit(), word.isalpha()])


    words = [get_feature(w) for w in data['Word'].values.tolist()]
    tags = data['Tag'].values.tolist()

    pred = cross_val_predict(RandomForestClassifier(n_estimators=20), X=words, y=tags)
    report = classification_report(y_pred=pred, y_true=tags)
    print(report)


    def get_sentences(data):
        # Collect the (word, POS, tag) triples of each sentence into one list.
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        sentence_grouped = data.groupby("sentence").apply(agg_func)
        return [s for s in sentence_grouped]


    sentences = get_sentences(data)

VIII. Features for Each Word

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import classification_report
    from sklearn.preprocessing import LabelEncoder

    out = []
    y = []

    mv_tagger = MajorityVotingTagger()
    tag_encoder = LabelEncoder()
    pos_encoder = LabelEncoder()

    words = data["Word"].values.tolist()
    tags = data["Tag"].values.tolist()
    pos = data["POS"].values.tolist()

    mv_tagger.fit(words, tags)
    tag_encoder.fit(tags)
    pos_encoder.fit(pos)

    for sentence in sentences:
        for i in range(len(sentence)):
            # w: word, p: part of speech, t: NER tag
            w, p, t = sentence[i][0], sentence[i][1], sentence[i][2]

            if i < len(sentence) - 1:
                # Not the last word of the sentence: use features of the next word.
                mem_tag_r = tag_encoder.transform(mv_tagger.predict([sentence[i+1][0]]))[0]
                true_pos_r = pos_encoder.transform([sentence[i+1][1]])[0]
            else:
                # Padding values; this assumes 'O' and '.' occur in the training data.
                mem_tag_r = tag_encoder.transform(['O'])[0]
                true_pos_r = pos_encoder.transform(['.'])[0]

            if i > 0:
                # Not the first word of the sentence: use features of the previous word.
                mem_tag_1 = tag_encoder.transform(mv_tagger.predict([sentence[i-1][0]]))[0]
                true_pos_1 = pos_encoder.transform([sentence[i-1][1]])[0]
            else:
                mem_tag_1 = tag_encoder.transform(['O'])[0]
                true_pos_1 = pos_encoder.transform(['.'])[0]

            # Combine the features
            out.append(np.array([w.istitle(), w.islower(), w.isupper(), len(w),
                                 w.isdigit(), w.isalpha(),
                                 tag_encoder.transform(mv_tagger.predict([w]))[0],
                                 pos_encoder.transform([p])[0],
                                 mem_tag_r, true_pos_r, mem_tag_1, true_pos_1]))
            # Label
            y.append(t)

    pred = cross_val_predict(RandomForestClassifier(n_estimators=20), X=out, y=y, cv=5)
    report = classification_report(y_pred=pred, y_true=y)
    print(report)

IX. Named Entity Recognition with a CRF

    from sklearn.model_selection import cross_val_predict
    from sklearn_crfsuite import CRF
    from sklearn_crfsuite.metrics import flat_classification_report


    def word2features(sentence, i):
        """
        sentence: input sentence, a list of (word, postag, label) tuples
        i: index of the word
        """
        word = sentence[i][0]
        postag = sentence[i][1]

        features = {
            'bias': 1.0,
            'word.lower()': word.lower(),
            'word[-3:]': word[-3:],
            'word[-2:]': word[-2:],
            'word.isupper()': word.isupper(),
            'word.istitle()': word.istitle(),
            'word.isdigit()': word.isdigit(),
            'postag': postag,
            'postag[:2]': postag[:2],
        }
        if i > 0:
            # Features of the previous word
            word1 = sentence[i-1][0]
            postag1 = sentence[i-1][1]
            features.update({
                '-1:word.lower()': word1.lower(),
                '-1:word.istitle()': word1.istitle(),
                '-1:word.isupper()': word1.isupper(),
                '-1:postag': postag1,
                '-1:postag[:2]': postag1[:2],
            })
        else:
            features['BOS'] = True  # beginning of sentence

        if i < len(sentence) - 1:
            # Features of the next word
            word1 = sentence[i+1][0]
            postag1 = sentence[i+1][1]
            features.update({
                '+1:word.lower()': word1.lower(),
                '+1:word.istitle()': word1.istitle(),
                '+1:word.isupper()': word1.isupper(),
                '+1:postag': postag1,
                '+1:postag[:2]': postag1[:2],
            })
        else:
            features['EOS'] = True  # end of sentence

        return features


    def sentence2features(sentence):
        return [word2features(sentence, i) for i in range(len(sentence))]


    def sentence2labels(sentence):
        return [label for token, postag, label in sentence]


    X = [sentence2features(s) for s in sentences]
    y = [sentence2labels(s) for s in sentences]

    crf = CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100)

    pred = cross_val_predict(estimator=crf, X=X, y=y, cv=5)
    report = flat_classification_report(y_pred=pred, y_true=y)
    print(report)
