
Training an NER Tagger from Scratch

Published: 2024/1/8

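Contents

- Training an NER Tagger from Scratch
- I. Defining a custom model
    - 1. Importing the required packages and modules
    - 2. Loading the training samples
- II. Training the model
    - 1. Updating an existing model
    - 2. Creating the built-in pipeline component
    - 3. Adding the labels from the training data
    - 4. Building the model
    - 5. Saving the model
- III. Testing the model
- References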

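Training an NER Tagger from Scratch

NER stands for named entity recognition. Like part-of-speech tagging, it is one of the foundational techniques of natural language processing. NER identifies the names of real-world objects, such as France, Donald Trump, or WeChat. Among these, France is a country and is labeled GPE (geopolitical entity), Donald Trump is labeled PER (person; written PERSON in the training data below), and WeChat is a company and is therefore labeled ORG (organization).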

Entity types commonly found in spaCy's models include PERSON, ORG, GPE, and LOC.

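Uses of NER tagging:
1) Most obviously, the model can identify the entities of interest in a text.
2) Relations between entities can be inferred. For example, in "Rome is the capital of Italy", entity recognition makes it possible to determine that Rome is a city in Italy rather than an R&B artist. This task is called named entity disambiguation (NED).

NED is used, for example, in medical research to resolve ambiguous terms when identifying genes and gene products, and in writing style analysis.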

Next, we train an NER tagger with spaCy.

Note: the code in this article is written for spaCy 3.0.

I. Defining a custom model

1. Importing the required packages and modules

# Standard library
import random
from pathlib import Path

# spaCy 3: Example wraps a Doc together with its gold-standard annotations
import spacy
from spacy.training import Example

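2. Loading the training samples

Entity offsets are character indices that start at 0. In the first sample, 7 is the index of the first character of the entity and 17 is the index of its last character plus 1, following Python's slicing convention.

# Training data: (text, {'entities': [(start, end, label), ...]})
TRAIN_DATA = [
    ('Who is Shaka Khan?', {'entities': [(7, 17, 'PERSON')]}),
    ('I like London and Berlin.', {'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}),
]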
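Counting character offsets by hand is error-prone. As a minimal sketch, the hypothetical helper find_entity below (not part of spaCy or of the original code) derives the (start, end, label) tuple from the entity's surface string with Python's str.find. It assumes that the entity occurs in the text and that its first occurrence is the intended one.

```python
def find_entity(text, surface, label):
    """Return (start, end, label) for the first occurrence of `surface` in `text`.

    Hypothetical helper for illustration; `end` is exclusive, as in Python slicing.
    """
    start = text.find(surface)
    if start == -1:
        raise ValueError(f"{surface!r} not found in {text!r}")
    return (start, start + len(surface), label)


text = "Who is Shaka Khan?"
span = find_entity(text, "Shaka Khan", "PERSON")
print(span)                   # (7, 17, 'PERSON')
print(text[span[0]:span[1]])  # Shaka Khan
```

Slicing the text with the returned offsets gives back the entity string, which is a quick way to check annotations before training.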

Although the example contains only a few training samples, they are representative.

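II. Training the model

1. Updating an existing model

# Assumed setting (not defined in the original): name or path of an existing pipeline, or None
model = None

if model is not None:
    nlp = spacy.load(model)  # load the existing model
    print("Loaded model '%s'" % model)
else:
    nlp = spacy.blank('en')  # create a blank English model
    print("Created blank 'en' model")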

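2. Creating the built-in pipeline component

Use the add_pipe method to add the NER component to the pipeline.

if 'ner' not in nlp.pipe_names:
    # In spaCy 3, add_pipe takes the component name and returns the component it created
    ner = nlp.add_pipe('ner', last=True)
else:
    ner = nlp.get_pipe('ner')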

3. Adding the labels from the training data

for _, annotations in TRAIN_DATA:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])  # ent is (start, end, label)
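The loop calls ner.add_label once per annotation, so a label that occurs several times is passed more than once. To preview which distinct labels the training data contributes, they can be collected in plain Python first. This sketch restates TRAIN_DATA from above; the name labels is introduced here only for illustration.

```python
TRAIN_DATA = [
    ('Who is Shaka Khan?', {'entities': [(7, 17, 'PERSON')]}),
    ('I like London and Berlin.', {'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}),
]

# Collect each distinct label once; sorting makes the order deterministic
labels = sorted({ent[2] for _, annotations in TRAIN_DATA
                 for ent in annotations.get('entities', [])})
print(labels)  # ['LOC', 'PERSON']
```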

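4. Building the model

The training process itself is simple: the nlp.update() method abstracts away all the details, and spaCy handles the actual machine learning and training.

# Assumed setting (not defined in the original): number of training iterations
n_iter = 100

# Disable all other pipeline components so that only the NER tagger is trained/updated
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.select_pipes(disable=other_pipes):
    # Initialize the weights of a blank model; keep the existing weights of a loaded one
    optimizer = nlp.initialize() if model is None else nlp.resume_training()
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)  # shuffle the training data on every iteration
        losses = {}  # accumulates the loss reported by each component
        for text, annotations in TRAIN_DATA:
            # Convert the text and its annotations into the Example object spaCy 3 expects
            example = Example.from_dict(nlp.make_doc(text), annotations)
            print("example:", example)
            nlp.update(
                [example],      # batch of examples (here a batch of one)
                drop=0.5,       # dropout rate
                sgd=optimizer,  # optimizer that updates the weights
                losses=losses)
        print(losses)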
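The loop above updates the model with one example at a time. Since nlp.update() takes a list of examples, several can be grouped per call. The sketch below shows only the grouping step in plain Python; make_batches, the placeholder strings, and the batch size of 2 are assumptions for illustration, not part of the original code.

```python
import random


def make_batches(items, size):
    """Yield consecutive lists of at most `size` items (hypothetical helper)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


data = ["example 1", "example 2", "example 3", "example 4", "example 5"]
random.shuffle(data)  # shuffle before batching, as the training loop does
for batch in make_batches(data, 2):
    # In the training loop, each batch would be a list of Example objects
    # passed to nlp.update in place of the single-item list
    print(batch)
```

spaCy also ships spacy.util.minibatch for the same purpose.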

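5. Saving the model

# Assumed setting (not defined in the original); matches the path used in the test below
output_dir = "./model/"

if output_dir is not None:
    output_dir = Path(output_dir)
    if not output_dir.exists():
        output_dir.mkdir()
    nlp.to_disk(output_dir)  # write the trained pipeline to disk
    print("Saved model to", output_dir)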

III. Testing the model

import spacy


def load_model_test(path, text):
    nlp = spacy.load(path)  # load the saved pipeline from disk
    print("Loading from", path)
    doc = nlp(text)
    for i in doc.ents:
        print(i.text, i.label_)  # entity text and its predicted label


if __name__ == "__main__":
    path = "./model/"
    text = "Who is Shaka Khan"
    load_model_test(path, text)

The model's output is as follows:

Loading from ./model/
Shaka Khan PERSON

As shown, Shaka Khan is labeled PERSON, i.e., a person's name.
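A single sentence shows that the pipeline runs, but not how accurate it is. As a rough sketch, predicted entities can be compared with hand-labeled ones as sets of (text, label) pairs. The function precision_recall and the sample values below are illustrative assumptions, not results from the model above.

```python
def precision_recall(predicted, gold):
    """Compare two collections of (text, label) pairs by exact match."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Made-up values for illustration only
gold = [("London", "LOC"), ("Berlin", "LOC")]
predicted = [("London", "LOC"), ("Berlin", "PERSON")]
print(precision_recall(predicted, gold))  # (0.5, 0.5)
```

In practice, the predicted pairs could be built from doc.ents as [(ent.text, ent.label_) for ent in doc.ents].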

References

[France] Bhargav Srinivasa-Desikan. Natural Language Processing and Computational Linguistics. Posts & Telecom Press.
