Enhancing Elasticsearch Tokenization with HanLP

Published: 2023/12/20

hanlp-ext plugin source code: http://git.oschina.net/hualongdata/hanlp-ext or https://github.com/hualongdata/hanlp-ext

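By default, Elasticsearch tokenizes Chinese text character by character, which falls well short of what word-based search requires. There is an official SmartCN Chinese analysis plugin, and the IK analysis plugin is also widely used. Here, however, we use HanLP, a natural language processing toolkit, for Chinese word segmentation.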

Elasticsearch

Elasticsearch's default tokenization of Chinese text is painfully poor.

GET /_analyze?pretty
{
  "text": ["重慶華龍網海數科技有限公司"]
}

Output:

{
  "tokens": [
    { "token": "重", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "慶", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "華", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "龍", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "網", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 },
    { "token": "海", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5 },
    { "token": "數", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6 },
    { "token": "科", "start_offset": 7, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 7 },
    { "token": "技", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 8 },
    { "token": "有", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 9 },
    { "token": "限", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 10 },
    { "token": "公", "start_offset": 11, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 11 },
    { "token": "司", "start_offset": 12, "end_offset": 13, "type": "<IDEOGRAPHIC>", "position": 12 }
  ]
}

As you can see, the default behavior is to split the text into single characters.
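To make the response above concrete, here is a small Python sketch that mimics the same token layout for a string made up only of CJK ideographs. It is an illustration, not Elasticsearch code, and it does not contact a cluster. The function name `per_character_tokens` is hypothetical, and the sketch assumes the input contains no Latin text, digits, or punctuation, which a real analyzer would treat differently.

```python
def per_character_tokens(text):
    """Mimic the token layout shown in the response for pure CJK text.

    Illustration only: each character becomes its own token, with
    consecutive offsets and positions. Handling of Latin words, digits,
    and punctuation is deliberately not modeled.
    """
    return [
        {
            "token": ch,
            "start_offset": i,
            "end_offset": i + 1,
            "type": "<IDEOGRAPHIC>",
            "position": i,
        }
        for i, ch in enumerate(text)
    ]


if __name__ == "__main__":
    tokens = per_character_tokens("重慶華龍網海數科技有限公司")
    print(len(tokens))
    print([t["token"] for t in tokens])
```

With tokens like these, the index knows nothing about words such as 有限公司; it only knows the individual characters, which is the limitation the rest of this article addresses.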

elasticsearch-hanlp

HanLP

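HanLP is an excellent natural language processing toolkit implemented in Java. It offers the following features:

Chinese word segmentation
Part-of-speech tagging
Named entity recognition
Keyword extraction
Automatic summarization
Phrase extraction
Pinyin conversion
Simplified/Traditional Chinese conversion
Text recommendation
Dependency parsing
Corpus tools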

After installing the elasticsearch-hanlp plugin (installation instructions: https://github.com/hualongdata/hanlp-ext/tree/master/es-plugin), let's look at the tokenization result again.

GET /_analyze?pretty
{
  "analyzer": "hanlp",
  "text": ["重慶華龍網海數科技有限公司"]
}

Output:

{
  "tokens": [
    { "token": "重慶", "start_offset": 0, "end_offset": 2, "type": "ns", "position": 0 },
    { "token": "華龍網", "start_offset": 2, "end_offset": 5, "type": "nr", "position": 1 },
    { "token": "海數", "start_offset": 5, "end_offset": 7, "type": "nr", "position": 2 },
    { "token": "科技", "start_offset": 7, "end_offset": 9, "type": "n", "position": 3 },
    { "token": "有限公司", "start_offset": 9, "end_offset": 13, "type": "nis", "position": 4 }
  ]
}
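The offsets in this response can be checked against the input text. The Python sketch below uses the five tokens from the response above as sample data and verifies two things: each token equals the slice of the input between its `start_offset` and `end_offset`, and the tokens cover the input with no gaps or overlaps. The helper names `tokens_match_text` and `covers_whole_text` are hypothetical. The tag descriptions in `TAG_NOTES` are an assumption based on my reading of HanLP's part-of-speech tag set, not something the article states, so check them against the HanLP documentation.

```python
import json

TEXT = "重慶華龍網海數科技有限公司"

# Sample data copied from the response shown above.
RESPONSE = json.loads("""
{"tokens": [
  {"token": "重慶", "start_offset": 0, "end_offset": 2, "type": "ns", "position": 0},
  {"token": "華龍網", "start_offset": 2, "end_offset": 5, "type": "nr", "position": 1},
  {"token": "海數", "start_offset": 5, "end_offset": 7, "type": "nr", "position": 2},
  {"token": "科技", "start_offset": 7, "end_offset": 9, "type": "n", "position": 3},
  {"token": "有限公司", "start_offset": 9, "end_offset": 13, "type": "nis", "position": 4}
]}
""")

# Assumed meanings of the tags in the "type" field (not stated in the article).
TAG_NOTES = {
    "ns": "place name",
    "nr": "person name",
    "n": "noun",
    "nis": "organization suffix",
}


def tokens_match_text(text, tokens):
    """Return True if every token equals text[start_offset:end_offset].

    Python indexes strings by code point. That matches the offsets here
    because every character in the sample is in the Basic Multilingual Plane.
    """
    return all(
        text[t["start_offset"]:t["end_offset"]] == t["token"] for t in tokens
    )


def covers_whole_text(text, tokens):
    """Return True if the tokens tile the text with no gaps or overlaps."""
    expected_start = 0
    for t in sorted(tokens, key=lambda t: t["start_offset"]):
        if t["start_offset"] != expected_start:
            return False
        expected_start = t["end_offset"]
    return expected_start == len(text)


if __name__ == "__main__":
    tokens = RESPONSE["tokens"]
    print(tokens_match_text(TEXT, tokens))
    print(covers_whole_text(TEXT, tokens))
    for t in tokens:
        print(t["token"], t["type"], TAG_NOTES.get(t["type"], "unknown tag"))
```

Compared with the thirteen single-character tokens produced by default, the same thirteen characters are now grouped into five word-level tokens, each carrying a part-of-speech tag in its `type` field.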

HanLP offers much more than basic Chinese word segmentation, and many of its other features could also be integrated into Elasticsearch.

Source: reposted from https://www.cnblogs.com/tiantiankong/p/10102841.html
