Using the IK Analyzer

Published: 2023/12/10

The main purpose of this post is to show how the IK analyzer is used together with Elasticsearch (ES).

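Introduction to the IK analyzer
IKAnalyzer is an open-source, lightweight Chinese word-segmentation toolkit written in Java. Since version 1.0 was released in December 2006, IKAnalyzer has gone through three major versions. It started out as a Chinese segmentation component built around the open-source Lucene project, combining dictionary-based segmentation with grammar-analysis algorithms. The newer IKAnalyzer 3.0 became a general-purpose segmentation component for Java that is independent of Lucene, while still shipping a default implementation optimized for Lucene.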

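The features of IK analyzer 3.0 are as follows:

1) It uses a distinctive "forward iterative finest-grained segmentation algorithm", with a stated throughput of 600,000 characters per second.

2) It uses a multi-subprocessor analysis mode that handles English letters (IP addresses, emails, URLs), numbers (dates, common Chinese quantifiers, Roman numerals, scientific notation), and Chinese words (including person and place names).

3) Mixed Chinese/English text is not handled well and is awkward to work with; it takes a second query. On the other hand, the dictionary storage is optimized, supports personal entries, and has a smaller memory footprint.

4) It supports user-defined dictionary extensions.

5) It provides IKQueryParser, a query analyzer optimized for Lucene full-text search. It uses an ambiguity-analysis algorithm to optimize how the query keywords are combined, which can greatly improve Lucene's hit rate.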
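
To get a feel for what "finest-grained" dictionary segmentation means, here is a toy Python sketch. It is not IK's actual algorithm: the function name, the mini-dictionary, and the rule for uncovered characters are all assumptions made for illustration, and the result does not exactly match what IK produces.

```python
def finest_grained_tokens(text, dictionary):
    """Emit every dictionary word found at each position, scanning left to right.

    Toy illustration only: the real IK implementation is more involved,
    and this sketch does not reproduce its output exactly.
    """
    max_len = max(len(word) for word in dictionary)
    tokens = []
    covered_end = 0  # index just past the last character covered by any token
    for start in range(len(text)):
        matched = False
        # Try longer candidates first so the longest word is listed first.
        for length in range(max_len, 0, -1):
            end = start + length
            if end > len(text):
                continue
            candidate = text[start:end]
            if candidate in dictionary:
                tokens.append(candidate)
                covered_end = max(covered_end, end)
                matched = True
        # A character that no dictionary word covers becomes its own token.
        if not matched and start >= covered_end:
            tokens.append(text[start])
            covered_end = start + 1
    return tokens


# Hypothetical mini-dictionary, chosen only for this example.
DICTIONARY = {"聯想", "全球", "最大", "筆記本", "筆記", "本廠", "廠商"}
tokens = finest_grained_tokens("聯想是全球最大的筆記本廠商", DICTIONARY)
print(tokens)
```

The point of the sketch is that overlapping words such as 筆記本, 筆記 and 本廠 are all kept instead of picking a single best split.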

Setting up the IK analyzer
For the installation steps, see https://www.cnblogs.com/sjfxwj/p/14547402.html

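Testing the IK analyzer
IK provides two segmentation modes, ik_smart and ik_max_word: ik_smart produces the coarsest split (the fewest tokens), while ik_max_word produces the finest-grained split.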

ik_max_word
Input:

curl -XGET "http://192.168.140.101:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "text": "聯想是全球最大的筆記本廠商", "tokenizer": "ik_max_word"
}'
Output:

[root@s101 /usr/local/software/elasticsearch-head]#curl -XGET "http://192.168.140.101:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "text": "聯想是全球最大的筆記本廠商", "tokenizer": "ik_max_word"
}'
{
  "tokens" : [
    {
      "token" : "聯想",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "全球",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "最大",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "筆記本",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "筆記",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "本廠",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "本",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "CN_CHAR",
      "position" : 8
    },
    {
      "token" : "廠商",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 9
    }
  ]
}

ik_smart
Input:

curl -XGET "http://192.168.140.101:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "text": "聯想是全球最大的筆記本廠商", "tokenizer": "ik_smart"
}'
Output:

[root@s101 /usr/local/software/elasticsearch-head]#curl -XGET "http://192.168.140.101:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "text": "聯想是全球最大的筆記本廠商", "tokenizer": "ik_smart"
}'
{
  "tokens" : [
    {
      "token" : "聯想",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "全球",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "最大",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "筆記本",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "廠商",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 6
    }
  ]
}
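
The difference between the two modes is easier to see when the two responses are compared programmatically. The Python sketch below rebuilds both token lists from the outputs shown in this post and reports which tokens only ik_max_word emits, and whether any tokens overlap. The helper names are made up for this example.

```python
def as_response(rows):
    """Build a dict shaped like the _analyze responses shown in this post."""
    return {
        "tokens": [
            {"token": text, "start_offset": start, "end_offset": end,
             "type": kind, "position": position}
            for position, (text, start, end, kind) in enumerate(rows)
        ]
    }


# Token data copied from the ik_max_word and ik_smart outputs.
MAX_WORD = as_response([
    ("聯想", 0, 2, "CN_WORD"), ("是", 2, 3, "CN_CHAR"), ("全球", 3, 5, "CN_WORD"),
    ("最大", 5, 7, "CN_WORD"), ("的", 7, 8, "CN_CHAR"), ("筆記本", 8, 11, "CN_WORD"),
    ("筆記", 8, 10, "CN_WORD"), ("本廠", 10, 12, "CN_WORD"), ("本", 10, 11, "CN_CHAR"),
    ("廠商", 11, 13, "CN_WORD"),
])
SMART = as_response([
    ("聯想", 0, 2, "CN_WORD"), ("是", 2, 3, "CN_CHAR"), ("全球", 3, 5, "CN_WORD"),
    ("最大", 5, 7, "CN_WORD"), ("的", 7, 8, "CN_CHAR"), ("筆記本", 8, 11, "CN_WORD"),
    ("廠商", 11, 13, "CN_WORD"),
])


def token_texts(response):
    """Return just the token strings, in response order."""
    return [entry["token"] for entry in response["tokens"]]


def has_overlapping_tokens(response):
    """True if some token starts before an earlier token has ended."""
    furthest_end = 0
    for entry in response["tokens"]:
        if entry["start_offset"] < furthest_end:
            return True
        furthest_end = max(furthest_end, entry["end_offset"])
    return False


smart_texts = set(token_texts(SMART))
extra_in_max_word = [t for t in token_texts(MAX_WORD) if t not in smart_texts]
print(extra_in_max_word)  # ['筆記', '本廠', '本']
print(has_overlapping_tokens(MAX_WORD), has_overlapping_tokens(SMART))  # True False
```

For this sentence, ik_max_word adds three overlapping sub-tokens on top of the ik_smart split, while ik_smart covers the text exactly once.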

Using the IK analyzer with ES
First we create the index:

curl -XPUT 'http://192.168.140.101:9200/blog?pretty'

curl -H "Content-Type: application/json" -XPUT http://192.168.140.101:9200/blog/article/_mapping -d '{
  "properties": {
    "id": {
      "type": "long",
      "store": true,
      "index": false
    },
    "title": {
      "type": "text",
      "store": true,
      "index": true,
      "analyzer": "ik_smart"
    },
    "content": {
      "type": "text",
      "store": true,
      "index": true,
      "analyzer": "ik_smart"
    }
  }
}'
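
If the mapping body is generated from code rather than typed by hand, a small builder keeps the two text fields consistent. The Python sketch below is only an illustration; the function names are made up. The optional search_analyzer argument is an addition that is not part of the mapping above: it is a standard Elasticsearch mapping parameter that lets a field be indexed with one analyzer (for example ik_max_word) and searched with another (for example ik_smart).

```python
import json


def text_field(analyzer, search_analyzer=None):
    """Describe a stored, indexed text field analyzed with the given analyzer."""
    field = {"type": "text", "store": True, "index": True, "analyzer": analyzer}
    if search_analyzer is not None:
        # Not used in the post's mapping; included as an optional extension.
        field["search_analyzer"] = search_analyzer
    return field


def build_article_mapping(analyzer="ik_smart", search_analyzer=None):
    """Return a mapping body with the same three fields as the curl command."""
    return {
        "properties": {
            "id": {"type": "long", "store": True, "index": False},
            "title": text_field(analyzer, search_analyzer),
            "content": text_field(analyzer, search_analyzer),
        }
    }


mapping = build_article_mapping()
body = json.dumps(mapping, ensure_ascii=False, indent=2)
print(body)
```

With the defaults, the printed JSON has the same content as the body passed to curl with -d.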
Next we insert three documents into the index:

curl -H "Content-Type: application/json" -XPUT 'http://192.168.140.101:9200/blog/article/1?pretty' -d '{
  "id": 1,
  "title": "明日起70歲退休時代來臨日本終于邁出這一步",
  "content": "辛巴復出直播當天周圍路被封了街道回應"
}'

curl -H "Content-Type: application/json" -XPUT 'http://192.168.140.101:9200/blog/article/2?pretty' -d '{
  "id": 2,
  "title": "海南省市場監管局對椰樹涉嫌違法廣告進行立案調查",
  "content": "小米官宣造車 “賭”上全部的雷軍這次會贏嗎？"
}'

curl -H "Content-Type: application/json" -XPUT 'http://192.168.140.101:9200/blog/article/3?pretty' -d '{
  "id": 3,
  "title": "韓國禁止對朝發傳單卻被美國列為重大人權問題",
  "content": "辛巴復出直播當天周圍路被封了街道回應"
}'
Query method 2: a term query, which looks the keyword up in the inverted index as-is (the keyword is not analyzed):

curl -H "Content-Type: application/json" -XGET 'http://192.168.140.101:9200/blog/article/_search/?pretty' -d '{
  "query": {
    "term": {
      "title": "明日"
    }
  }
}'
Output:

[root@s101 /usr/local/software/elasticsearch-head]#curl -H "Content-Type: application/json" -XGET 'http://192.168.140.101:9200/blog/article/_search/?pretty' -d '{
"query": {"term": {"title": "明日"} }
}'
{
  "took" : 25,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "blog",
        "_type" : "article",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "id" : 1,
          "title" : "明日起70歲退休時代來臨日本終于邁出這一步",
          "content" : "辛巴復出直播當天周圍路被封了街道回應"
        }
      }
    ]
  }
}
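
Why a term query for 明日 matches document 1 can be illustrated with a toy inverted index in Python. The token lists below are an assumption about how the analyzer might split the three titles (the post does not show them), and the function names are made up for this sketch.

```python
from collections import defaultdict

# Hypothetical token lists for the three titles; the real ik_smart output
# for these titles is not shown in this post.
ANALYZED_TITLES = {
    1: ["明日", "起", "70", "歲", "退休", "時代", "來臨", "日本", "終于", "邁出", "這", "一步"],
    2: ["海南省", "市場", "監管局", "對", "椰樹", "涉嫌", "違法", "廣告", "進行", "立案", "調查"],
    3: ["韓國", "禁止", "對", "朝", "發", "傳單", "卻", "被", "美國", "列為", "重大", "人權", "問題"],
}


def build_inverted_index(analyzed_docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in analyzed_docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def term_query(index, term):
    """Look the term up as-is; nothing is analyzed on the query side."""
    return sorted(index.get(term, set()))


index = build_inverted_index(ANALYZED_TITLES)
print(term_query(index, "明日"))        # [1]
print(term_query(index, "明日起70歲"))  # []
```

The second lookup returns nothing because the whole string is not a single token in the index, which is the practical consequence of a term query not analyzing its input.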
Query method 3: a query_string query, which analyzes the query string first (string -> the analyzer tokenizes it -> the resulting tokens are looked up in the inverted index):

curl -H "Content-Type: application/json" -XGET 'http://192.168.140.101:9200/blog/article/_search/?pretty' -d '{
  "query": {
    "query_string": {
      "default_field": "title",
      "query": "日本和美國"
    }
  }
}'
Output:

[root@s101 /usr/local/software/elasticsearch-head]#curl -H "Content-Type: application/json" -XGET 'http://192.168.140.101:9200/blog/article/_search/?pretty' -d '{
"query": {"query_string": {"default_field": "title","query": "日本和美國"} }
}'
{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "blog",
        "_type" : "article",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "id" : 1,
          "title" : "明日起70歲退休時代來臨日本終于邁出這一步",
          "content" : "辛巴復出直播當天周圍路被封了街道回應"
        }
      },
      {
        "_index" : "blog",
        "_type" : "article",
        "_id" : "3",
        "_score" : 0.2876821,
        "_source" : {
          "id" : 3,
          "title" : "韓國禁止對朝發傳單卻被美國列為重大人權問題",
          "content" : "辛巴復出直播當天周圍路被封了街道回應"
        }
      }
    ]
  }
}
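
The string -> tokens -> lookup pipeline can be sketched in Python as follows. Everything here is a simplified stand-in: toy_analyze is a greedy longest-match splitter rather than IK, the postings table is a hypothetical excerpt of the title index, and scoring is ignored. The sketch combines the tokens with OR, which is the default operator of query_string.

```python
# Hypothetical excerpt of the inverted index for the title field:
# term -> ids of the documents whose title contains it.
POSTINGS = {"明日": {1}, "日本": {1}, "韓國": {3}, "美國": {3}}


def toy_analyze(text, vocabulary):
    """Greedy longest-match split; a stand-in for a real analyzer such as IK."""
    max_len = max(len(term) for term in vocabulary)
    tokens, pos = [], 0
    while pos < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in vocabulary or length == 1:
                tokens.append(candidate)
                pos += length
                break
    return tokens


def query_string_search(postings, query):
    """Analyze the query, then OR together the postings of every token."""
    matched = set()
    for token in toy_analyze(query, postings.keys()):
        matched |= postings.get(token, set())
    return sorted(matched)


print(toy_analyze("日本和美國", POSTINGS))          # ['日本', '和', '美國']
print(query_string_search(POSTINGS, "日本和美國"))  # [1, 3]
```

This mirrors the result above: 日本 is found in the title of document 1 and 美國 in the title of document 3, so both documents are returned.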
