Getting Started with Elasticsearch: Installing the IK Analyzer from Scratch

Background

I needed to run aggregations in ES for statistical analysis, but the aggregation field values were Chinese, and ES's default analyzer handles Chinese very poorly: it splits complete Chinese words into a series of individual characters and aggregates on those, which is clearly not what I wanted. Let's look at an example:
POST http://192.168.80.133:9200/my_index_name/my_type_name/_search

```json
{
  "size": 0,
  "query": {
    "range": {
      "time": {"gte": 1513778040000, "lte": 1513848720000}
    }
  },
  "aggs": {
    "keywords": {
      "terms": {"field": "keywords"},
      "aggs": {
        "emotions": {
          "terms": {"field": "emotion"}
        }
      }
    }
  }
}
```

The output:

```json
{
  "took": 22,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "failed": 0},
  "hits": {"total": 32, "max_score": 0.0, "hits": []},
  "aggregations": {
    "keywords": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "力",            # the complete word has been split into individual characters
          "doc_count": 2,
          "emotions": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {"key": -1, "doc_count": 1},
              {"key": 0, "doc_count": 1}
            ]
          }
        },
        {
          "key": "動",
          "doc_count": 2,
          "emotions": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {"key": -1, "doc_count": 1},
              {"key": 0, "doc_count": 1}
            ]
          }
        }
      ]
    }
  }
}
```

Since ES's default analyzer handles Chinese so poorly, is there an analyzer that does support Chinese? And if so, how do you use it?
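The bucket fragmentation above can be reproduced offline with a short simulation (my own illustration, not from the IK project): the ES standard analyzer emits one token per CJK character, so a terms aggregation on an analyzed field buckets those single-character tokens rather than the original values. A rough sketch with made-up docs:

```python
from collections import Counter

def standard_analyze(text):
    """Rough simulation of the ES standard tokenizer on Chinese text:
    it emits one token per CJK character (non-CJK handling is ignored here)."""
    return [ch for ch in text if '\u4e00' <= ch <= '\u9fff']

# Keyword values from two hypothetical docs
docs = [["動力", "外觀"], ["動力", "油耗"]]

# Aggregating on an analyzed field buckets the tokens, not the values
buckets = Counter(tok for doc in docs
                  for kw in doc
                  for tok in standard_analyze(kw))
print(buckets.most_common(3))   # "動力" fragments into "動" and "力"
```

This is exactly the shape of the response above: the word 動力 never appears as a bucket key, only its constituent characters do.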
第一個(gè)問題,萬能的谷歌告訴了我結(jié)果,已經(jīng)有了支持中文的分詞器,而且是開源實(shí)現(xiàn):IK Analysis for Elasticsearch,詳見:https://github.com/medcl/elasticsearch-analysis-ik。
秉著“拿來主義”不重復(fù)造輪子的指導(dǎo)思想,直接先拿過來使用一下,看看效果怎么樣。那么,如何使用IK分詞器呢?其實(shí)這是一個(gè)ES插件,直接安裝并對ES進(jìn)行相應(yīng)的配置即可。
Installing the IK Analyzer

My ES version is 2.4.1, so the IK version to download is 1.10.1 (note: you must download the IK version matching your ES version, otherwise it will not work).

1. Download and build IK
```shell
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v1.10.1/elasticsearch-analysis-ik-1.10.1.zip
unzip elasticsearch-analysis-ik-1.10.1.zip
cd elasticsearch-analysis-ik-1.10.1
mvn clean package
```

This generates the packaged file elasticsearch-analysis-ik-1.10.1.zip under the elasticsearch-analysis-ik-1.10.1/target/releases directory.
2. Install the IK plugin in ES

Copy the packaged IK plugin elasticsearch-analysis-ik-1.10.1.zip into the ES plugins directory and unzip it there.

```shell
unzip elasticsearch-analysis-ik-1.10.1.zip
rm -rf elasticsearch-analysis-ik-1.10.1.zip  # you must delete this zip after extraction, otherwise ES reports an error on startup
```

Then restart ES.
Using the IK Analyzer

Once the IK analyzer is installed, you can use it in ES.

Step 1: Create an index

PUT http://192.168.80.133:9200/my_index_name

Step 2: Add a mapping for the doc field you will use
I need to run aggregations on the keywords field of the docs I store in ES, so I add a mapping for the keywords field:
POST http://192.168.80.133:9200/my_index_name/my_type_name/_mapping {"properties": {"keywords": { # 設(shè)置keywords字段使用ik分詞器"type": "string","store": "no","analyzer": "ik_smart","search_analyzer": "ik_smart","boost": 8}} }注意: 在設(shè)置mapping時(shí)有一個(gè)小插曲,我根據(jù)IK的官網(wǎng)設(shè)置“keywords”的type為“text”時(shí)報(bào)錯(cuò):
POST http://192.168.80.133:9200/my_index_name/my_type_name/_mapping

```json
{
  "properties": {
    "keywords": {
      "type": "text",        # the text type is not supported in 2.4.1
      "store": "no",
      "analyzer": "ik_smart",
      "search_analyzer": "ik_smart",
      "boost": 8
    }
  }
}
```

The error:

```json
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "No handler for type [text] declared on field [keywords]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "No handler for type [text] declared on field [keywords]"
  },
  "status": 400
}
```

This happens because my ES version is fairly old (2.4.1), while the text type was only added in ES 5.0, so it is not supported here. On ES 2.4.1 you have to use the string type instead.
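For code that targets clusters on both sides of the 5.0 boundary, the field type can be chosen from the cluster version string. A small helper (my own illustration, not from the post or the IK docs):

```python
def text_type_for(es_version):
    """Return the analyzed-string field type for a given ES version:
    "string" before 5.0, "text" from 5.0 on (when it was introduced)."""
    major = int(es_version.split(".")[0])
    return "string" if major < 5 else "text"

print(text_type_for("2.4.1"))  # string
print(text_type_for("5.6.0"))  # text
```

In a real client you would read the version from the cluster root endpoint (GET /) rather than hard-coding it.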
Step 3: Add a doc

POST http://192.168.80.133:9200/my_index_name/my_type_name/

```json
{
  "nagtive_kw": ["動力", "外觀", "油耗"],
  "is_all": false,
  "emotion": 0,
  "focuce": false,
  "keywords": ["動力", "外觀", "油耗"],   // run the aggregation on the keywords field
  "source": "汽車之家",
  "time": -1,
  "machine_emotion": 0,
  "title": "從動次打次吃大餐",
  "spider": "qczj_index",
  "content": {},
  "url": "http://xxx",
  "brand": "寶馬",
  "series": "寶馬1系",
  "model": "2017款"
}
```

Step 4: Aggregation analysis
POST http://192.168.80.133:9200/my_index_name/my_type_name/_search

```json
{
  "size": 0,
  "query": {
    "range": {
      "time": {"gte": 1513778040000, "lte": 1513848720000}
    }
  },
  "aggs": {
    "keywords": {
      "terms": {"field": "keywords"},
      "aggs": {
        "emotions": {
          "terms": {"field": "emotion"}
        }
      }
    }
  }
}
```

The output:

```json
{
  "took": 22,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "failed": 0},
  "hits": {"total": 32, "max_score": 0.0, "hits": []},
  "aggregations": {
    "keywords": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "動力",          # the complete word is no longer split into individual characters
          "doc_count": 2,
          "emotions": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {"key": -1, "doc_count": 1},
              {"key": 0, "doc_count": 1}
            ]
          }
        }
      ]
    }
  }
}
```

References
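The bucket structure of this query, a terms aggregation with an emotions sub-aggregation, can be sketched offline as well: with ik_smart, each keyword survives analysis as a single token, so the buckets line up with the original values. A simulation with made-up docs (my own illustration):

```python
from collections import Counter, defaultdict

# Hypothetical docs as (keywords, emotion) pairs; with ik_smart each
# keyword stays one token, so it buckets as a whole value.
docs = [(["動力", "外觀"], -1), (["動力", "油耗"], 0)]

keyword_buckets = Counter()
emotion_sub = defaultdict(Counter)   # keyword -> emotion -> doc count
for keywords, emotion in docs:
    for kw in keywords:
        keyword_buckets[kw] += 1
        emotion_sub[kw][emotion] += 1

print(keyword_buckets["動力"])       # 2: the whole word is a single bucket
print(dict(emotion_sub["動力"]))     # {-1: 1, 0: 1}
```

This mirrors the response above: one bucket per keyword, with nested emotion buckets counting docs per emotion value.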
http://www.cnblogs.com/xing901022/p/5910139.html (How to install Chinese analyzers in Elasticsearch: IK + pinyin)
https://elasticsearch.cn/question/47 (A question about aggregations (aggs))
https://github.com/medcl/elasticsearch-analysis-ik/issues/276 (No handler for type [text] declared on field [content] when creating a mapping, #276)
http://blog.csdn.net/guo_jia_liang/article/details/52980716 (Elasticsearch 2.4 study notes, part 3: plugin installation in detail)
轉(zhuǎn)載于:https://www.cnblogs.com/nuccch/p/8207261.html