

Using Python with Elasticsearch: Create, Delete, Update, and Query

Published: 2024/7/23


GitHub: https://github.com/elastic/elasticsearch-py/blob/master/docs/index.rst

Documentation: https://elasticsearch-py.readthedocs.io/en/latest/index.html

Python-ElasticSearch — writing, updating, deleting, and searching in ES from Python:
https://blog.csdn.net/u013429010/article/details/81746179

Operating Elasticsearch from Python 3: https://blog.csdn.net/qq_41262248/article/details/100671930

An Introduction to Elasticsearch

Looking up data means search, and search needs a search engine. Baidu and Google are enormously complex engines that index practically every open page and dataset on the internet. For our own business data, nothing that elaborate is necessary. If we want a search engine of our own for convenient storage and retrieval, Elasticsearch is the obvious choice: a full-text search engine that can store, search, and analyze massive amounts of data quickly.

Why Use Elasticsearch?

Elasticsearch is an open-source search engine built on top of Apache Lucene, a full-text search library.

So what is Lucene? Lucene may be the most advanced, high-performance, full-featured search engine library in existence, open source or proprietary — but it is only a library. To use it you have to write Java and link against the Lucene package, and you need some grounding in information retrieval to understand how it works. In short, it is not easy to use directly.

Elasticsearch was born to solve that problem. It is also written in Java and uses Lucene internally for indexing and search, but its goal is to make full-text retrieval simple. It is effectively a layer over Lucene, exposing a simple, consistent RESTful API for storage and retrieval.

So is Elasticsearch just a simplified wrapper around Lucene? Far from it. Elasticsearch is more than Lucene, and more than a full-text search engine. It can be accurately described as:

  • a distributed real-time document store where every field can be indexed and searched
  • a distributed real-time analytics search engine
  • capable of scaling to hundreds of server nodes and handling petabytes of structured or unstructured data

In short, it is a remarkably capable search engine — Wikipedia, Stack Overflow, and GitHub all use it for search.

Elasticsearch Concepts

Node 和 Cluster

Elasticsearch is essentially a distributed database: multiple servers can work together, and each server can run several Elasticsearch instances. A single instance is called a node (Node); a group of nodes forms a cluster (Cluster).

Index

Elasticsearch indexes every field, writing the processed data into an inverted index that queries are run against directly. The top-level unit of data management is therefore called an Index, roughly equivalent to a database in MySQL or MongoDB. Note that every Index name must be lowercase.

Document

A single record inside an Index is called a Document; many Documents make up an Index. A Document is represented in JSON. Documents in the same Index are not required to share a schema, but keeping them uniform helps search efficiency.
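For instance, a minimal document for the weather Index mentioned below might look like this (the field names are illustrative, not from any real mapping):

```python
import json

# A hypothetical document: just an ordinary JSON object / Python dict.
doc = {
    "city": "Beijing",
    "weather": "sunny",
    "temperature": 26,
    "date": "2011-12-16",
}

# This is the exact JSON body you would hand to es.index(...).
print(json.dumps(doc, ensure_ascii=False))
```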

Type

Documents can be grouped: in a weather Index, for example, by city (Beijing, Shanghai) or by conditions (sunny, rainy). Such a grouping is called a Type — a virtual, logical grouping used to filter Documents, similar to a table in MySQL or a Collection in MongoDB. Different Types should have similar schemas; an id field cannot be a string in one group and a number in another, which is one way Types differ from relational tables. Data of entirely different natures (say, products and logs) should live in two Indexes, not as two Types inside one Index (even though that is possible).

Per the Elastic roadmap, version 6.x allows only one Type per Index, and 7.x removes Types entirely.

Fields

A field is a key in a Document's JSON structure. Each Document contains many fields, each with a value — much like the columns of a row in a MySQL table.

In Elasticsearch, Documents belong to a Type, and Types live inside an Index. A rough comparison with a traditional relational database:

Relational DB -> Databases -> Tables -> Rows      -> Columns
Elasticsearch -> Indices   -> Types  -> Documents -> Fields

Those are the basic Elasticsearch concepts; the relational analogy makes them easier to digest.


Python Elasticsearch Client

elasticsearch-py is the official low-level Elasticsearch client. It aims to provide a common foundation for all Elasticsearch-related Python code, so it tries to be unopinionated and extensible.

For a higher-level, more narrowly scoped client, see elasticsearch-dsl ( https://elasticsearch-dsl.readthedocs.io/en/latest/ ), a more Pythonic library that sits on top of elasticsearch-py.

Compatibility

The library is compatible with all Elasticsearch versions since 0.90, but you must use a matching major version:

For Elasticsearch 6.0 and later, use the major version 6 (6.x.y) of the library.

For Elasticsearch 5.0 and later, use the major version 5 (5.x.y) of the library.

For Elasticsearch 2.0 and later, use the major version 2 (2.x.y) of the library, and so on.

The recommended way to set your requirements in your setup.py or requirements.txt is:

# Elasticsearch 6.x
elasticsearch>=6.0.0,<7.0.0

# Elasticsearch 5.x
elasticsearch>=5.0.0,<6.0.0

# Elasticsearch 2.x
elasticsearch>=2.0.0,<3.0.0

Installation

Install the elasticsearch package with pip:

pip install elasticsearch
# Douban mirror
pip install elasticsearch -i https://pypi.doubanio.com/simple/

Connecting to Elasticsearch from Python

There are several ways to connect:

from elasticsearch import Elasticsearch

# es = Elasticsearch()                    # connect to the local node by default
# es = Elasticsearch(['127.0.0.1:9200'])  # connect to local port 9200
es = Elasticsearch(
    ["192.168.1.10", "192.168.1.11", "192.168.1.12"],  # cluster node IPs as a list
    sniff_on_start=True,            # sniff the cluster before doing anything
    sniff_on_connection_fail=True,  # refresh the node list when a node stops responding
    sniff_timeout=60                # sniffing timeout
)

Connecting to a specific host:

es = Elasticsearch(
    ['172.16.153.129:9200'],
    # authentication
    # http_auth=('elastic', 'changeme')
)

Dynamic (sniffed) connections:

es = Elasticsearch(
    ['esnode1:port', 'esnode2:port'],
    # sniff before doing anything
    sniff_on_start=True,
    # refresh and reconnect when a node stops responding
    sniff_on_connection_fail=True,
    # re-sniff every 60 seconds
    sniffer_timeout=60
)

Giving different nodes different parameters:

es = Elasticsearch([
    {'host': 'localhost'},
    {'host': 'othernode', 'port': 443, 'url_prefix': 'es', 'use_ssl': True},
])

If SSL is in use:

es = Elasticsearch(
    ['localhost:443', 'other_host:443'],
    # turn on SSL
    use_ssl=True,
    # verify SSL certificates (off by default)
    verify_certs=True,
    # path to CA certificates
    ca_certs='/path/to/CA_certs',
    # PEM-format SSL client certificate
    client_cert='/path/to/clientcert.pem',
    # PEM-format SSL client key
    client_key='/path/to/clientkey.pem'
)

Ignoring response status codes:

es = Elasticsearch(['127.0.0.1:9200'], ignore=400)             # ignore 400 responses
es = Elasticsearch(['127.0.0.1:9200'], ignore=[400, 405, 502]) # ignore several codes, given as a list

A minimal example:

from elasticsearch import Elasticsearch

es = Elasticsearch()  # connect to the local node by default
print(es.index(index='py2', doc_type='doc', id=1, body={'name': "張開", "age": 18}))
print(es.get(index='py2', doc_type='doc', id=1))

The first print creates the py2 index and inserts one document; the second print fetches that document.

The result:

{'_index': 'py2', '_type': 'doc', '_id': '1', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1}
{'_index': 'py2', '_type': 'doc', '_id': '1', '_version': 1, 'found': True, '_source': {'name': '張開', 'age': 18}}

Example 1:

from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch()

doc = {
    'author': 'kimchy',
    'text': 'Elasticsearch: cool. bonsai cool.',
    'timestamp': datetime.now(),
}
res = es.index(index="test-index", doc_type='tweet', id=1, body=doc)
print(res['result'])

res = es.get(index="test-index", doc_type='tweet', id=1)
print(res['_source'])

es.indices.refresh(index="test-index")

res = es.search(index="test-index", body={"query": {"match_all": {}}})
print("Got %d Hits:" % res['hits']['total']['value'])
for hit in res['hits']['hits']:
    print("%(timestamp)s %(author)s: %(text)s" % hit["_source"])

Example 2:

# -*- coding: utf-8 -*-

from elasticsearch import Elasticsearch

# host defaults to localhost and port to 9200, but both can be specified
es = Elasticsearch()

# Insert data. The index and doc_type names are up to you, id can be
# assigned as needed, and body is the content.
es.index(index="my_index", doc_type="test_type", id=0, body={"name": "python", "addr": "深圳"})
es.index(index="my_index", doc_type="test_type", id=1, body={"name": "python", "addr": "深圳"})

# create() also inserts data, but it requires an explicit id to uniquely
# identify the document; index() does not — without an id it generates one.
es.create(index="my_index", doc_type="test_type", id=1, body={"name": "python", "addr": "深圳"})

# delete the document with the given index, doc_type, and id
es.delete(index='indexName', doc_type='typeName', id=1)
# delete an index
es.indices.delete(index='news', ignore=[400, 404])

query = {'query': {'match_all': {}}}                 # match all documents
query1 = {'query': {'match': {'sex': 'female'}}}     # all female documents
query2 = {'query': {'range': {'age': {'lt': 11}}}}   # all documents with age < 11
query3 = {'query': {'term': {'name': 'jack'}}}       # all documents named jack

# delete every document matching the query
es.delete_by_query(index="my_index", doc_type="test_type", body=query)

# get: fetch the document with the given index, doc_type, and id
es.get(index="my_index", doc_type="test_type", id=1)

# search: find all documents matching a query; there is no id argument,
# and index, doc_type, and body may each be None
result = es.search(index="my_index", doc_type="test_type", body=query)
print(result['hits']['hits'][0])  # the first matching document

# update: update the document with the given index, doc_type, and id
# Two key points:
# 1. an id is required
# 2. body={"doc": <fields>} — the "doc" wrapper is mandatory
es.update(index="my_index", doc_type="test_type", id=1, body={"doc": {"name": "python1", "addr": "深圳1"}})

Getting cluster information:

Is the cluster up?
In [40]: es.ping()
Out[40]: True

Basic cluster info
In [39]: es.info()

Cluster health
In [41]: es.cluster.health()

Info about the currently connected cluster nodes
In [43]: es.cluster.client.info()

All indices currently in the cluster
In [55]: print(es.cat.indices())

More cluster statistics
es.cluster.stats()

The client's cat property gives more human-readable output:

In [85]: es.cat.health()
Out[85]: '1510431262 04:14:22 sharkyun yellow 1 1 6 6 0 0 6 0 - 50.0%\n'

In [86]: es.cat.master()
Out[86]: 'VXgFbKAaTtGO5a1QAfdcLw 172.16.153.129 172.16.153.129 master\n'

In [87]: es.cat.nodes()
Out[87]: '172.16.153.129 27 49 0 0.02 0.01 0.00 mdi * master\n'

In [88]: es.cat.indices()

In [89]: es.cat.count()
Out[89]: '1510431323 04:15:23 301002\n'

In [90]: es.cat.plugins()
Out[90]: ''

In [91]: es.cat.templates()
Out[91]: 'logstash logstash-* 0 50001\nfilebeat filebeat-* 0 \n'
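The cat endpoints return plain whitespace-separated text rather than JSON, so the usual last step is a split. A small sketch pulling fields out of the es.cat.health() line shown above (the string is copied from that sample output, not fetched live, and the column names assume this version of Elasticsearch):

```python
# Sample output of es.cat.health(), copied from the session above.
line = '1510431262 04:14:22 sharkyun yellow 1 1 6 6 0 0 6 0 - 50.0%\n'

# Columns: epoch, timestamp, cluster, status, node.total, node.data,
# shards, pri, relo, init, unassign, pending_tasks, max_task_wait_time,
# active_shards_percent.
fields = line.split()
cluster_name, status = fields[2], fields[3]
print(cluster_name, status)  # → sharkyun yellow
```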

Tasks:

es.tasks.get()
es.tasks.list()

Single-Document Operations

Check the cluster state:

from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost", "port": 9200}])
print(es.cluster.state())

Check cluster health:

from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost", "port": 9200}])
print(es.cluster.health())

Insert a document:

from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost", "port": 9200}])
print(es.cluster.state())
b = {"name": 'lu', 'sex': 'female', 'age': 10}
es.index(index='bank', doc_type='typeName', body=b, id=None)
print(es.cluster.state())

create() requires an id to uniquely identify the document; index() does not, and generates an id automatically if none is given. Internally, create() is a wrapper that calls index().

Delete a document:

from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost", "port": 9200}])
es.delete(index='bank', doc_type='typeName', id='idValue')

Update a document:

from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost", "port": 9200}])
es.update(index='bank', doc_type='typeName', id='idValue', body={<fields to update>})

index() can also be used to update: it performs two operations in one, inserting the document if it does not exist and updating it if it does.

Fetch a document:

from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost", "port": 9200}])
find = es.get(index='bank', doc_type='typeName', id='idValue')
print(find)

Bulk Operations

Bulk-add documents from a JSON file:

from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost", "port": 9200}])
with open('./accounts.json', 'r', encoding='utf-8') as file:
    s = file.read()
    print(s)
    es.bulk(index='bank', doc_type='typeName', body=s)

Another bulk example:

# -*- coding: utf-8 -*-

from elasticsearch import Elasticsearch
import os

# a directory to index
path = r'C:\Users\Administrator\Desktop\files'
es = Elasticsearch()
doc = []
i = 1
# collect the absolute path and name of every file under the directory
for dirname, pathname, filenames in os.walk(path):
    for filename in filenames:
        doc.append({"index": {"_id": i}})
        doc.append({"filepath": os.path.join(dirname, filename)})
        i = i + 1
es.bulk(index="test", doc_type="text", body=doc)

Delete by query:

query = {'query': {'match': {'sex': 'female'}}}     # delete all female documents
query = {'query': {'range': {'age': {'lt': 11}}}}   # delete all documents with age < 11

es.delete_by_query(index='indexName', body=query, doc_type='typeName')

Update by query:

update_by_query updates every document matching a query; the body has the same shape as for delete_by_query and search above.
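A sketch of what such a request body might look like: the query part matches documents exactly as above, and a script (a hypothetical Painless one-liner here) describes the change. This only builds the body; it does not contact a cluster.

```python
# Hypothetical update_by_query body: bump the age of every document
# whose sex field is "female".
body = {
    # same query shape as for delete_by_query / search
    "query": {"match": {"sex": "female"}},
    # a Painless script describing the change (illustrative)
    "script": {"source": "ctx._source.age += 1"},
}

# Against a live cluster this would then be sent as:
# es.update_by_query(index="my_index", doc_type="test_type", body=body)
print(body)
```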

Search by query:

query = {'query': {'match_all': {}}}                # all documents
query = {'query': {'term': {'name': 'jack'}}}       # all documents named jack
query = {'query': {'range': {'age': {'gt': 11}}}}   # all documents with age > 11

allDoc = es.search(index='indexName', doc_type='typeName', body=query)
print(allDoc['hits']['hits'][0])  # the first matching document

Bulk insert, delete, and update:

doc = [
    {"index": {}},
    {'name': 'jackaaa', 'age': 2000, 'sex': 'female', 'address': u'北京'},
    {"index": {}},
    {'name': 'jackbbb', 'age': 3000, 'sex': 'male', 'address': u'上海'},
    {"index": {}},
    {'name': 'jackccc', 'age': 4000, 'sex': 'female', 'address': u'廣州'},
    {"index": {}},
    {'name': 'jackddd', 'age': 1000, 'sex': 'male', 'address': u'深圳'},
]

doc = [
    {'index': {'_index': 'indexName', '_type': 'typeName', '_id': 'idValue'}},
    {'name': 'jack', 'sex': 'male', 'age': 10},
    {'delete': {'_index': 'indexName', '_type': 'typeName', '_id': 'idValue'}},
    {"create": {'_index': 'indexName', "_type": 'typeName', '_id': 'idValue'}},
    {'name': 'lucy', 'sex': 'female', 'age': 20},
    {'update': {'_index': 'indexName', '_type': 'typeName', '_id': 'idValue'}},
    {'doc': {'age': '100'}},
]

es.bulk(index='indexName', doc_type='typeName', body=doc)

# Bulk updates can also be assembled action by action and written with
# the bulk helper (this snippet comes from inside a class, hence self):
for line in list:
    action = {
        "_index": self.index_name,
        "_type": self.index_type,
        "_id": i,  # _id can also be omitted and generated automatically
        "_source": {
            "date": line['date'],
            "source": line['source'].decode('utf8'),
            "link": line['link'],
            "keyword": line['keyword'].decode('utf8'),
            "title": line['title'].decode('utf8')}}
    i += 1
    ACTIONS.append(action)

success, _ = bulk(self.es, ACTIONS, index=self.index_name, raise_on_error=True)

The Python Elasticsearch client offers much more. Reference documentation:

https://elasticsearch-py.readthedocs.io/en/master/api.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs.html

Search all data:

es.search(index="my_index", doc_type="test_type")

# or
body = {
    "query": {
        "match_all": {}
    }
}
es.search(index="my_index", doc_type="test_type", body=body)

body = {
    "query": {
        "term": {
            "name": "python"
        }
    }
}
# all documents where name="python"
es.search(index="my_index", doc_type="test_type", body=body)

body = {
    "query": {
        "terms": {
            "name": ["python", "android"]
        }
    }
}
# all documents where name="python" or name="android"
es.search(index="my_index", doc_type="test_type", body=body)

match and multi_match:

# match: documents whose name contains the keyword "python"
body = {
    "query": {
        "match": {
            "name": "python"
        }
    }
}
es.search(index="my_index", doc_type="test_type", body=body)

# multi_match: documents whose name or addr contains the keyword "深圳"
body = {
    "query": {
        "multi_match": {
            "query": "深圳",
            "fields": ["name", "addr"]
        }
    }
}
es.search(index="my_index", doc_type="test_type", body=body)

ids:

body = {
    "query": {
        "ids": {
            "type": "test_type",
            "values": ["1", "2"]
        }
    }
}
# all documents with id 1 or 2
es.search(index="my_index", doc_type="test_type", body=body)

Compound queries: bool

bool supports three clause types: must (all clauses must match), should (at least one must match), and must_not (none may match).

body = {
    "query": {
        "bool": {
            "must": [
                {"term": {"name": "python"}},
                {"term": {"age": 18}}
            ]
        }
    }
}
# all documents where name="python" and age=18
es.search(index="my_index", doc_type="test_type", body=body)
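The three clause types can also be mixed in one bool query; a sketch combining them (field values are taken from the toy data used throughout this section):

```python
# Match documents whose name is "python", prefer those in 深圳,
# and exclude anyone whose age is exactly 18.
body = {
    "query": {
        "bool": {
            "must":     [{"term": {"name": "python"}}],
            "should":   [{"match": {"addr": "深圳"}}],
            "must_not": [{"term": {"age": 18}}],
        }
    }
}

# Against a live cluster:
# es.search(index="my_index", doc_type="test_type", body=body)
print(body)
```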

Slicing (pagination):

body = {
    "query": {"match_all": {}},
    "from": 2,  # start from the second document
    "size": 4   # return 4 documents
}
# fetch 4 documents starting from the 2nd
es.search(index="my_index", doc_type="test_type", body=body)

Range query:

body = {
    "query": {
        "range": {
            "age": {
                "gte": 18,  # >= 18
                "lte": 30   # <= 30
            }
        }
    }
}
# all documents with 18 <= age <= 30
es.search(index="my_index", doc_type="test_type", body=body)

Prefix query:

body = {
    "query": {
        "prefix": {
            "name": "p"
        }
    }
}
# all documents whose name starts with "p"
es.search(index="my_index", doc_type="test_type", body=body)

Wildcard query:

body = {
    "query": {
        "wildcard": {
            "name": "*id"
        }
    }
}
# all documents whose name ends with "id"
es.search(index="my_index", doc_type="test_type", body=body)

Sorting:

body = {
    "query": {"match_all": {}},
    "sort": {
        "age": {            # sort by the age field
            "order": "asc"  # asc ascending, desc descending
        }
    }
}

filter_path — response filtering:

# fetch only the _id field; separate multiple filters with commas
es.search(index="my_index", doc_type="test_type", filter_path=["hits.hits._id"])

# fetch all hit fields
es.search(index="my_index", doc_type="test_type", filter_path=["hits.hits._*"])

count — run a query and return the number of matches:

# number of documents
es.count(index="my_index", doc_type="test_type")

Metric aggregations — minimum:

body = {
    "query": {"match_all": {}},
    "aggs": {                    # aggregation
        "min_age": {             # key for the minimum
            "min": {
                "field": "age"   # minimum of the age field
            }
        }
    }
}
# search all documents and return the smallest age
es.search(index="my_index", doc_type="test_type", body=body)

Maximum:

body = {
    "query": {"match_all": {}},
    "aggs": {                    # aggregation
        "max_age": {             # key for the maximum
            "max": {
                "field": "age"   # maximum of the age field
            }
        }
    }
}
# search all documents and return the largest age
es.search(index="my_index", doc_type="test_type", body=body)

Sum:

body = {
    "query": {"match_all": {}},
    "aggs": {                    # aggregation
        "sum_age": {             # key for the sum
            "sum": {
                "field": "age"   # sum of all age values
            }
        }
    }
}
# search all documents and return the sum of all ages
es.search(index="my_index", doc_type="test_type", body=body)

Average:

body = {
    "query": {"match_all": {}},
    "aggs": {                    # aggregation
        "avg_age": {             # key for the average
            "avg": {             # (the original used "sum" here by mistake)
                "field": "age"   # average of all age values
            }
        }
    }
}
# search all documents and return the average age
es.search(index="my_index", doc_type="test_type", body=body)

More search usage: https://elasticsearch-py.readthedocs.io/en/master/api.html

Querying Data

The operations above are all straightforward — an ordinary database such as MongoDB can do them too, so nothing looks special yet. What sets Elasticsearch apart is its exceptionally powerful retrieval.

For Chinese text we need a word-segmentation plugin; here we use elasticsearch-analysis-ik (GitHub: https://github.com/medcl/elasticsearch-analysis-ik), installed with Elasticsearch's other command-line tool, elasticsearch-plugin. The version installed here is 6.2.4 — make sure it matches your Elasticsearch version:

elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.2.4/elasticsearch-analysis-ik-6.2.4.zip

Replace the version number with your own Elasticsearch version. Restart Elasticsearch after installation and the plugin loads automatically. First we create an index and specify the fields to be analyzed:

from elasticsearch import Elasticsearch

es = Elasticsearch()
mapping = {
    'properties': {
        'title': {
            'type': 'text',
            'analyzer': 'ik_max_word',
            'search_analyzer': 'ik_max_word'
        }
    }
}
es.indices.delete(index='news', ignore=[400, 404])
es.indices.create(index='news', ignore=400)
result = es.indices.put_mapping(index='news', doc_type='politics', body=mapping)
print(result)

Here we delete any existing index, create a new one, and update its mapping. The mapping declares the field to analyze, sets its type to text, and sets both analyzer and search_analyzer to ik_max_word — the Chinese analysis plugin we just installed. Without this, the default English analyzer would be used.

Next we insert a few records:

datas = [
    {
        'title': '美國留給伊拉克的是個爛攤子嗎',
        'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm',
        'date': '2011-12-16'
    },
    {
        'title': '公安部:各地校車將享最高路權(quán)',
        'url': 'http://www.chinanews.com/gn/2011/12-16/3536077.shtml',
        'date': '2011-12-16'
    },
    {
        'title': '中韓漁警沖突調(diào)查:韓警平均每天扣1艘中國漁船',
        'url': 'https://news.qq.com/a/20111216/001044.htm',
        'date': '2011-12-17'
    },
    {
        'title': '中國駐洛杉磯領(lǐng)事館遭亞裔男子槍擊,嫌犯已自首',
        'url': 'http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml',
        'date': '2011-12-18'
    }
]

for data in datas:
    es.index(index='news', doc_type='politics', body=data)

Each of the four records has title, url, and date fields, and each is inserted with index() into the news index under the politics type.

Now query by keyword:

result = es.search(index='news', doc_type='politics')
print(result)

All four inserted records come back:

{
  "took": 0,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {
    "total": 4,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "news",
        "_type": "politics",
        "_id": "c05G9mQBD9BuE5fdHOUT",
        "_score": 1.0,
        "_source": {
          "title": "美國留給伊拉克的是個爛攤子嗎",
          "url": "http://view.news.qq.com/zt2011/usa_iraq/index.htm",
          "date": "2011-12-16"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dk5G9mQBD9BuE5fdHOUm",
        "_score": 1.0,
        "_source": {
          "title": "中國駐洛杉磯領(lǐng)事館遭亞裔男子槍擊,嫌犯已自首",
          "url": "http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml",
          "date": "2011-12-18"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dU5G9mQBD9BuE5fdHOUj",
        "_score": 1.0,
        "_source": {
          "title": "中韓漁警沖突調(diào)查:韓警平均每天扣1艘中國漁船",
          "url": "https://news.qq.com/a/20111216/001044.htm",
          "date": "2011-12-17"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dE5G9mQBD9BuE5fdHOUf",
        "_score": 1.0,
        "_source": {
          "title": "公安部:各地校車將享最高路權(quán)",
          "url": "http://www.chinanews.com/gn/2011/12-16/3536077.shtml",
          "date": "2011-12-16"
        }
      }
    ]
  }
}

The matches appear under the hits field: total gives the number of results and max_score the highest match score.
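That structure can be walked in plain Python. A sketch using a hand-built, trimmed copy of the response (not fetched from a cluster):

```python
# A trimmed, hand-built copy of the search response shown above.
result = {
    "hits": {
        "total": 4,
        "max_score": 1.0,
        "hits": [
            {"_id": "c05G9mQBD9BuE5fdHOUT", "_score": 1.0,
             "_source": {"title": "美國留給伊拉克的是個爛攤子嗎", "date": "2011-12-16"}},
            {"_id": "dE5G9mQBD9BuE5fdHOUf", "_score": 1.0,
             "_source": {"title": "公安部:各地校車將享最高路權(quán)", "date": "2011-12-16"}},
        ],
    }
}

# Pull out just the titles — the usual last step after es.search(...).
titles = [hit["_source"]["title"] for hit in result["hits"]["hits"]]
print(result["hits"]["total"], titles)
```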

We can also run a full-text search, which is where Elasticsearch shows its strength as a search engine:

dsl = {
    'query': {
        'match': {
            'title': '中國 領(lǐng)事館'
        }
    }
}

es = Elasticsearch()
result = es.search(index='news', doc_type='politics', body=dsl)
print(json.dumps(result, indent=2, ensure_ascii=False))

This uses an Elasticsearch DSL query: match requests a full-text search on the title field for the text "中國 領(lǐng)事館" (China, consulate). The result:

{
  "took": 1,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {
    "total": 2,
    "max_score": 2.546152,
    "hits": [
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dk5G9mQBD9BuE5fdHOUm",
        "_score": 2.546152,
        "_source": {
          "title": "中國駐洛杉磯領(lǐng)事館遭亞裔男子槍擊,嫌犯已自首",
          "url": "http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml",
          "date": "2011-12-18"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dU5G9mQBD9BuE5fdHOUj",
        "_score": 0.2876821,
        "_source": {
          "title": "中韓漁警沖突調(diào)查:韓警平均每天扣1艘中國漁船",
          "url": "https://news.qq.com/a/20111216/001044.htm",
          "date": "2011-12-17"
        }
      }
    ]
  }
}

There are two matches. The first scores 2.54 because its title contains both "中國" and "領(lǐng)事館"; the second contains "中國" but not "領(lǐng)事館", so it is still retrieved but with the much lower score of 0.28.

So a search runs full-text retrieval over the given field and ranks the results by relevance to the query keywords — the skeleton of a real search engine.

Elasticsearch supports many more query types; see the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/6.3/query-dsl.html



Features

The client is designed as a very thin wrapper around Elasticsearch's REST API to allow maximum flexibility. That means the client has no opinions; it also means some APIs are slightly cumbersome to use from Python. To help with that, some helpers were created, and a higher-level library, elasticsearch-dsl (https://elasticsearch-dsl.readthedocs.io/en/latest/), was built on top to provide a more convenient way of working with Elasticsearch.

Persistent Connections

elasticsearch-py uses persistent connections inside of individual connection pools (one per each configured or sniffed node). Out of the box you can choose between two http protocol implementations. See Transport classes for more information.

The transport layer will create an instance of the selected connection class per node and keep track of the health of individual nodes — if a node becomes unresponsive (throwing exceptions while connecting to it) it's put on a timeout by the ConnectionPool class and only returned to the circulation after the timeout is over (or when no live nodes are left). By default nodes are randomized before being passed into the pool and a round-robin strategy is used for load balancing.

You can customize this behavior by passing parameters to the Connection Layer API (all keyword arguments to the Elasticsearch class will be passed through). If what you want to accomplish is not supported, you should be able to create a subclass of the relevant component and pass it in as a parameter to be used instead of the default implementation.

Automatic Retries

If a connection to a node fails due to connection issues (raises ConnectionError) it is considered in faulty state. It will be placed on hold for dead_timeout seconds and the request will be retried on another node. If a connection fails multiple times in a row, the timeout will get progressively larger to avoid hitting a node that's, by all indication, down. If no live connection is available, the connection that has the smallest timeout will be used.

By default retries are not triggered by a timeout (ConnectionTimeout); set retry_on_timeout to True to also retry on timeouts.

Sniffing

The client can be configured to inspect the cluster state to get a list of nodes upon startup, periodically, and/or on failure. See the Transport parameters for details.

Some example configurations:

from elasticsearch import Elasticsearch

# by default we don't sniff, ever
es = Elasticsearch()

# you can specify to sniff on startup to inspect the cluster and load
# balance across all nodes
es = Elasticsearch(["seed1", "seed2"], sniff_on_start=True)

# you can also sniff periodically and/or after failure:
es = Elasticsearch(["seed1", "seed2"],
                   sniff_on_start=True,
                   sniff_on_connection_fail=True,
                   sniffer_timeout=60)

Thread Safety

The client is thread safe and can be used in a multithreaded environment. Best practice is to create a single global instance of the client and use it throughout your application. If your application is long-running, consider turning on Sniffing to make sure the client is up to date on the cluster location.

By default we allow urllib3 to open up to 10 connections to each node; if your application calls for more parallelism, use the maxsize parameter to raise the limit:

# allow up to 25 connections to each node
es = Elasticsearch(["host1", "host2"], maxsize=25)

Note: since the client uses persistent connections throughout, it doesn't tolerate fork very well. If your application uses multiple processes, make sure you create a fresh client after the call to fork. Note that Python's multiprocessing module uses fork to create new processes on POSIX systems.


SSL and Authentication

You can configure the client to use SSL for connecting to your elasticsearch cluster, including certificate verification and HTTP auth:

from elasticsearch import Elasticsearch

# you can use RFC-1738 to specify the url
es = Elasticsearch(['https://user:secret@localhost:443'])

# ... or specify common parameters as kwargs
es = Elasticsearch(
    ['localhost', 'otherhost'],
    http_auth=('user', 'secret'),
    scheme="https",
    port=443,
)

# SSL client authentication using client_cert and client_key
from ssl import create_default_context

context = create_default_context(cafile="path/to/cert.pem")
es = Elasticsearch(
    ['localhost', 'otherhost'],
    http_auth=('user', 'secret'),
    scheme="https",
    port=443,
    ssl_context=context,
)

Warning: elasticsearch-py doesn't ship with a default set of root certificates. To have working SSL certificate validation you need to either specify your own as cafile, capath, or cadata, or install certifi, which will be picked up automatically.

See the Urllib3HttpConnection class for a detailed description of the options.

Logging

elasticsearch-py uses the standard logging library from Python to define two loggers: elasticsearch and elasticsearch.trace. elasticsearch is used by the client to log standard activity, depending on the log level. elasticsearch.trace can be used to log requests to the server in the form of curl commands using pretty-printed json that can then be executed from the command line. Because it is designed to be shared (for example to demonstrate an issue) it also just uses localhost:9200 as the address instead of the actual address of the host. If the trace logger has not been configured already it is set to propagate=False, so it needs to be activated separately.
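A sketch of activating those two loggers with the standard library alone (the handler and levels here are choices for illustration, not defaults of elasticsearch-py):

```python
import logging

# Standard-activity logger used by the client.
logging.getLogger("elasticsearch").setLevel(logging.WARNING)

# Trace logger: emits each request as a reproducible curl command.
# elasticsearch-py sets propagate=False on it, so it needs its own
# handler; we set both explicitly here since the library isn't imported.
trace = logging.getLogger("elasticsearch.trace")
trace.setLevel(logging.DEBUG)
trace.propagate = False
trace.addHandler(logging.StreamHandler())
```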

Environment Considerations

When using the client there are several limitations of your environment that could come into play.

When using an HTTP load balancer you cannot use the Sniffing functionality — the cluster would supply the client with IP addresses to directly connect to the cluster, circumventing the load balancer. Depending on your configuration this might be something you don't want, or it might break completely.

In some environments (notably on Google App Engine) your HTTP requests might be restricted so that GET requests won't accept a body. In that case use the send_get_body_as parameter of Transport to send all bodies via POST:

from elasticsearch import Elasticsearch

es = Elasticsearch(send_get_body_as='POST')

Compression

When using capacity-constrained (low-throughput) networks it can be handy to enable compression, especially when doing bulk loads or inserting large documents:

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts, http_compress=True)


Running on AWS with IAM

If you want to use this client with IAM-based authentication on AWS you can use the requests-aws4auth package:

from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

host = 'YOURHOST.us-east-1.es.amazonaws.com'
awsauth = AWS4Auth(YOUR_ACCESS_KEY, YOUR_SECRET_KEY, REGION, 'es')

es = Elasticsearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)
print(es.info())

Customization

By default, JSONSerializer is used to encode all outgoing requests. However, you can implement your own custom serializer:

from elasticsearch.serializer import JSONSerializer

class SetEncoder(JSONSerializer):
    def default(self, obj):
        if isinstance(obj, set):
            return list(obj)
        if isinstance(obj, Something):
            return 'CustomSomethingRepresentation'
        return JSONSerializer.default(self, obj)

es = Elasticsearch(serializer=SetEncoder())


Contents


  • API Documentation
    • Global options
    • Elasticsearch
    • Indices
    • Ingest
    • Cluster
    • Nodes
    • Cat
    • Snapshot
    • Tasks
  • X-Pack APIs
    • Info
    • Graph Explore
    • Licensing API
    • Machine Learning APIs
    • Security APIs
    • Watcher APIs
    • Migration APIs
  • Exceptions
  • Connection Layer API
    • Transport
    • Connection Pool
    • Connection Selector
    • Urllib3HttpConnection (default connection_class)
  • Transport classes
    • Connection
    • Urllib3HttpConnection
    • RequestsHttpConnection
  • Helpers
    • Bulk helpers
    • Scan
    • Reindex

