

Using Elasticsearch from Python to search Wikipedia data (detailed walkthrough)


1 Download the Wikipedia dump

$wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

2 Extract the text with wikiextractor

$git clone https://github.com/attardi/wikiextractor
$python wikiextractor/WikiExtractor.py INPUT_FILE -o OUTPUT_PATH --json
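Note that wikiextractor writes two different formats: without --json each article is wrapped in <doc id=... url=... title=...> ... </doc> blocks (which is what the cleaning code in step 3 expects), while with --json each line of an extracted file is a JSON object with fields such as id, url, title and text. If you keep the --json flag, here is a small sketch for peeking at the output; the extracted/AA path mirrors the paths used later in this post.

import json
import os

def iter_articles(extracted_dir):
    # Walk the wikiextractor output directory (extracted/AA, AB, ...) and yield one article dict per JSON line
    for root, _, files in os.walk(extracted_dir):
        for name in sorted(files):
            with open(os.path.join(root, name), "r", encoding="utf-8") as f:
                for line in f:
                    if line.strip():
                        yield json.loads(line)

if __name__ == "__main__":
    for i, article in enumerate(iter_articles("extracted/AA")):
        print(article.get("title"))
        if i >= 4:  # just peek at the first few articles
            break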

3 Clean the Wikipedia data as follows

  • First, use the code below to strip the <doc> markers and blank lines, keeping only the article text
import os
import re

def deal_enwiki(read_path, write_path):
    # Strip the <doc> markers and blank lines, keep only the article text
    regex_str = "[^<doc.*>$]|[^</doc>$]"
    file = open(read_path, "r", encoding="utf-8")
    output = open(write_path, "w+", encoding="utf-8")
    content_line = file.readline()
    article_contents = ""
    while content_line:
        match_obj = re.match(regex_str, content_line)
        if content_line.isspace():
            content_line = file.readline()
            continue
        else:
            content_line = content_line.strip("\n")
            if len(content_line) > 0:
                if match_obj:
                    article_contents += content_line + " "
                else:
                    if len(article_contents) > 0:
                        output.write(article_contents + "\n")
                        article_contents = ""
                    # output.write(content_line + "\n")
        content_line = file.readline()
    file.close()
    output.close()

def generate_corpus():
    wiki_path = "/Wiki_processing/wikiextractor/extracted/AA"
    save_path = "/Wiki_processing/wikiextractor/extracted/AA"
    for i in range(14):
        if i < 10:
            file_path = os.path.join(wiki_path, str("wiki_0%s" % str(i)))
        else:
            file_path = os.path.join(wiki_path, str("wiki_%s" % str(i)))
        deal_enwiki(file_path, os.path.join(save_path, "wiki_corpus%s" % str(i)))
        print("Finished generating file " + str(i))
  • Merge all the cleaned files into a single file, wiki_corpus_not_divide
def merge_corpus():
    # Merge the cleaned corpus files into a single file
    output = open("/Wiki_processing/wikiextractor/extracted/AA/wiki_corpus_not_divide", "w", encoding="utf-8")
    input = "/Wiki_processing/wikiextractor/extracted/AA"
    for i in range(14):
        file_path = os.path.join(input, str("wiki_corpus%s" % str(i)))
        file = open(file_path, "r", encoding="utf-8")
        line = file.readline()
        while line:
            output.writelines(line)
            line = file.readline()
        file.close()
        print("Finished merging file " + str(i))
    output.close()
  • Split wiki_corpus_not_divide so that each line contains exactly one sentence
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

def divide_sentence(read_path, write_path):
    # Split each article into sentences, one sentence per line
    punkt_param = PunktParameters()
    abbreviation = ['i.e', 'e.g', "U.S"]
    punkt_param.abbrev_types = set(abbreviation)
    tokenizer = PunktSentenceTokenizer(punkt_param)
    file = open(read_path, "r", encoding="utf-8")
    output = open(write_path, "w+", encoding="utf-8")
    content_line = file.readline()
    while content_line:
        sentence_list = tokenizer.tokenize(content_line)
        for sentence in sentence_list:
            output.write(sentence + "\n")
        print("finish divide")
        content_line = file.readline()
    file.close()
    output.close()

if __name__ == "__main__":
    # generate_corpus()
    # print("generate_corpus is finished")
    # merge_corpus()
    # print("merge_corpus is finished")
    divide_sentence("/Wiki_processing/wikiextractor/extracted/AA/wiki_corpus_not_divide",
                    "/Wiki_processing/wikiextractor/extracted/AA/wiki_corpus")

4 Index and search the data with Elasticsearch

  • Go into the Elasticsearch installation directory and start the node as a daemon
  • $cd elasticsearch-5.0.0/
    $./bin/elasticsearch -d
  • Open another terminal and check that the node is up
  • $curl 'localhost:9200/_cat/health?v'

If the status column shows green, the node is healthy and ready to use.
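The same check can be done from Python with the official elasticsearch client. This is only a small sketch; it assumes a client version matching the 5.x server is installed (e.g. pip install "elasticsearch>=5,<6") and that the node is listening on the default localhost:9200.

from elasticsearch import Elasticsearch

es = Elasticsearch()  # defaults to localhost:9200

if es.ping():                   # True if the node answers
    print(es.cluster.health())  # roughly the same information as _cat/health
else:
    print("Elasticsearch is not reachable on localhost:9200")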

  • Create the index (mainCreateIndex)
def deleteInices(my_index):
    if True and es.indices.exists(my_index):  # change to True only after confirming you really want to delete
        print("delete is complete")
        es.indices.delete(index=my_index)

def createIndex(my_index, my_doc):
    # index settings
    settings = \
        {"mappings": {
            my_doc: {
                "properties": {
                    "my_id": {"type": "integer"},
                    "my_word": {"type": "text"}
                }
            }
        }}
    # create index
    es.indices.create(index=my_index, ignore=400, body=settings)
    print("creating index succeeded!")

# NOTE: the es client and mainCreateIndex() are not shown in the original post;
# see the sketch right after this block.
if __name__ == "__main__":
    mainCreateIndex()
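The post never shows how the es client is built or what mainCreateIndex() does. A minimal sketch, assuming mainCreateIndex() simply drops any existing index and recreates it with the mapping above; the index and type names follow the rest of the post, everything else is illustrative.

from elasticsearch import Elasticsearch

es = Elasticsearch()  # client used by deleteInices/createIndex above; defaults to localhost:9200

def mainCreateIndex():
    # Recreate the index from scratch: delete the old one (if any), then create it
    my_index = "my_index"
    my_doc = "my_doc"
    deleteInices(my_index)
    createIndex(my_index, my_doc)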

After running the .py file above, check from a terminal whether the index was created successfully:

    $curl 'localhost:9200/_cat/indices?v'


If my_index shows up in the output above, the index was created successfully.

  • Insert the Wikipedia text data (mainInsert)
from tqdm import tqdm
from elasticsearch import helpers

def insertData(words, my_index, my_doc, one_bulk):
    # Insert the data; one_bulk is the number of documents per bulk request
    body = []
    body_count = 0  # how many documents are currently in the bulk body
    # The last bulk may not reach one_bulk documents, but it still has to be inserted
    print("need to insert %d" % len(words))
    pbar = tqdm(total=len(words))
    for id, word in words:
        data1 = {"my_id": id,      # id
                 "my_word": word}  # text
        every_body = \
            {"_index": my_index,   # index name
             "_type": my_doc,      # document type
             "_source": data1}     # document body
        if body_count < one_bulk:
            body.append(every_body)
            body_count += 1
        else:
            helpers.bulk(es, body)  # use bulk; inserting one document at a time is far too slow
            pbar.update(one_bulk)
            body_count = 0
            body = []
            body.append(every_body)
            body_count += 1
    if len(body) > 0:
        # If anything is left in body, insert it as well (the last, partial bulk)
        helpers.bulk(es, body)
        # pbar.update(len(body))
        print('done2')
    pbar.close()
    # res = es.index(index=my_index, doc_type=my_doc, id=my_key_id, body=data1)  # single-document insert
    print("insert data completed!")

def mainInsert():
    # Insert the data once the index exists
    my_index = "my_index"
    my_doc = "my_doc"
    # NOTE: getAllWords() is not shown in the original post; see the sketch right after this block
    words = getAllWords(path="/Wiki_processing/wikiextractor/extracted/AA/wikipedia_sentences.txt")
    insertData(words, my_index, my_doc, one_bulk=5000)

if __name__ == "__main__":
    mainInsert()
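getAllWords() is referenced above but never defined in the post. A plausible sketch, assuming it just reads the one-sentence-per-line corpus produced in step 3 and returns (id, sentence) pairs; the running-counter id scheme is an assumption.

def getAllWords(path):
    # Hypothetical helper: read one sentence per line and pair it with a running id
    words = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            sentence = line.strip()
            if sentence:
                words.append((i, sentence))
    return words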

After running the .py file above, check the indices again from a terminal; the store.size column tells you whether the data was actually inserted:

    $curl 'localhost:9200/_cat/indices?v'


You can see that store.size has grown noticeably, which means the data was inserted successfully.
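The document count can also be checked from Python with the same es client (a one-line sketch; the index name follows the rest of the post):

print(es.count(index="my_index")["count"])  # number of documents indexed so far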

  • Run a search with your own query (mainSearch)
def mainSearch():
    # Run a search once the data has been inserted
    my_index = "my_index"
    my_doc = "my_doc"
    keywords1 = "Vaccinations needed come Doha"
    # NOTE: keywordSearch() is not shown in the original post; see the sketch right after this block
    keywordSearch(keywords1, my_index, my_doc)

if __name__ == "__main__":
    mainSearch()
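keywordSearch() is also left out of the post. A minimal sketch, assuming it runs a plain match query against the my_word field and prints the top hits; the result size of 10 and the printed fields are illustrative choices.

def keywordSearch(keywords, my_index, my_doc):
    # Hypothetical helper: full-text match query on the my_word field
    query = {
        "query": {
            "match": {"my_word": keywords}
        },
        "size": 10
    }
    res = es.search(index=my_index, doc_type=my_doc, body=query)
    for hit in res["hits"]["hits"]:
        print(hit["_score"], hit["_source"]["my_word"])
    return res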

Those are the detailed steps for indexing and searching Wikipedia data in Elasticsearch from Python. If anything is unclear, feel free to leave a comment.
