
05. Elasticsearch: index settings and modules summary

Published: 2024/2/28


Contents

- 1. Introduction
- 2. General index settings
  - 1. Static settings
    - 1. index.number_of_shards
    - 2. index.shard.check_on_startup
    - 3. index.codec
    - 4. index.routing_partition_size: a setting for custom routing
    - 5. index.load_fixed_bitset_filters_eagerly
  - 2. Dynamic settings
- 3. Index modules
  - 1. Analysis settings
  - 2. Index shard allocation settings
    - 1. Shard allocation filtering: controlling which nodes a shard may be allocated to
      - 1. Typical use of shard allocation filters
      - 2. Available index.routing.allocation settings
      - 3. Supported built-in attributes
    - 2. Delayed allocation
      - 1. What the master does after a node leaves
      - 2. Configuring delayed allocation
      - 3. Cancelling shard relocation once the delay has expired
      - 4. Viewing unassigned shards
      - 5. Recovery order
    - 3. Total shards per node: the maximum number of shards of one index on a single node
  - 3. Mapping settings
  - 4. Merge settings: controlling how a shard's background merge process works
  - 5. Similarity settings: custom similarity models for scoring search results
    - 1. Defining a custom similarity
    - 2. built-in similarities
      - 1. BM25 similarity
      - 2. DFR similarity
      - 3. DFI similarity
      - 4. IB similarity
      - 5. LM Dirichlet similarity
      - 6. LM Jelinek Mercer similarity
      - 7. Scripted similarity
  - 6. Slow log settings: logging slow queries
    - 1. slow search
    - 2. slow index
  - 7. Store settings: the file system type used to store shard data
    - 1. Store types and settings
    - 2. Preloading data into the file system cache
  - 8. Translog: controlling the transaction log and background flush operations
    - 1. How an ES shard relates to Lucene
    - 2. Transactional behaviour of a Lucene index
    - 3. How ES handles inserts, updates and deletes
    - 4. Translog settings
  - 9. Index sorting
    - 1. Introduction to index sorting
    - 2. Available properties
    - 3. What index sorting is mainly used for
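1. Introduction

The index modules documentation describes the individual aspects of an index; this article summarizes and supplements it.
It covers the following:

Common index settings: the index-level settings of each index, divided into static and dynamic. These settings are fairly scattered, mostly do not belong to a separate functional module, and some are direct properties of the index itself (its name, its number of shards and so on), so they are collected here.

The functional modules of an index:

Analysis settings

Index shard allocation settings

Mapping settings

Merge settings: controlling how a shard's background merge process works

Similarity settings: custom similarity models for scoring search results

Slow log settings: logging slow queries

Store settings: the file system type used to store shard data

Translog: controlling the transaction log and background flush operations

X-Pack index-related settings

Index lifecycle management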

2. General index settings

Index-level settings fall into two groups. Static settings are normally specified when the index is created and cannot be changed afterwards (or can only be changed on a closed index).
Dynamic settings can be changed directly on a live index through the API.
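To make "changed through the API" concrete, here is a minimal sketch that builds (but deliberately does not send) an update-settings request with the Python standard library. The endpoint shape `PUT /<index>/_settings` is the one used throughout this article; the helper name `build_settings_request`, the index name `my_index` and the base URL `http://localhost:9200` are assumptions made for illustration.

```python
import json
import urllib.request


def build_settings_request(index, settings, base_url="http://localhost:9200"):
    """Build a PUT <index>/_settings request without sending it.

    base_url is an assumed local node address; adjust it for a real cluster.
    """
    body = json.dumps(settings).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# number_of_replicas is a dynamic setting, so it may be changed on a live index.
req = build_settings_request("my_index", {"index": {"number_of_replicas": 2}})
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen(req)` would actually send it; that step is left out so the sketch runs without a cluster.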

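1. Static settings

The static settings are:

index.number_of_shards: the number of primary shards of the index

index.shard.check_on_startup: which integrity checks are run on shards when ES starts

index.codec: the codec (compression method) used to store index data

index.routing_partition_size: a setting for custom routing

index.load_fixed_bitset_filters_eagerly: whether cached filters are preloaded for nested queries

1. index.number_of_shards

The number of primary shards of the index. It defaults to 1 and cannot be changed afterwards, not even on a closed index. Each index is limited to 1024 shards.
The limit is a safety measure against accidentally created indices with too many shards, which could consume enough resources to destabilize the cluster.
The limit can be changed with a Java system property on every node of the cluster, for example: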

export ES_JAVA_OPTS="-Des.index.max_number_of_shards=128"

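2. index.shard.check_on_startup

Which integrity checks ES runs on a shard at startup:

false: (default) no integrity check is performed
checksum: checks for physical corruption only
true: checks for both physical and logical corruption. This can be very expensive in terms of CPU and memory and may take a long time.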

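3. index.codec

The codec (compression method) used to store index data.
The default compresses stored data with LZ4. Setting

"index.codec" : "best_compression"

switches to DEFLATE, which gives a better compression ratio but costs more CPU, so it also slows down writes to some degree.
If you update this setting, the new storage format takes effect as segments are merged.

4. index.routing_partition_size: a setting for custom routing

The number of shards a custom routing value can be spread over. In practice this setting rarely matters much; it is best understood together with the _routing metadata field.
The value must be less than index.number_of_shards (the two may only be equal when index.number_of_shards is 1). The default is 1.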
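As a rough illustration of what the partition size does, the sketch below follows the simplified routing formula from the `_routing` field documentation, `shard = (hash(_routing) + hash(_id) % routing_partition_size) % number_of_shards`. It is only a model: Elasticsearch hashes with Murmur3 and newer versions add a routing factor, whereas `zlib.crc32` is a stand-in chosen here, and `shard_for` is a hypothetical helper name.

```python
import zlib


def _hash(value):
    # Stand-in hash for illustration only; Elasticsearch uses Murmur3 internally.
    return zlib.crc32(value.encode("utf-8"))


def shard_for(doc_id, routing, num_primary_shards, routing_partition_size=1):
    """Simplified model of how a document is routed to a primary shard."""
    offset = _hash(doc_id) % routing_partition_size
    return (_hash(routing) + offset) % num_primary_shards


# All documents share the routing value "user-42". With a partition size of 3
# they can land on at most 3 of the 8 shards instead of exactly one.
shards = {shard_for(f"doc-{i}", "user-42", 8, 3) for i in range(1000)}
print(sorted(shards))
```

With the default partition size of 1 the offset is always 0, so every document with the same routing value ends up on a single shard.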

5. index.load_fixed_bitset_filters_eagerly

Whether cached filters are preloaded for nested queries. This is probably easier to understand after reading about nested filters.

index.load_fixed_bitset_filters_eagerly
Indicates whether cached filters are pre-loaded for nested queries. Possible values are true (default) and false.

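2. Dynamic settings

Dynamic settings are those that can be changed on the fly through the API. There are more dynamic settings than static ones.

index.number_of_replicas: the number of replicas of each primary shard. Defaults to 1.

index.auto_expand_replicas: automatically expands the number of replicas based on the number of nodes in the cluster. Set it to a dash-delimited lower and upper bound (for example 0-5), or use all as the upper bound (for example 0-all). Defaults to false. In my view this is not a great way to manage replicas.

index.search.idle.after: how long a shard can go without receiving a search request before it is considered search idle (default 30s).

index.refresh_interval: how often a refresh is performed. A refresh makes recent changes visible to subsequent searches. Defaults to 1s; -1 disables refresh. If the setting is not set explicitly, a shard that has received no search request for index.search.idle.after stops its background refreshes until a search request arrives; a search that hits such an idle shard waits for the pending refresh to complete before it runs. This behaviour is meant to optimize bulk indexing. To opt out of it, set the value to 1s explicitly.

index.max_result_window: the maximum value of from + size for paged searches against this index. Defaults to 10000, which guards against excessive memory use. For deep paging, use the scroll or search after API instead.

index.max_inner_result_window: similar to the previous setting, but it limits from + size for inner hits and top hits aggregations. Defaults to 100.

index.max_rescore_window: the maximum window_size of a rescore request against this index. Defaults to index.max_result_window, that is, 10000. Search requests take heap memory and time proportional to max(window_size, from + size), and this setting caps that memory.

index.max_docvalue_fields_search: the maximum number of docvalue_fields allowed in a query. Defaults to 100. Doc-value fields are costly because they may incur a per-field, per-document seek.

index.max_script_fields: the maximum number of script_fields allowed in a query. Defaults to 32.

index.max_ngram_diff: the maximum allowed difference between min_gram and max_gram for NGramTokenizer and NGramTokenFilter. Defaults to 1.

index.max_shingle_diff: the maximum allowed difference between max_shingle_size and min_shingle_size for ShingleTokenFilter. Defaults to 3.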

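index.blocks.read_only: set to true to make the index and the index metadata read-only, false to allow writes and metadata changes. The naming is a little confusing: here true means that only reads are allowed, whereas for blocks.read and blocks.write below true disables the named operation.

index.blocks.read_only_allow_delete: same as index.blocks.read_only, but the index can still be deleted to free up resources.

index.blocks.read: set to true to disable read operations against the index.

index.blocks.write: set to true to disable data write operations against the index. Unlike read_only, this setting does not affect metadata. For example, you can close an index that has a write block, but you cannot close an index that has a read_only block.

index.blocks.metadata: set to true to disable reads and writes of the index metadata.

index.max_refresh_listeners: the maximum number of refresh listeners available on each shard of the index. These listeners are used to implement refresh=wait_for.

index.analyze.max_token_count: the maximum number of tokens the _analyze API can produce. Defaults to 10000.

index.highlight.max_analyzed_offset: the maximum number of characters analyzed for a highlight request. It only applies to fields that were indexed without term offsets and without term vectors.

index.max_terms_count: the maximum number of terms that can be used in a terms query. Defaults to 65536.

index.max_regex_length: the maximum length of the regular expression in a regexp query. Defaults to 1000.

index.routing.allocation.enable: controls shard allocation for this index. all (default): allows allocation of all shards; primaries: allows allocation of primary shards only; new_primaries: allows allocation of newly created primary shards only; none: allows no shard allocation.

index.routing.rebalance.enable: controls shard rebalancing for this index. all (default): allows rebalancing of all shards; primaries: allows rebalancing of primary shards only; replicas: allows rebalancing of replica shards only; none: allows no rebalancing. I still need to verify this one hands-on.

index.gc_deletes: how long the version number of a deleted document remains available for further versioned operations. Defaults to 60s.

index.default_pipeline: the default ingest pipeline for the index. None is set by default.

index.final_pipeline: the final ingest pipeline for the index. It always runs last, after the request pipeline (if any) and the default pipeline (if any).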

3. Index modules

The other module-level features of an index are:

Analysis settings

Index shard allocation settings

Mapping settings

Merge settings: controlling how a shard's background merge process works

Similarity settings: custom similarity models for scoring search results

Slow log settings: logging slow queries

Store settings: the file system type used to store shard data

Translog: controlling the transaction log and background flush operations

1. Analysis settings

This is not repeated here; analysis was covered in an earlier article.
The module defines analyzers, tokenizers, token filters and character filters.

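2. Index shard allocation settings

The shard allocation module controls how shards are allocated to nodes. It provides:

Shard allocation filtering: controls which nodes a shard may be allocated to

Delayed allocation: when a node leaves the cluster its shards become unassigned; this setting delays the reallocation of those unassigned shards.

Total shards per node: the maximum number of shards of one index on a single node

1. Shard allocation filtering: controlling which nodes a shard may be allocated to

You can use shard allocation filters to control which nodes the shards of an index are allocated to. These per-index filters are applied together with cluster-wide allocation filtering and allocation awareness.
Shard allocation filters can be based on custom node attributes or on the built-in _name, _host_ip, _publish_ip, _ip and _host attributes. Index lifecycle management uses filters to decide how shards are reallocated.
The filter settings (index.routing.allocation.* per index, cluster.routing.allocation.* cluster-wide) are dynamic, so the shards of a live index can be moved from their current nodes to other nodes. Such moves must not break other constraints; for example, a primary and its replica are never placed on the same node.

For example, you can define custom node attributes that describe each node's performance characteristics and then use shard allocation filtering to route shards to nodes with the appropriate hardware. A typical use is hot-cold separation in a logging system with daily indices. The machines in the cluster are split into cold and hot nodes. Cold nodes have large disks, possibly plain spinning disks: they are cheap and suit low write rates, large volumes and modest read-speed requirements. Hot nodes have fast disks, usually SSDs, with less capacity, and suit heavy writes. Yesterday's log index is moved to the cold nodes while today's new index is allocated to the hot nodes. This serves the heavy write load and the large storage requirement at the same time, and cuts cost considerably.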

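1. Typical use of shard allocation filters

1. Give each node an attribute. Suppose we label every node with a size attribute whose value is small, medium or big. Either set it in elasticsearch.yml:

node.attr.size: medium

or pass it on the command line when starting the node:

./bin/elasticsearch -Enode.attr.size=medium

2. Add a routing allocation filter to the index:

PUT test/_settings
{"index.routing.allocation.include.size": "big,medium"}

2. Available index.routing.allocation settings

index.routing.allocation.include.{attribute}
The node's {attribute} must match at least one of the comma-separated values.

index.routing.allocation.require.{attribute}
The node's {attribute} must match all of the comma-separated values.

index.routing.allocation.exclude.{attribute}
The node's {attribute} must match none of the comma-separated values.

When several filters are set, they must hold together: every require condition, none of the exclude conditions, and at least one of the include conditions. To pin an index to nodes that match two attributes at once, use require:

PUT test/_settings
{"index.routing.allocation.require.size": "big", "index.routing.allocation.require.rack": "rack1"}

This moves the test index to nodes whose rack is rack1 and whose size is big. With include on both attributes, a node matching either one of the two conditions would already qualify.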
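The include, require and exclude semantics can be modelled roughly as follows. This is a simplified sketch, not the Elasticsearch implementation: `node_allowed` and its helpers are hypothetical names, node attributes are plain dictionaries, and `fnmatch` stands in for the wildcard matching Elasticsearch applies to attribute values.

```python
from fnmatch import fnmatchcase


def _values(spec):
    # Split a comma-separated setting value such as "big,medium".
    return [v.strip() for v in spec.split(",") if v.strip()]


def _any_match(node_attrs, filters):
    # OR semantics: some attribute matches at least one of its listed values.
    return any(
        attr in node_attrs
        and any(fnmatchcase(node_attrs[attr], v) for v in _values(spec))
        for attr, spec in filters.items()
    )


def _all_match(node_attrs, filters):
    # AND semantics: every attribute matches every one of its listed values.
    return all(
        attr in node_attrs
        and all(fnmatchcase(node_attrs[attr], v) for v in _values(spec))
        for attr, spec in filters.items()
    )


def node_allowed(node_attrs, include=None, require=None, exclude=None):
    """Return True if a shard of the index may be placed on this node."""
    if require and not _all_match(node_attrs, require):
        return False
    if exclude and _any_match(node_attrs, exclude):
        return False
    if include and not _any_match(node_attrs, include):
        return False
    return True


node = {"size": "big", "rack": "rack2"}
filters = {"size": "big", "rack": "rack1"}
print(node_allowed(node, include=filters))  # one include condition is enough
print(node_allowed(node, require=filters))  # rack2 does not satisfy rack1
```

The two calls at the end show why the choice between include and require matters when more than one attribute is involved.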

3. Supported built-in attributes

_name: Match nodes by node name

_host_ip: Match nodes by host IP address (IP associated with hostname)

_publish_ip: Match nodes by publish IP address

_ip: Match either _host_ip or _publish_ip

_host: Match nodes by hostname

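PUT test/_settings
{"index.routing.allocation.include._ip": "192.168.2.*"}

Attribute values may contain wildcards, as in this example. I doubt the built-in attributes are used much in practice: IP addresses change when a machine is taken offline and redeployed, and _name is unique per node, so it does not lend itself to grouping nodes.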

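2. Delayed allocation

When a node leaves the cluster its shards become unassigned; this setting delays the reallocation of those unassigned shards.

1. What the master does after a node leaves

When node A leaves the cluster, the master normally reacts as follows:

It promotes a replica held by another node to primary for every primary shard that was on node A (node A is gone, and its primaries with it)

It allocates new replica shards to replace the missing replicas, provided there are enough nodes

It rebalances the remaining nodes so that shards are spread evenly across the cluster

These actions protect the cluster against data loss by making sure every shard is fully replicated again as quickly as possible. Even though ES throttles the number of concurrent recoveries at both the node level and the cluster level, this still puts considerable extra load on the cluster. If a node drops out and then comes back and rejoins within a few minutes, rushing into shard recovery is a poor trade.

Imagine the following scenario:

node5 loses network connectivity to the rest of the cluster and is considered dead

The master promotes a replica on another node to primary for each primary that was on node5

The master allocates new replicas on other nodes for the replicas that went missing

Each new replica copies its data from the primary (over the network)

Rebalancing: shards move between nodes to even out the cluster

node5 returns after a few minutes

The master rebalances again in order to allocate some shards to node5

If the master had done nothing after the second step and simply waited a few minutes, the missing shards could have been re-allocated to node5 once it returned, with far less data copied over the network.
Recovery is even faster for idle shards (shards not receiving indexing requests) that have been automatically sync-flushed.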

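2. Configuring delayed allocation

Delayed allocation is configured per index with the index.unassigned.node_left.delayed_timeout setting:

PUT _all/_settings
{"settings": {"index.unassigned.node_left.delayed_timeout": "5m"}}

With delayed allocation configured, the scenario above becomes:

node5 loses network connectivity to the rest of the cluster and is considered dead

The master promotes a replica on another node to primary for each primary that was on node5

The master logs a message that allocation of the unassigned shards has been delayed, and for how long

The cluster stays yellow (because there are unassigned shards)

node5 returns after a few minutes, before the timeout expires

The missing replicas are re-allocated to node5 (sync-flushed shards recover almost immediately)

This setting only affects shards that went missing because a node left. It has no effect on other cases, such as shards of a newly created index that have never been assigned.
Delayed allocation does not apply after a full cluster restart: shards that were missing because of a departed node are recovered without the delay.
If the master fails over, the delay time that has already elapsed is forgotten and the timeout starts over.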

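3. Cancelling shard relocation once the delay has expired

When the delay expires, the master assigns the missing shards to other nodes and recovery starts. If the missing node then rejoins the cluster and its shards still have the same sync-id as the primary, the shard relocation is cancelled and the node's own synced copy is used for recovery instead. This is why ES sets the default delay to just 1 minute.

If a node is being removed permanently, set the delay to 0 so the missing shards are allocated immediately:

PUT _all/_settings
{"settings": {"index.unassigned.node_left.delayed_timeout": "0"}}

Once the missing shards have started to recover, set the value back.

4. Viewing unassigned shards

With many indices it is hard to see which shards are affected. The cluster health API reports how many shards currently have their allocation delayed, in the delayed_unassigned_shards field:

GET _cluster/health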

5. Recovery order

Unallocated shards are recovered in priority order, determined by:

the optional index.priority setting (higher before lower)

the index creation date (higher before lower)

the index name (higher before lower)

So by default, newer indices are recovered before older ones.
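The three ordering rules amount to a descending sort on a composite key. The sketch below shows this with plain dictionaries; the field names and sample indices are made up for illustration, and an index without an explicit priority is given 1, which is the documented default of `index.priority`.

```python
def recovery_order(indices):
    """Sort indices the way recovery is prioritised: higher before lower on
    index.priority, then creation date, then name."""
    return sorted(
        indices,
        key=lambda ix: (ix.get("priority", 1), ix["creation_date"], ix["name"]),
        reverse=True,
    )


# Hypothetical indices; creation_date is any increasing timestamp.
indices = [
    {"name": "index_1", "creation_date": 100},
    {"name": "index_2", "creation_date": 200},
    {"name": "index_3", "creation_date": 300, "priority": 10},
    {"name": "index_4", "creation_date": 400, "priority": 5},
]
print([ix["name"] for ix in recovery_order(indices)])
```

Here index_3 comes first because of its priority, followed by index_4; the two indices without an explicit priority follow, newest first.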

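3. Total shards per node: the maximum number of shards of one index on a single node

index.routing.allocation.total_shards_per_node
The maximum number of shards of a single index that may be allocated to one node

cluster.routing.allocation.total_shards_per_node
The maximum total number of shards, cluster-wide, that may be allocated to one node

Use both settings with care: they are hard limits, and a shard that cannot be placed without exceeding them stays unassigned.

3. Mapping settings

This is not repeated here either; mappings were covered at length earlier.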

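4. Merge settings: controlling how a shard's background merge process works

Every shard is a Lucene index, and every Lucene index consists of many segments. Segments are the basic unit of data storage. A background process periodically merges small segments into larger ones and, while doing so, expunges deleted documents.
The process is throttled automatically according to current hardware resource usage, for example taking the current search load into account.

This behaviour is implemented by the ConcurrentMergeScheduler. Merges run in several independent threads, and the maximum number of threads the ConcurrentMergeScheduler may use is configurable.

index.merge.scheduler.max_thread_count: the maximum number of threads that may be used

If it is not set, the default is

Math.max(1, Math.min(4, Runtime.getRuntime().availableProcessors() / 2))

This works well on SSDs. If the index is on spinning disks, lowering it to 1 works better.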
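To see what the default works out to on a given machine, here is the same formula as a small Python sketch. The function name `default_merge_threads` is an assumption for illustration; Java's integer division on positive numbers corresponds to `//` in Python.

```python
import os


def default_merge_threads(available_processors):
    """Python equivalent of max(1, min(4, availableProcessors / 2))."""
    return max(1, min(4, available_processors // 2))


for cpus in (1, 2, 4, 8, 32):
    print(cpus, "->", default_merge_threads(cpus))

# os.cpu_count() can return None, hence the fallback to 1.
print("this machine:", default_merge_threads(os.cpu_count() or 1))
```

The result never drops below 1 and never exceeds 4, however many processors the node has.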

5. Similarity settings: custom similarity models for scoring search results

A similarity (also called a scoring or ranking model) defines how matching documents are scored. Each field can define its own similarity.
Configuring a custom similarity is an expert-level operation; normally the built-in similarities of ES should be sufficient.
The built-in similarities are:

BM25 similarity

DFR similarity

DFI similarity

IB similarity

LM Dirichlet similarity

LM Jelinek Mercer similarity

Scripted similarity

Note that a field that does not explicitly name a similarity uses the similarity called default.

1. Defining a custom similarity

A similarity can be defined when the index is created and then referenced from the mapping:

PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "my_similarity": {
          "type": "DFR",
          "basic_model": "g",
          "after_effect": "l",
          "normalization": "h2",
          "normalization.h2.c": "3.0"
        }
      }
    }
  }
}

PUT /index/_mapping
{
  "properties": {
    "title": { "type": "text", "similarity": "my_similarity" }
  }
}

2. built-in similarities

1. BM25 similarity

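Available options:
k1: default 1.2. Controls non-linear term frequency normalization (saturation).
b: default 0.75. Controls document length normalization.
discount_overlaps: whether overlap tokens (tokens with a position increment of 0) are ignored when computing norms. Defaults to true.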
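To show where k1 and b enter the score, the following sketch implements the textbook BM25 formula for a single term. It is an assumption-laden illustration rather than Lucene's exact code: recent Lucene versions, for instance, drop the constant (k1 + 1) factor, and the function and parameter names below are made up.

```python
import math


def bm25_idf(doc_count, doc_freq):
    # Inverse document frequency: rarer terms weigh more.
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one document."""
    # b decides how strongly the document length scales the saturation point k1.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return bm25_idf(doc_count, doc_freq) * tf * (k1 + 1) / (tf + norm)


short_doc = bm25_term_score(tf=2, doc_len=50, avg_doc_len=100, doc_count=1000, doc_freq=10)
long_doc = bm25_term_score(tf=2, doc_len=400, avg_doc_len=100, doc_count=1000, doc_freq=10)
print(short_doc > long_doc)  # same tf, but the longer document scores lower
```

Setting b to 0 switches length normalization off entirely, while a larger k1 lets repeated occurrences of a term keep adding to the score for longer before it saturates.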

2. DFR similarity

3. DFI similarity

4. IB similarity

5. LM Dirichlet similarity

6. LM Jelinek Mercer similarity

7. Scripted similarity

6. Slow log settings: logging slow queries

Like the MySQL slow log, this records slow searches and slow writes:

slow search

slow index

1. slow search

Thresholds can be set separately for the query phase and the fetch phase:

PUT /twitter/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.query.debug": "2s",
  "index.search.slowlog.threshold.query.trace": "500ms",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.search.slowlog.threshold.fetch.info": "800ms",
  "index.search.slowlog.threshold.fetch.debug": "500ms",
  "index.search.slowlog.threshold.fetch.trace": "200ms",
  "index.search.slowlog.level": "info"
}

The separate log levels make it quicker to grep the log for requests of a given severity.
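The way a request's duration maps to a log level can be sketched as below: the request is logged at the most severe level whose threshold it exceeds. This is a simplified model with hypothetical helper names; it uses the query-phase thresholds from the example above, skips thresholds set to -1 (disabled), and ignores the separate index.search.slowlog.level setting.

```python
_UNITS = {"ms": 1, "s": 1000, "m": 60_000}
LEVELS = ("warn", "info", "debug", "trace")  # most severe first


def to_millis(value):
    """Parse a time value such as '500ms', '10s' or '5m' into milliseconds."""
    for suffix in ("ms", "s", "m"):  # test 'ms' before the shorter suffixes
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * _UNITS[suffix]
    raise ValueError(f"unsupported time value: {value!r}")


def slowlog_level(took_ms, thresholds):
    """Return the level a request would be logged at, or None if it is fast enough."""
    for level in LEVELS:
        limit = thresholds.get(level)
        if limit is None or limit == "-1":
            continue
        if took_ms > to_millis(limit):
            return level
    return None


query_thresholds = {"warn": "10s", "info": "5s", "debug": "2s", "trace": "500ms"}
for took in (300, 600, 2500, 6000, 12000):
    print(took, "ms ->", slowlog_level(took, query_thresholds))
```

A 300 ms query is not logged at all, whereas a 6 s query exceeds the info threshold but not the warn threshold and is therefore logged at info.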

The corresponding log4j2.properties configuration is:

appender.index_search_slowlog_rolling.type = RollingFile
appender.index_search_slowlog_rolling.name = index_search_slowlog_rolling
appender.index_search_slowlog_rolling.fileName = ${sys:es.logs.base_path}${sys:file.separator}${sys:es.logs.cluster_name}_index_search_slowlog.log
appender.index_search_slowlog_rolling.layout.type = PatternLayout
appender.index_search_slowlog_rolling.layout.pattern = [%d{ISO8601}][%-5p][%-25c] [%node_name]%marker %.-10000m%n
appender.index_search_slowlog_rolling.filePattern = ${sys:es.logs.base_path}${sys:file.separator}${sys:es.logs.cluster_name}_index_search_slowlog-%i.log.gz
appender.index_search_slowlog_rolling.policies.type = Policies
appender.index_search_slowlog_rolling.policies.size.type = SizeBasedTriggeringPolicy
appender.index_search_slowlog_rolling.policies.size.size = 1GB
appender.index_search_slowlog_rolling.strategy.type = DefaultRolloverStrategy
appender.index_search_slowlog_rolling.strategy.max = 4

logger.index_search_slowlog_rolling.name = index.search.slowlog
logger.index_search_slowlog_rolling.level = trace
logger.index_search_slowlog_rolling.appenderRef.index_search_slowlog_rolling.ref = index_search_slowlog_rolling
logger.index_search_slowlog_rolling.additivity = false

2. slow index

PUT /twitter/_settings
{
  "index.indexing.slowlog.threshold.index.warn": "10s",
  "index.indexing.slowlog.threshold.index.info": "5s",
  "index.indexing.slowlog.threshold.index.debug": "2s",
  "index.indexing.slowlog.threshold.index.trace": "500ms",
  "index.indexing.slowlog.level": "info",
  "index.indexing.slowlog.source": "1000"
}

The log4j2.properties configuration:

appender.index_indexing_slowlog_rolling.type = RollingFile
appender.index_indexing_slowlog_rolling.name = index_indexing_slowlog_rolling
appender.index_indexing_slowlog_rolling.fileName = ${sys:es.logs.base_path}${sys:file.separator}${sys:es.logs.cluster_name}_index_indexing_slowlog.log
appender.index_indexing_slowlog_rolling.layout.type = PatternLayout
appender.index_indexing_slowlog_rolling.layout.pattern = [%d{ISO8601}][%-5p][%-25c] [%node_name]%marker %.-10000m%n
appender.index_indexing_slowlog_rolling.filePattern = ${sys:es.logs.base_path}${sys:file.separator}${sys:es.logs.cluster_name}_index_indexing_slowlog-%i.log.gz
appender.index_indexing_slowlog_rolling.policies.type = Policies
appender.index_indexing_slowlog_rolling.policies.size.type = SizeBasedTriggeringPolicy
appender.index_indexing_slowlog_rolling.policies.size.size = 1GB
appender.index_indexing_slowlog_rolling.strategy.type = DefaultRolloverStrategy
appender.index_indexing_slowlog_rolling.strategy.max = 4

logger.index_indexing_slowlog.name = index.indexing.slowlog.index
logger.index_indexing_slowlog.level = trace
logger.index_indexing_slowlog.appenderRef.index_indexing_slowlog_rolling.ref = index_indexing_slowlog_rolling
logger.index_indexing_slowlog.additivity = false

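7. Store settings: the file system type used to store shard data

1. Store types and settings

These settings concern the file system. File systems on physical machines vary widely; by default ES picks the best way to read and write based on the operating environment, but you can also choose it yourself.

The setting is static. It is also an expert-level setting and may be removed in the future.

It can be set globally in elasticsearch.yml:

index.store.type: niofs

or for a single index at creation time:

PUT /my_index
{"settings": {"index.store.type": "niofs"}}

The following values are available:

fs: the default file system implementation. It picks the best implementation for the operating environment, which is currently hybridfs on all supported systems but may change.
simplefs: a straightforward implementation of file system storage using a random access file (maps to Lucene SimpleFSDirectory). Its concurrent performance is poor (multiple threads will bottleneck). It is usually better to use niofs when you need index persistence.
niofs: stores the shard index on the file system using NIO (maps to Lucene NIOFSDirectory). It allows multiple threads to read the same file concurrently. It is not recommended on Windows because of a bug in the SUN Java implementation.
mmapfs: stores the shard index on the file system by mapping files into memory with mmap (maps to Lucene MMapDirectory). Memory mapping uses up a portion of the process's virtual address space equal to the size of the file being mapped. Before using this type, make sure you have allowed plenty of virtual address space.
hybridfs: a hybrid of niofs and mmapfs that chooses the best file system type for each type of file based on its read access pattern. Currently only the Lucene term dictionary, norms and doc values files are memory mapped; all other files are opened with Lucene NIOFSDirectory. As with mmapfs, make sure you have allowed plenty of virtual address space.

If you are not permitted to use large numbers of memory maps, you can restrict them with node.store.allow_mmap. It is a boolean that defaults to true; set it to false to disallow memory mapping.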

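2. Preloading data into the file system cache

This is also an expert-level setting and may change in the future.
By default, Elasticsearch relies entirely on the operating system's file system cache for caching I/O operations. Setting index.store.preload tells the operating system to load the content of index files into memory when the index is opened.
The setting accepts a comma-delimited list of file extensions: every file whose extension is in the list is preloaded on open. This is useful for improving search performance, especially after the host operating system restarts, since a restart empties the file system cache.
Note, however, that it may slow down the opening of indices, because an index only becomes available once the data has been loaded into physical memory.

It can be set globally in elasticsearch.yml:

index.store.preload: ["nvd", "dvd"]

or for a single index at creation time:

PUT /my_index
{"settings": {"index.store.preload": ["nvd", "dvd"]}}

Wildcards are supported as well: "*" preloads all files.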

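8. Translog: controlling the transaction log and background flush operations

This part is easiest to understand together with the ES refresh and flush operations.

1. How an ES shard relates to Lucene

An ES shard (es_shard below) corresponds to one index in Lucene (lucene_index below).
A lucene_index consists of multiple segments.

2. Transactional behaviour of a Lucene index

When IndexReader.open opens an index, it effectively takes a snapshot of that index; modifications made afterwards are not visible through that reader.

Only after the index is opened again with IndexReader.open can modifications made since the previous open become visible.

As long as the IndexWriter has not called commit, its modifications are not visible, even if the IndexReader is reopened.

For the latest modifications to become visible, the IndexWriter has to commit (which produces a new segment) and the IndexReader has to be reopened.

lucene commit: a commit applies to the whole lucene_index, not to a single segment. It applies the pending creates, updates and deletes, merges some segments into new segments of the lucene_index, and persists them to disk.

lucene reopen: for new inserts, updates and deletes to show up in queries, the reader must be reopened.

In other words, for new content to become visible ES would in theory need a commit plus a reopen. Doing that for every change is expensive; it is best thought of as one naive way ES could have been implemented.

3. How ES handles inserts, updates and deletes

To get better near-real-time behaviour, ES introduces the translog. Every insert, update or delete request is recorded twice: once in the translog and once in the in-memory buffer.
When a _refresh runs (by default once per second), the in-memory buffer is turned into a new in-memory segment. Some optimization is at work here (a faster equivalent of commit + reopen).
From then on that in-memory segment is searchable, but it has not been persisted yet. If the service crashes at this point, the data can be rebuilt by replaying the translog. The translog can be configured to be persisted before every Index, Bulk, Delete or Update request is acknowledged, or to be persisted asynchronously (at the risk of losing data).

A flush is essentially a Lucene commit: it merges the in-memory segments into new segments written to disk and creates a new translog file (the old translog is not deleted right away). This is a relatively expensive operation.
ES keeps the old translog mainly to speed up replication: when a replica recovers from the primary, recovery can sometimes be accelerated by transferring the translog directly.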
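The interplay of indexing, refresh, flush and crash recovery can be captured in a toy model. This is a conceptual sketch only: `ToyShard` is a made-up class, segments are plain Python lists, and a flush here simply truncates the translog, whereas the text above notes that ES starts a new translog file and keeps the old one for a while.

```python
class ToyShard:
    """Toy model of one shard: buffer, searchable segments, commits, translog."""

    def __init__(self):
        self.translog = []    # durable log of operations since the last flush
        self.buffer = []      # in-memory indexing buffer, not searchable yet
        self.searchable = []  # in-memory segments: searchable, not yet committed
        self.committed = []   # segments persisted by a flush (Lucene commit)

    def index(self, doc):
        # Every operation goes to both the translog and the in-memory buffer.
        self.translog.append(doc)
        self.buffer.append(doc)

    def refresh(self):
        # Turn the buffer into a new searchable segment.
        if self.buffer:
            self.searchable.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        # Persist everything and start over with an empty translog.
        self.refresh()
        self.committed.extend(self.searchable)
        self.searchable.clear()
        self.translog.clear()

    def search(self):
        return [doc for seg in self.committed + self.searchable for doc in seg]

    def recover_after_crash(self):
        # In-memory state is lost; committed segments and the translog survive.
        self.searchable.clear()
        self.buffer = list(self.translog)  # replay the translog
        self.refresh()


shard = ToyShard()
shard.index("a")
print(shard.search())  # not visible before a refresh
shard.refresh()
print(shard.search())  # visible after the refresh
shard.flush()
shard.index("b")
shard.recover_after_crash()
print(shard.search())  # "b" is rebuilt from the translog
```

The last step shows why the translog matters: document "b" was never committed, yet it survives the simulated crash because its operation can be replayed.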

4. Translog settings

index.translog.durability: how the translog is made durable. The default is request, which means that for index, delete, update and bulk requests ES only reports success to the client once the translog has been persisted on the primary and on all replicas. It can also be set to async, in which case the translog is fsynced asynchronously every index.translog.sync_interval (default 5 seconds).

index.translog.sync_interval: defaults to 5s and cannot be less than 100ms.

index.translog.flush_threshold_size: the translog size that triggers a flush. To avoid slow shard recovery caused by replaying a large translog, a Lucene commit is performed once the translog reaches this size, so that the operations held in the translog are persisted to disk.

index.translog.retention.size: the maximum total size of all retained translog files. Keeping more translog files makes it more likely that a recovering replica can sync by copying the primary's translog over the network, which speeds up synchronization. If translog-based sync is not efficient enough, recovery falls back to syncing segment files. Defaults to 512mb; older files are deleted once the limit is exceeded.

index.translog.retention.age: the maximum time translog files are retained. Defaults to 12h.

These articles explain the relationship between refresh, flush and the translog in ES quite clearly:
https://qbox.io/blog/refresh-flush-operations-elasticsearch-guide
https://www.jianshu.com/p/15837be98ffd

These articles by Liu serve as a useful footnote:
https://www.cnblogs.com/forfuture1978/archive/2010/06/08/1753642.html
https://www.cnblogs.com/forfuture1978/archive/2010/06/27/1766162.html

The transactional behaviour of Lucene:
https://www.cnblogs.com/forfuture1978/archive/2010/06/07/1752917.html

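9. Index sorting

1. Introduction to index sorting

Index sorting specifies the order in which documents are arranged when Lucene creates a segment. By default Lucene simply stores documents in the order they were indexed, without any sorting rule.
This blog post describes index sorting thoroughly,
and this Lucene article is also good.

With index sorting, the documents in a segment are laid out in the sorted order of a field's values, so the order of doc IDs agrees with the order of that field. When sorting to get the top N, only the top N of each segment need to be read.

Usage: the following creates an index whose segments are sorted by multiple fields:

PUT twitter
{
  "settings": {
    "index": {
      "sort.field": ["username", "date"],
      "sort.order": ["asc", "desc"]
    }
  },
  "mappings": {
    "properties": {
      "username": { "type": "keyword", "doc_values": true },
      "date": { "type": "date" }
    }
  }
}

2. Available properties

index.sort.field: a list of the fields to sort by
index.sort.order: the sort order of each field, asc or desc
index.sort.mode: ES supports multi-valued fields, so the value of a field may be an array. This option selects which value of the array is used for sorting: min uses the lowest value and max the highest.
index.sort.missing: how to treat documents that lack the sort field. The options are _last and _first, which place them last or first.

Index sorting can only be defined when the index is created; it cannot be set or updated on an existing index.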

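3. What index sorting is mainly used for

It allows a query to terminate early and return its results.

PUT events
{
  "settings": {
    "index": {
      "sort.field": "timestamp",
      "sort.order": "desc"
    }
  },
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" }
    }
  }
}

Without index sorting, a request visits every document matched by the query, looks up the doc value of the sort field by doc ID, sorts, and then takes the first N. With index sorting, if the search sorts on the same field in the same order, only the first N documents of each segment need to be visited.

Sample request:

GET /events/_search
{"size": 10, "sort": [{ "timestamp": "desc" }]}

For this request Lucene only needs to take the first N documents of each segment. The remaining matching documents are still collected in order to compute the total number of hits. If the total is not needed, set "track_total_hits": false in the request,
so that ES returns as soon as it has found N documents, which is considerably faster.
If the request contains aggregations, "track_total_hits": false is ignored for them and all matching documents are still collected.

GET /events/_search
{"size": 10, "sort": [{ "timestamp": "desc" }], "track_total_hits": false}

The response then carries no total hit count:

{"_shards": ..., "hits": {"max_score": null, "hits": []}, "took": 20, "timed_out": false}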
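Why early termination is safe can be shown with a small sketch: when every segment is already sorted the way the search sorts, the global top N can only come from the first N entries of each segment. The code is a conceptual model with plain lists of timestamps standing in for segments; the function names are made up and this is not how Lucene collects hits internally.

```python
import heapq
from itertools import islice


def top_n_sorted_segments(segments, n):
    """Segments are sorted descending, so read only the first n of each."""
    heads = [list(islice(seg, n)) for seg in segments]
    return list(islice(heapq.merge(*heads, reverse=True), n))


def top_n_unsorted(segments, n):
    """Without index sorting, every matching document has to be visited."""
    return heapq.nlargest(n, (ts for seg in segments for ts in seg))


# Three segments, each sorted by timestamp in descending order.
segments = [[9, 7, 3, 1], [8, 6, 5], [10, 2]]
print(top_n_sorted_segments(segments, 3))
print(top_n_unsorted(segments, 3))
```

Both functions return the same result, but the first one touches at most N documents per segment, which is the saving that "track_total_hits": false makes possible.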

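A side note on how Lucene searches: the segment is the unit of search, and each segment is also called a sub-index.

Index sorting is also useful for organizing Lucene doc IDs (not to be confused with _id in ES) in a way that makes AND queries (a AND b AND ...) more efficient. To be efficient, AND queries rely on the fact that if any clause does not match, the whole query does not match. With index sorting, documents that do not match can be placed next to each other, which helps to skip efficiently over large ranges of doc IDs that do not match the conjunction.

This trick only works for low-cardinality fields (fields that take only a limited number of distinct values, even though the number of documents may be large). A rule of thumb is to sort first on fields that both have low cardinality and are frequently used for filtering. The sort order (ascending or descending) does not matter, because all we care about is putting values that match the same clauses close to each other.

For instance, if you were indexing cars for sale, it might be interesting to sort by fuel type, body type, make, year of registration and finally mileage.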
