
Understanding Lucene/Solr Caches

Published: 2023/11/29


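Caching is critical for raising a search engine's throughput and lowering its CPU usage, and Lucene/Solr put a lot of work into it. Lucene/Solr ship five caches by default, and Solr also provides a cache extension interface that lets developers define custom caches.

1 Basic principles of caching

Solr implements caches with two eviction policies: LRU (Least Recently Used) and LFU (Least Frequently Used). Both policies are also used in operating-system memory management (page replacement). Other policies exist as well, such as FIFO and Rand. Whatever the policy, a good one achieves a high hit rate in the application and is simple to implement.

1.1 The LRU policy

LRU stands for least recently used. Suppose the cache has a capacity of 10: the cached objects are ordered by the time of their last access (or insertion), and when capacity runs out, the object with the earliest time is evicted. (The real implementation maintains this order with a linked list.)

1.1.1 LRUCache

In Solr, LRUCache is implemented on top of LinkedHashMap, as the init method of LRUCache shows: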

    map = new LinkedHashMap<K,V>(initialSize, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry eldest) {
          if (size() > limit) {
            // increment evictions regardless of state.
            // this doesn't need to be synchronized because it will
            // only be called in the context of a higher level synchronized block.
            evictions++;
            stats.evictions.incrementAndGet();
            return true;
          }
          return false;
        }
      };

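Note the last constructor argument, accessOrder. Here accessOrder=true, which means map.get() changes the structure of the linked list: the accessed entry moves to the end. If accessOrder were false, map.get() would not change the linked list inside LinkedHashMap, and the least-recently-used behavior would be lost.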
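The effect of accessOrder can be seen in a small standalone sketch. The class and method names below (AccessOrderDemo, boundedMap, survivors) are made up for illustration and are not part of Solr; only LinkedHashMap and removeEldestEntry come from the JDK.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class; not part of Solr.
class AccessOrderDemo {

    // Builds a bounded map that evicts its eldest entry once size exceeds limit.
    static <K, V> Map<K, V> boundedMap(final int limit, boolean accessOrder) {
        return new LinkedHashMap<K, V>(16, 0.75f, accessOrder) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > limit;
            }
        };
    }

    // Inserts a, b, c into a map of capacity 3, reads "a", then inserts d.
    // Returns the surviving keys in iteration order (eldest first).
    static String survivors(boolean accessOrder) {
        Map<String, Integer> map = boundedMap(3, accessOrder);
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);
        map.get("a");     // moves "a" to the tail only when accessOrder is true
        map.put("d", 4);  // exceeds the limit, so the eldest entry is evicted
        return String.join(",", map.keySet());
    }

    public static void main(String[] args) {
        System.out.println("accessOrder=true:  " + survivors(true));   // c,a,d
        System.out.println("accessOrder=false: " + survivors(false));  // b,c,d
    }
}
```

With accessOrder=true the read of "a" saves it from eviction and "b" is dropped; with accessOrder=false the map behaves like a FIFO queue and "a" is dropped despite having just been read.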

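Because LRUCache is essentially a LinkedHashMap, and HashMap is not thread-safe, get and put have to be synchronized by locking the whole map, so performance suffers under high concurrency. Solr therefore implements the LRU cache in a second way: FastLRUCache.

1.1.2 FastLRUCache

FastLRUCache is built on ConcurrentLRUCache, which in turn is built on ConcurrentHashMap, so it is thread-safe. The cache tracks the order in which entries were accessed through an access stamp, lastAccessed, stored in each CacheEntry: every get or put on the cache sets the current CacheEntry's lastAccessed to the largest value so far (state.accessCounter). When FastLRUCache is full, markAndSweep evicts the N entries with the smallest lastAccessed so that the cache size drops to an acceptable value.

markAndSweep runs in two phases. The first phase reclaims the least recently used entries. If the cache is still larger than the acceptable size after the first phase, the second phase starts and brings the size down more strictly.

The first phase can be explained with a number line of lastAccessed values: entries above newestEntry - wantToKeep are certain to be kept, entries below oldestEntry + wantToRemove are certain to be removed, and entries in between are undecided.

The corresponding code is as follows (see the ConcurrentLRUCache.markAndSweep method):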

    // since the wantToKeep group is likely to be bigger than wantToRemove, check it first
    if (thisEntry > newestEntry - wantToKeep) {
      // this entry is guaranteed not to be in the bottom
      // group, so do nothing.
      numKept++;
      newOldestEntry = Math.min(thisEntry, newOldestEntry);
    } else if (thisEntry < oldestEntry + wantToRemove) { // entry in bottom group?
      // this entry is guaranteed to be in the bottom group
      // so immediately remove it from the map.
      evictEntry(ce.key);
      numRemoved++;
    } else {
      // This entry *could* be in the bottom group.
      // Collect these entries to avoid another full pass... this is wasted
      // effort if enough entries are normally removed in this first pass.
      // An alternate impl could make a full second pass.
      if (eSize < eset.length-1) {
        eset[eSize++] = ce;
        newNewestEntry = Math.max(thisEntry, newNewestEntry);
        newOldestEntry = Math.min(thisEntry, newOldestEntry);
      }
    }

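The code shows that the first phase runs the same logic twice. Normally the cache size is under control after the first phase. If it is not, the undecided entries (those that fell between the two thresholds) are pushed into a priority queue of bounded size, and finally every entry in the priority queue is evicted. This guarantees that the cache size comes down. Using a priority queue from the start would take far less code, but performance would be considerably worse.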
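The two phases can be condensed into a simplified, single-threaded sketch. It is not Solr's code: SweepSketch, sweep and demo are hypothetical names, the map simply holds key -> lastAccessed stamp, stamps are assumed to be unique (as they would be with a shared access counter), the first phase runs only once, and the priority queue is unbounded.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Simplified, single-threaded sketch; names are hypothetical, not Solr's.
class SweepSketch {

    // Shrinks map (key -> unique lastAccessed stamp) to at most targetSize entries,
    // evicting the entries with the smallest stamps. Returns the evicted keys.
    static List<String> sweep(Map<String, Long> map, int targetSize) {
        List<String> evicted = new ArrayList<>();
        int wantToRemove = map.size() - targetSize;
        if (wantToRemove <= 0) {
            return evicted;
        }
        int wantToKeep = targetSize;

        long oldest = Long.MAX_VALUE;
        long newest = Long.MIN_VALUE;
        for (long stamp : map.values()) {
            oldest = Math.min(oldest, stamp);
            newest = Math.max(newest, stamp);
        }

        // Phase 1: classify each entry by its position on the stamp axis.
        List<Map.Entry<String, Long>> undecided = new ArrayList<>();
        for (Map.Entry<String, Long> e : new HashMap<>(map).entrySet()) {
            long stamp = e.getValue();
            if (stamp > newest - wantToKeep) {
                // guaranteed to be among the newest wantToKeep entries: keep
            } else if (stamp < oldest + wantToRemove) {
                // guaranteed to be among the oldest wantToRemove entries: evict now
                map.remove(e.getKey());
                evicted.add(e.getKey());
            } else {
                undecided.add(e);
            }
        }

        // Phase 2: order the undecided entries by stamp and evict the oldest
        // until the target size is reached.
        PriorityQueue<Map.Entry<String, Long>> queue =
            new PriorityQueue<Map.Entry<String, Long>>(
                (a, b) -> Long.compare(a.getValue(), b.getValue()));
        queue.addAll(undecided);
        while (map.size() > targetSize && !queue.isEmpty()) {
            String key = queue.poll().getKey();
            map.remove(key);
            evicted.add(key);
        }
        return evicted;
    }

    // Six entries with gaps between their stamps, shrunk to three.
    static String demo() {
        Map<String, Long> map = new ConcurrentHashMap<>();
        map.put("a", 1L);
        map.put("b", 2L);
        map.put("c", 50L);
        map.put("d", 51L);
        map.put("e", 100L);
        map.put("f", 101L);
        sweep(map, 3);
        return new TreeSet<>(map.keySet()).toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [d, e, f]
    }
}
```

In the demo, "e" and "f" are kept and "a" and "b" are evicted by the threshold checks alone; "c" and "d" land in the undecided group, and the queue then evicts "c" as the older of the two.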

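As the analysis shows, frequent put operations easily trigger markAndSweep, and markAndSweep is relatively expensive. This can be optimized by setting newThreadForCleanup=true, which runs the sweep in a new thread so that put is not blocked. The corresponding setting in solrconfig.xml is cleanupThread=true. With it, the cache starts a single thread when it is constructed and drives markAndSweep through wait/notify, which avoids the cost of newThreadForCleanup=true spawning threads over and over. In short, the cache keeps its size under control through markAndSweep.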
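The wait/notify arrangement described above might look roughly like the following sketch. It is an illustration under stated assumptions rather than Solr's implementation: CleanupThreadSketch, the two watermark parameters and the naive sort-based sweep are all invented here to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a cache whose sweep runs on one long-lived thread.
class CleanupThreadSketch {
    private final Map<String, Long> stamps = new ConcurrentHashMap<>();
    private final AtomicLong accessCounter = new AtomicLong();
    private final int upperWaterMark;
    private final int lowerWaterMark;

    private final Object lock = new Object();
    private boolean sweepRequested = false; // guarded by lock
    private boolean stopped = false;        // guarded by lock
    private final Thread cleaner;

    CleanupThreadSketch(int upperWaterMark, int lowerWaterMark) {
        this.upperWaterMark = upperWaterMark;
        this.lowerWaterMark = lowerWaterMark;
        this.cleaner = new Thread(this::cleanupLoop, "cache-cleanup");
        this.cleaner.setDaemon(true);
        this.cleaner.start();
    }

    void put(String key) {
        stamps.put(key, accessCounter.incrementAndGet());
        if (stamps.size() > upperWaterMark) {
            synchronized (lock) {
                sweepRequested = true;
                lock.notify(); // wake the cleaner instead of sweeping on the caller's thread
            }
        }
    }

    int size() {
        return stamps.size();
    }

    // Stops the cleaner after it has handled any pending sweep request.
    void close() {
        synchronized (lock) {
            stopped = true;
            lock.notify();
        }
        try {
            cleaner.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void cleanupLoop() {
        try {
            while (true) {
                synchronized (lock) {
                    while (!sweepRequested && !stopped) {
                        lock.wait();
                    }
                    if (!sweepRequested) {
                        return; // stopped with nothing left to do
                    }
                    sweepRequested = false;
                }
                sweep(); // runs outside the lock so put() is never blocked by it
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Evicts the entries with the smallest stamps until lowerWaterMark is reached.
    private void sweep() {
        List<Map.Entry<String, Long>> snapshot =
            new ArrayList<>(new HashMap<>(stamps).entrySet());
        snapshot.sort((a, b) -> Long.compare(a.getValue(), b.getValue()));
        int toRemove = stamps.size() - lowerWaterMark;
        for (int i = 0; i < toRemove && i < snapshot.size(); i++) {
            stamps.remove(snapshot.get(i).getKey());
        }
    }

    // Puts 20 keys into a cache with watermarks 5 and 3 and returns the final size.
    static int demo() {
        CleanupThreadSketch cache = new CleanupThreadSketch(5, 3);
        for (int i = 0; i < 20; i++) {
            cache.put("k" + i);
        }
        cache.close();
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println("final size: " + demo());
    }
}
```

Because the sweep runs concurrently with the puts, the exact final size depends on timing; what the sketch guarantees is that once close() returns, the size lies between the lower and the upper watermark.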

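1.2 The LFU policy

LFU is the least-frequently-used policy. When the cache is full, the entries used the fewest times within a given period are evicted. It follows that an LFU implementation needs a counter that records how many times each cache entry has been accessed, and that is exactly what Solr does (see the CacheEntry structure):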

    private static class CacheEntry<K, V> implements Comparable<CacheEntry<K, V>> {
      K key;
      V value;
      volatile AtomicLong hits = new AtomicLong(0);
      long hitsCopy = 0;
      volatile long lastAccessed = 0;
      long lastAccessedCopy = 0;

      public CacheEntry(K key, V value, long lastAccessed) {
        this.key = key;
        this.value = value;
        this.lastAccessed = lastAccessed;
      }
      // remainder of the class is not shown in this excerpt

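CacheEntry clearly uses hits to record the access count. lastAccessed exists to break ties when capacity is being enforced: if two CacheEntry objects in the eviction queue have the same hits, the one with the smaller lastAccessed is evicted. hitsCopy and lastAccessedCopy exist for performance: they are plain copies that spare the sweep repeated atomic and volatile reads, each of which crosses a memory barrier when multiple threads are involved.

LFUCache is implemented by ConcurrentLFUCache, which is itself backed by a ConcurrentHashMap. The focus here is ConcurrentLFUCache.

ConcurrentLFUCache also controls its capacity with markAndSweep; my guess is that this keeps the code consistent with ConcurrentLRUCache for readability.

Compared with the markAndSweep of ConcurrentLRUCache, the one in ConcurrentLFUCache is fairly simple. It uses a TreeSet to hold the eviction candidates, ordered by hits and lastAccessed (see the compareTo method of CacheEntry).

The core of the markAndSweep method is as follows: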

    TreeSet<CacheEntry> tree = new TreeSet<CacheEntry>();
    for (CacheEntry<K, V> ce : map.values()) {
      // set hitsCopy to avoid later Atomic reads
      ce.hitsCopy = ce.hits.get();
      ce.lastAccessedCopy = ce.lastAccessed;
      if (timeDecay) {
        ce.hits.set(ce.hitsCopy >>> 1);
      }
      if (tree.size() < wantToRemove) {
        tree.add(ce);
      } else {
        // If the hits are not equal, we can remove before adding
        // which is slightly faster
        if (ce.hitsCopy < tree.first().hitsCopy) {
          tree.remove(tree.first());
          tree.add(ce);
        } else if (ce.hitsCopy == tree.first().hitsCopy) {
          tree.add(ce);
          tree.remove(tree.first());
        }
      }
    }
    for (CacheEntry<K, V> e : tree) {
      evictEntry(e.key);
    }
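The selection step can be tried in isolation with the sketch below. It is a simplified stand-in, not Solr's class: LfuSelectSketch and its Entry type are hypothetical, the time decay is left out, and the ordering is an assumption chosen so that the first element of the TreeSet is the most-used candidate, the one to spare when the set grows past wantToRemove.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch; field names echo the article but this is not Solr's code.
class LfuSelectSketch {

    static final class Entry implements Comparable<Entry> {
        final String key;
        final long hits;
        final long lastAccessed; // assumed unique per entry

        Entry(String key, long hits, long lastAccessed) {
            this.key = key;
            this.hits = hits;
            this.lastAccessed = lastAccessed;
        }

        // Most-used entries sort first; among equal hits, the more recently
        // accessed entry sorts first. The last element is the best victim.
        @Override
        public int compareTo(Entry that) {
            if (this.hits != that.hits) {
                return Long.compare(that.hits, this.hits);
            }
            return Long.compare(that.lastAccessed, this.lastAccessed);
        }
    }

    // Returns the keys of the wantToRemove least-frequently-used entries.
    static List<String> selectVictims(List<Entry> entries, int wantToRemove) {
        TreeSet<Entry> tree = new TreeSet<>();
        for (Entry e : entries) {
            tree.add(e);
            if (tree.size() > wantToRemove) {
                tree.pollFirst(); // spare the most-used candidate
            }
        }
        List<String> victims = new ArrayList<>();
        for (Entry e : tree) {
            victims.add(e.key);
        }
        return victims;
    }

    static String demo() {
        List<Entry> entries = Arrays.asList(
            new Entry("a", 5, 1),
            new Entry("b", 1, 2),
            new Entry("c", 1, 3),
            new Entry("d", 9, 4),
            new Entry("e", 0, 5));
        return selectVictims(entries, 2).toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [b, e]
    }
}
```

"e" is chosen because it has the fewest hits; "b" and "c" tie on hits, and "b" is chosen because its lastAccessed is smaller.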

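Solr implements LFUCache but no FastLFUCache, because LFUCache is already built on ConcurrentHashMap and handles concurrency well. A FastLFUCache would have to resort to non-blocking data structures.

2 How Solr uses its caches

As mentioned earlier, Solr implements caches at several levels. They are managed centrally by SolrIndexSearcher and are applied to query-related operations such as query and facet.

2.1 filterCache

filterCache is declared in SolrIndexSearcher as follows: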

    SolrCache<Query,DocSet> filterCache;

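The key of filterCache is a Query and the value is a DocSet. As the name suggests, the basic job of a DocSet here is filtering. So where is filtering likely to be needed?

filterCache is used in Solr in the following cases:

1. The query parameter facet.method=enum is set

2. <useFilterForSortedQuery/> is set to true in solrconfig.xml

3. The query parameters include facet.query or group.query

4. The query parameters include fq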
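To make the fq case concrete, here is a toy model of a filter cache: each filter query string maps to a set of matching doc ids, and a request intersects the cached sets of all its filters. Everything in it is an assumption made for illustration: FilterCacheSketch, the use of String keys and java.util.BitSet in place of Solr's Query and DocSet, and the fake index behind it.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy model of a filter cache; not Solr's implementation.
class FilterCacheSketch {
    private final Map<String, BitSet> cache = new HashMap<>();
    private final Function<String, BitSet> executor; // runs a filter against the index
    int executions = 0;                               // how often the index was hit

    FilterCacheSketch(Function<String, BitSet> executor) {
        this.executor = executor;
    }

    // Returns the doc set for one filter, executing it only on a cache miss.
    BitSet docSet(String filterQuery) {
        BitSet cached = cache.get(filterQuery);
        if (cached == null) {
            executions++;
            cached = executor.apply(filterQuery);
            cache.put(filterQuery, cached);
        }
        return cached;
    }

    // Intersects the doc sets of all filters of one request.
    BitSet applyFilters(String... filterQueries) {
        BitSet result = null;
        for (String fq : filterQueries) {
            BitSet set = docSet(fq);
            if (result == null) {
                result = (BitSet) set.clone(); // never mutate the cached set
            } else {
                result.and(set);
            }
        }
        return result == null ? new BitSet() : result;
    }

    static BitSet bits(int... docIds) {
        BitSet set = new BitSet();
        for (int id : docIds) {
            set.set(id);
        }
        return set;
    }

    static String demo() {
        Map<String, BitSet> fakeIndex = new HashMap<>();
        fakeIndex.put("inStock:true", bits(0, 1, 2, 5));
        fakeIndex.put("cat:book", bits(1, 2, 3));
        fakeIndex.put("price:low", bits(2, 3, 7));
        FilterCacheSketch cache = new FilterCacheSketch(fakeIndex::get);
        BitSet first = cache.applyFilters("inStock:true", "cat:book");
        BitSet second = cache.applyFilters("cat:book", "price:low");
        return first + " " + second + " " + cache.executions;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // {1, 2} {2, 3} 3
    }
}
```

The two requests use four filters in total, but only three are executed because "cat:book" is served from the cache the second time. This is why filters that recur across requests benefit most from being cached individually.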


2.2 fieldValueCache

fieldValueCache is declared in SolrIndexSearcher as follows:

    SolrCache<String,UnInvertedField> fieldValueCache;

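The key is the field name and the value is an UnInvertedField data structure.

fieldValueCache is used only for multivalued fields in Solr, typically by facet operations. Note that if it is not configured in solrconfig.xml, it still exists by default (initial size 10, maximum size 10000, no autowarming), which carries a risk of running out of memory.

Because the key of this cache is the field name, a Solr core rarely has more than a few hundred fields, fewer of those are multivalued, and fewer still are faceted on, the size configured in solrconfig.xml does not need to be large; anything bigger is wasted.

2.3 queryResultCache

This cache stores sorted doc ids, usually the top N. It uses less memory than filterCache because it holds only the top N. However, if the QueryCommand carries a filter (of type DocSet), the cache is not used, because hashCode and equals are expensive to compute on a DocSet.

2.4 documentCache

This cache maps docId -> Document. There is not much more to say about it.

2.5 Custom caches

If the caches built into Solr do not meet your needs, you can define a custom cache in solrconfig.xml.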

    <cache name="c"
           class="solr.FastLRUCache"
           size="4096"
           initialSize="1024"
           autowarmCount="1024"
           regenerator="com.mycompany.cache.CacheRegenerator"/>

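The only part that requires code is regenerator="com.mycompany.cache.CacheRegenerator". The regenerator is called when SolrIndexSearcher runs its warm method. Suppose the Solr index is updated every 2 minutes: for the updated index to become searchable, a new SolrIndexSearcher has to be opened, which raises a question: what happens to the caches inside the old SolrIndexSearcher?

Discarding all the old caches would inevitably hurt search performance. Solr instead warms the caches through the warm method: it takes the keys from the old cache and fetches their values again, then switches to the new searcher once warming has finished. The regenerateItem method of the regenerator is what refreshes the cache. Look at its parameters: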

    public boolean regenerateItem(SolrIndexSearcher newSearcher, SolrCache newCache,
        SolrCache oldCache, Object oldKey, Object oldVal) throws IOException;

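With the SolrIndexSearcher, oldCache, oldKey and oldVal at hand, the query result is easy to obtain. Doing this, however, reaches into Solr internals and is not recommended: a later upgrade may force you to rewrite the code, which makes upgrades and maintenance awkward.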
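The warming idea itself, stripped of the Solr types, can be sketched in a few lines. WarmSketch and warm are hypothetical names, the "searcher" is reduced to a function from key to value, and the choice of which keys to warm is simplified to the first autowarmCount keys in iteration order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model of cache warming; it does not use the Solr API.
class WarmSketch {

    // Recomputes up to autowarmCount keys of the old cache against the new searcher.
    static <K, V> Map<K, V> warm(Map<K, V> oldCache,
                                 Function<K, V> newSearcher,
                                 int autowarmCount) {
        Map<K, V> newCache = new LinkedHashMap<>();
        int warmed = 0;
        for (K key : oldCache.keySet()) {
            if (warmed++ >= autowarmCount) {
                break;
            }
            // Only the old key is reused; the value is recomputed against the
            // new index, so results that went stale are never carried over.
            newCache.put(key, newSearcher.apply(key));
        }
        return newCache;
    }

    static String demo() {
        Map<String, String> oldCache = new LinkedHashMap<>();
        oldCache.put("q1", "q1@gen1");
        oldCache.put("q2", "q2@gen1");
        oldCache.put("q3", "q3@gen1");
        Map<String, String> newCache = warm(oldCache, key -> key + "@gen2", 2);
        return newCache.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // {q1=q1@gen2, q2=q2@gen2}
    }
}
```

The new cache starts with fresh values for two of the three old keys; the third is simply recomputed on its first request after the switch.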

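2.6 fieldCache

Lucene stores both a forward index (docId --> field) and an inverted index (field --> docId). The inverted index is the core of search and is very fast to query. But when field values have to be fetched quickly by docId (for example, to sort by a field or to compute statistics over field values, as Solr faceting does), the lookup needs disk reads and is slow. Lucene therefore implements fieldCache.

Lucene implements caches for the various field types: Byte, Short, Int, Float, Long, and so on.

fieldCache is an internal Lucene cache, used mainly for sorting search results, for example by time. Because fieldCache stores its data in arrays (see the FieldCacheImpl source) and each array is allocated with maxDoc elements, fieldCache consumes a great deal of memory when the index is large, so out-of-memory errors are easy to run into.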
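A rough sketch of why an array indexed by docId is both fast and memory-hungry follows. FieldCacheSketch and its methods are hypothetical; the size estimate counts only the raw array payload (values per document times bytes per value) and ignores object headers and any other bookkeeping.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of a per-field value array; not Lucene's FieldCacheImpl.
class FieldCacheSketch {

    // Raw payload of one array that holds one value per document.
    static long arrayBytes(long maxDoc, int bytesPerValue) {
        return maxDoc * bytesPerValue;
    }

    // Returns the n doc ids with the largest field values, largest first.
    // Looking up valueByDocId[docId] is a plain array read, with no disk access.
    static List<Integer> topNDescending(int[] valueByDocId, int n) {
        List<Integer> docIds = new ArrayList<>();
        for (int docId = 0; docId < valueByDocId.length; docId++) {
            docIds.add(docId);
        }
        docIds.sort(
            Comparator.comparingInt((Integer docId) -> valueByDocId[docId]).reversed());
        return new ArrayList<>(docIds.subList(0, Math.min(n, docIds.size())));
    }

    public static void main(String[] args) {
        // 100 million documents with one int (4 bytes) each: 400,000,000 bytes per field.
        System.out.println(arrayBytes(100_000_000L, 4));
        // Field values indexed by docId 0..9; the three largest values are at docs 9, 8, 5.
        int[] ids = {1, 5, 3, 2, 4, 8, 6, 7, 9, 10};
        System.out.println(topNDescending(ids, 3)); // [9, 8, 5]
    }
}
```

The estimate applies per cached field, so sorting on several fields of a large index multiplies the cost.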

The following source code shows an example that exercises fieldCache.

    package com.vancl.cache;

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.IntField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.TopFieldCollector;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class TestFieldCache {
        Directory d = new RAMDirectory();
        Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_42);
        IndexWriterConfig conf = null;
        IndexWriter iw = null;

        // Index ten documents, each with an addTime string and an integer id.
        public void index() throws IOException {
            conf = new IndexWriterConfig(Version.LUCENE_42, analyzer);
            iw = new IndexWriter(d, conf);
            Document doc = null;
            int[] ids = {1, 5, 3, 2, 4, 8, 6, 7, 9, 10};
            String[] addTimes = {
                "2012-12-12 12:12:12", "2012-12-12 12:12:13", "2012-12-12 12:12:14",
                "2012-12-12 12:12:15", "2012-12-12 12:12:11", "2012-12-12 12:12:10",
                "2012-12-12 12:12:09", "2012-12-12 12:12:08", "2012-12-12 12:12:07",
                "2012-12-12 12:12:06"
            };
            for (int i = 1; i <= 10; i++) {
                doc = new Document();
                doc.add(new StringField("addTime", addTimes[i - 1], Store.YES));
                doc.add(new IntField("id", ids[i - 1], Store.YES));
                iw.addDocument(doc);
            }
            iw.commit();
            iw.close();
        }

        // Fetch the top 5 documents in sorted order; sorting goes through fieldCache.
        public void query() throws IOException {
            IndexReader ir = DirectoryReader.open(d);
            IndexSearcher is = new IndexSearcher(ir);
            // Sort by addTime in descending order
            Sort sort = new Sort(new SortField("addTime", SortField.Type.STRING, true));
            // Sort by id in descending order
            // Sort sort = new Sort(new SortField("id", SortField.Type.INT, true));
            TopFieldCollector collector =
                TopFieldCollector.create(sort, 5, false, false, false, false);
            is.search(new MatchAllDocsQuery(), collector);
            TopDocs top = collector.topDocs();
            for (ScoreDoc doc : top.scoreDocs) {
                // System.out.println(ir.document(doc.doc).get("id"));
                System.out.println(ir.document(doc.doc).get("addTime"));
            }
            ir.close();
        }

        public static void main(String[] args) throws IOException {
            TestFieldCache c = new TestFieldCache();
            c.index();
            c.query();
        }
    }

Reposted from: https://blog.51cto.com/sbp810050504/1421546
