A Study of Lucene's Scoring (score) Mechanism

Published: 2025/4/5

First, we need to understand Lucene's scoring formula.
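Written out from the factors listed in Table 3.5, the formula can be sketched as follows. This is a reconstruction that assumes the classic TF-IDF form used by DefaultSimilarity, so treat the exact arrangement of terms as an assumption rather than a quotation:

```latex
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
\sum_{t \in q}\Big(\mathrm{tf}(t \text{ in } d)\cdot\mathrm{idf}(t)^{2}\cdot
\mathrm{boost}(t.\mathrm{field} \text{ in } d)\cdot\mathrm{lengthNorm}(t.\mathrm{field} \text{ in } d)\Big)
```

In this form idf is squared because it contributes once to the query-side weight and once to the field-side weight.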

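The score is the sum, over every term t in the query q, of that term's match score against document d, with weighting factors applied as well. The meaning of each factor is given in the table below: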

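Table 3.5	Factors in the scoring formula
Factor	Description
tf(t in d)	Term frequency factor: how often term t occurs in document d.
idf(t)	Inverse document frequency: measures how "unique" a term is. Terms that occur in many documents have a lower idf; terms that occur in few documents have a higher idf.
boost(t.field in d)	Field and document boost, set at index time. It can be used to statically boost an individual field or document.
lengthNorm(t.field in d)	Normalization value of the field, reflecting the number of terms the field contains. It is computed at index time and stored in the index norms. Shorter fields (fewer tokens) receive a larger boost from this factor.
coord(q,d)	Coordination factor, based on how many of the query terms the document contains. It gives an AND-like boost to documents that contain more of the search terms.
queryNorm(q)	Normalization value for the query, computed from the sum of the squared weights of the query terms.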

The Searcher.explain(Query query, int doc) method shows exactly how a given document's score is composed. Example:

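// Uses the Lucene 2.x API (Hits, Field.Index.TOKENIZED, path-based constructors).
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class ScoreSortTest {
    public final static String INDEX_STORE_PATH = "index";

    public static void main(String[] args) throws Exception {
        // Build a fresh index with three one-field documents.
        IndexWriter writer = new IndexWriter(INDEX_STORE_PATH, new StandardAnalyzer(), true);
        writer.setUseCompoundFile(false);

        Document doc1 = new Document();
        Document doc2 = new Document();
        Document doc3 = new Document();

        Field f1 = new Field("bookname", "bc bc", Field.Store.YES, Field.Index.TOKENIZED);
        Field f2 = new Field("bookname", "ab bc", Field.Store.YES, Field.Index.TOKENIZED);
        Field f3 = new Field("bookname", "ab bc cd", Field.Store.YES, Field.Index.TOKENIZED);

        doc1.add(f1);
        doc2.add(f2);
        doc3.add(f3);

        writer.addDocument(doc1);
        writer.addDocument(doc2);
        writer.addDocument(doc3);
        writer.close();

        // Search for the term "bc" with a query boost of 2.
        IndexSearcher searcher = new IndexSearcher(INDEX_STORE_PATH);
        TermQuery q = new TermQuery(new Term("bookname", "bc"));
        q.setBoost(2f);
        Hits hits = searcher.search(q);

        // Print each hit's field value, its score, and the score explanation.
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.print(doc.get("bookname") + "\t\t");
            System.out.println(hits.score(i));
            System.out.println(searcher.explain(q, hits.id(i)));
        }
        searcher.close();
    }
}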

Output:

bc bc		0.629606
0.629606 = (MATCH) fieldWeight(bookname:bc in 0), product of:
  1.4142135 = tf(termFreq(bookname:bc)=2)
  0.71231794 = idf(docFreq=3, numDocs=3)
  0.625 = fieldNorm(field=bookname, doc=0)

ab bc		0.4451987
0.4451987 = (MATCH) fieldWeight(bookname:bc in 1), product of:
  1.0 = tf(termFreq(bookname:bc)=1)
  0.71231794 = idf(docFreq=3, numDocs=3)
  0.625 = fieldNorm(field=bookname, doc=1)

ab bc cd		0.35615897
0.35615897 = (MATCH) fieldWeight(bookname:bc in 2), product of:
  1.0 = tf(termFreq(bookname:bc)=1)
  0.71231794 = idf(docFreq=3, numDocs=3)
  0.5 = fieldNorm(field=bookname, doc=2)

The relevant source code:

Computing idf

idf is the inverse document frequency of a term. It is computed as follows:

/** Implemented as <code>log(numDocs/(docFreq+1)) + 1</code>. */
  @Override
  public float idf(long docFreq, long numDocs) {
    return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
  }

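docFreq is the number of Documents that match when searching for the given term; in our test, docFreq=14. numDocs is the total number of Documents in the index; in our test, numDocs=1453. Checking the result with a calculator confirms it, so the arithmetic is not repeated here.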
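The calculator check can also be done in a few lines of Java. The sketch below copies the idf expression quoted above into a small standalone class (the class name IdfCheck is made up for illustration) and evaluates it for the two cases that appear in this article: docFreq=3, numDocs=3 from the explain output, and docFreq=14, numDocs=1453 from the text.

```java
/** Recomputes idf with the expression quoted above: log(numDocs/(docFreq+1)) + 1. */
final class IdfCheck {

    // Same expression as the idf method shown above, copied into a standalone helper.
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // Values from the explain output of the three-document sample.
        System.out.println("idf(3, 3)     = " + idf(3, 3));
        // Values quoted in the text for the larger test index.
        System.out.println("idf(14, 1453) = " + idf(14, 1453));
    }
}
```

The first value should agree with the 0.71231794 shown in the explain output. It is below 1 because the term occurs in every document, so numDocs/(docFreq+1) is less than 1 and the logarithm is negative.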

Computing queryNorm

queryNorm is implemented in the DefaultSimilarity class, as shown below:

/** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
public float queryNorm(float sumOfSquaredWeights) {
    return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
}

Here, sumOfSquaredWeights is computed by the sumOfSquaredWeights method of the org.apache.lucene.search.TermQuery.TermWeight class:

public float sumOfSquaredWeights() {
      queryWeight = idf * getBoost();             // compute query weight
      return queryWeight * queryWeight;          // square it
    }

In fact, by default sumOfSquaredWeights = idf * idf, because the default boost in Lucene is 1.0.
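This also suggests why q.setBoost(2f) leaves no trace in the sample output, where each score equals fieldWeight alone. The sketch below assumes that, for a single TermQuery, the normalized query weight is idf * boost * queryNorm (this is how older TermWeight implementations behave; it is not shown in the excerpts above). Under that assumption the boost cancels out. The class and method names are made up for illustration.

```java
/** Shows how queryNorm cancels the boost of a single-term query (illustrative sketch). */
final class QueryNormCheck {

    // sumOfSquaredWeights for one TermQuery: (idf * boost)^2, as in the excerpt above.
    static float sumOfSquaredWeights(float idf, float boost) {
        float queryWeight = idf * boost;
        return queryWeight * queryWeight;
    }

    // Same expression as the queryNorm method shown above.
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // Assumed normalization step: query weight multiplied by queryNorm.
    static float normalizedQueryWeight(float idf, float boost) {
        return idf * boost * queryNorm(sumOfSquaredWeights(idf, boost));
    }

    public static void main(String[] args) {
        float idf = 0.71231794f; // idf(docFreq=3, numDocs=3) from the sample output
        for (float boost : new float[] {1f, 2f, 10f}) {
            System.out.println("boost=" + boost
                    + " queryNorm=" + queryNorm(sumOfSquaredWeights(idf, boost))
                    + " normalizedQueryWeight=" + normalizedQueryWeight(idf, boost));
        }
    }
}
```

For every boost the normalized query weight comes out at approximately 1, so a boost on a lone term only matters once the query has other clauses to be weighed against.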

Computing fieldWeight

The explainScore method in org/apache/lucene/search/similarities/TFIDFSimilarity.java contains:

    // explain field weight
    Explanation fieldExpl = new Explanation();
    fieldExpl.setDescription("fieldWeight in "+doc+
                             ", product of:");

    Explanation tfExplanation = new Explanation();
    tfExplanation.setValue(tf(freq.getValue()));
    tfExplanation.setDescription("tf(freq="+freq.getValue()+"), with freq of:");
    tfExplanation.addDetail(freq);
    fieldExpl.addDetail(tfExplanation);
    fieldExpl.addDetail(stats.idf);

    Explanation fieldNormExpl = new Explanation();
    float fieldNorm = norms != null ? decodeNormValue(norms.get(doc)) : 1.0f;
    fieldNormExpl.setValue(fieldNorm);
    fieldNormExpl.setDescription("fieldNorm(doc="+doc+")");
    fieldExpl.addDetail(fieldNormExpl);

    fieldExpl.setValue(tfExplanation.getValue() *
                       stats.idf.getValue() *
                       fieldNormExpl.getValue());

    result.addDetail(fieldExpl);

The key statement is this one:

    fieldExpl.setValue(tfExplanation.getValue() *
                       stats.idf.getValue() *
                       fieldNormExpl.getValue());

Expressed as a formula:

fieldWeight = tf * idf * fieldNorm
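As a quick check, the three scores in the sample output can be recomputed from this product. The sketch below takes idf and fieldNorm directly from the explain output and assumes tf is the square root of the term frequency, which matches the 1.4142135 reported for termFreq=2. The class name FieldWeightCheck is made up for illustration.

```java
/** Recomputes fieldWeight = tf * idf * fieldNorm for the three sample documents. */
final class FieldWeightCheck {

    // Assumed tf: square root of the term frequency (matches tf=1.4142135 for freq=2).
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // The product from the formula above, multiplied in the same order.
    static float fieldWeight(float freq, float idf, float fieldNorm) {
        return tf(freq) * idf * fieldNorm;
    }

    public static void main(String[] args) {
        float idf = 0.71231794f; // idf(docFreq=3, numDocs=3) from the explain output
        // termFreq and fieldNorm values are read off the explain output for each document.
        System.out.println("bc bc    -> " + fieldWeight(2, idf, 0.625f));
        System.out.println("ab bc    -> " + fieldWeight(1, idf, 0.625f));
        System.out.println("ab bc cd -> " + fieldWeight(1, idf, 0.5f));
    }
}
```

The results should agree with the reported scores 0.629606, 0.4451987 and 0.35615897 to within float rounding.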

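tf and idf are computed as described earlier. fieldNorm is fixed at index time and is simply read back from the index here, so this method does not show how it is calculated. With DefaultSimilarity it is essentially lengthNorm: the longer the field, the smaller the norm. The calculation can be found in org/apache/lucene/search/similarities/DefaultSimilarity.java: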

  public float lengthNorm(FieldInvertState state) {
    final int numTerms;
    if (discountOverlaps)
      numTerms = state.getLength() - state.getNumOverlap();
    else
      numTerms = state.getLength();
    return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
  }
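Note that 1/sqrt(2) is about 0.707 and 1/sqrt(3) is about 0.577, while the explain output reports fieldNorm values of 0.625 and 0.5. The difference comes from the norm being stored in the index in a compact, lossy form. The sketch below assumes a boost of 1 and no overlapping tokens, and mimics the precision loss by truncating the float to two mantissa bits. The truncation helper is a hypothetical stand-in chosen because it reproduces the sample values; it is not Lucene's actual encoder.

```java
/** Illustrates lengthNorm = 1/sqrt(numTerms) and the precision lost when it is stored. */
final class LengthNormCheck {

    // lengthNorm with boost = 1 and no overlapping tokens.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Hypothetical helper (not Lucene code): keep the sign, the exponent and the top
    // two mantissa bits, clearing the low 21 bits of the float representation.
    static float truncateToTwoMantissaBits(float value) {
        int bits = Float.floatToRawIntBits(value);
        return Float.intBitsToFloat(bits & ~((1 << 21) - 1));
    }

    public static void main(String[] args) {
        for (int numTerms = 1; numTerms <= 4; numTerms++) {
            float exact = lengthNorm(numTerms);
            System.out.println("numTerms=" + numTerms
                    + " lengthNorm=" + exact
                    + " truncated=" + truncateToTwoMantissaBits(exact));
        }
    }
}
```

With this approximation, fields of two and three terms give 0.625 and 0.5, the same fieldNorm values that appear in the sample output.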

Reposted from: https://www.cnblogs.com/davidwang456/p/6150388.html
