Introduction to Elasticsearch Scoring

Published: 2023/12/31

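1. Elasticsearch's scoring formula

Elasticsearch's default scoring formula is Lucene's scoring formula (the classic TF-IDF formula; from Elasticsearch 5.0 onward the default similarity is BM25). It has two parts: one scores the query side and the other scores the field side. The formula given on the Elasticsearch website is:

score(q,d) = queryNorm(q) · coord(q,d) · ∑ ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ) (t in q)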

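queryNorm(q):

Normalizes the query. It does not affect ranking, because the value is the same for every document scored against a given query. In Elasticsearch, however, that holds strictly only when the index has a single shard. Otherwise there are small differences, because each shard computes its own queryNorm value: as many shards, as many queryNorm values.

queryNorm(q) = 1 / √sumOfSquaredWeights

This is the formula from the Elasticsearch website. It assumes the default query boost of 1 and the default term boost of 1, in which case

sumOfSquaredWeights = idf(t1)*idf(t1) + idf(t2)*idf(t2) + ... + idf(tn)*idf(tn)

where n is the number of terms the query is split into. All of the above assumes boosts of 1. With boosts included, Lucene's TFIDFSimilarity documentation gives the general form:

sumOfSquaredWeights = q.getBoost()² · ∑ ( idf(t) · t.getBoost() )² (t in q)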
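The queryNorm calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `query_norm` and the idf values are hypothetical, and the boost handling follows the general form documented for Lucene's TFIDFSimilarity rather than anything taken from a running cluster.

```python
import math


def query_norm(idfs, query_boost=1.0, term_boosts=None):
    """Classic Lucene queryNorm: 1 / sqrt(sumOfSquaredWeights)."""
    if term_boosts is None:
        # Default case from the text: every term boost is 1.
        term_boosts = [1.0] * len(idfs)
    sum_of_squared_weights = query_boost ** 2 * sum(
        (idf * boost) ** 2 for idf, boost in zip(idfs, term_boosts)
    )
    return 1.0 / math.sqrt(sum_of_squared_weights)


# Hypothetical idf values for the same two-term query on two shards.
shard_a = query_norm([3.0, 4.0])  # 1 / sqrt(9 + 16)
shard_b = query_norm([3.2, 4.1])  # slightly different idf, different norm
print(shard_a, shard_b)
```

Because each shard supplies its own idf values, the two calls return different norms, which is the per-shard difference described above.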

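coord(q,d):

coord(q,d) is a coordination factor, defined as:

coord(q,d) = overlap / maxoverlap

where overlap is the number of query terms that the document matches and maxoverlap is the total number of terms in the query. For example, take the query "無線通信" with the default analyzer, which splits it into four single-character terms. If the document is "通知他們開會", only "通" matches, so the value is 1/4 = 0.25.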
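The example above can be reproduced with a small Python sketch. It approximates the default analyzer by splitting the text into single characters, which is an assumption made for illustration; the function name `coord` simply mirrors the factor's name.

```python
def coord(query_terms, doc_terms):
    """overlap / maxoverlap for a list of query terms and a document's terms."""
    unique_query_terms = set(query_terms)
    doc_term_set = set(doc_terms)
    overlap = sum(1 for term in unique_query_terms if term in doc_term_set)
    return overlap / len(unique_query_terms)


# Approximate the default analyzer by treating each character as one term.
query_terms = list("無線通信")
doc_terms = list("通知他們開會")

print(coord(query_terms, doc_terms))
```

Only "通" is shared between the two term lists, so one of the four query terms overlaps.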

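tf(t in d):

The term frequency, based on how many times term t appears in the document. The formula given on the website is:

tf(t in d) = √frequency

That is, the square root of the number of occurrences. There is nothing more to say about it, and actual scoring works the same way.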

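idf(t):

The inverse document frequency, a measure of how many of all the documents contain the term. Elasticsearch computes it slightly differently from plain Lucene: the two agree only when the index has a single shard. With more than one shard, each shard has its own idf value. It is computed as:

idf(t) = 1 + log ( numDocs / (docFreq + 1))

Note that log here is the natural logarithm (base e), not base 10 or base 2. numDocs is the total number of documents; with multiple shards, it is the number of documents in the current shard. docFreq is the number of documents that contain the term; with multiple shards, it is likewise counted within the current shard. This is where the calculation differs from Lucene's. To verify the formula, set the number of shards to 1.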
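A minimal Python sketch of this idf formula follows. The function name `classic_idf` and the document counts are hypothetical, chosen only to show the natural logarithm and the effect of per-shard statistics.

```python
import math


def classic_idf(num_docs, doc_freq):
    """idf(t) = 1 + ln(numDocs / (docFreq + 1)), using the natural logarithm."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical per-shard statistics for the same term.
idf_shard_1 = classic_idf(num_docs=1000, doc_freq=9)   # term is rare here
idf_shard_2 = classic_idf(num_docs=1000, doc_freq=99)  # term is common here
print(idf_shard_1, idf_shard_2)
```

The same term receives a different idf on each shard, so documents on different shards are scored with different weights unless the index has a single shard.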

t.getBoost():

The weight of each term. I have not studied this factor closely. My understanding is that if a boost is set on a field, then every term matched in that field takes the field's boost.

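norm(t,d):

The normalization factor for the field. According to the official explanation, the shorter the field, the higher the weight of a match in it. For example, when searching for 無線通信, one document may be very long and contain all of these characters without being what we want, while another is short but contains 無線通信 in full. We should not treat the second as less relevant just because the phrase appears only once. On the contrary, its weight should be higher. It is the product of the three factors described below:

norm(t,d) = d.getBoost() · lengthNorm(f) · f.getBoost()

d.getBoost: the larger the document's boost, the more important the document.

f.getBoost: the larger the field's boost, the more important the field.

lengthNorm: the longer the field, the less important a match in it; the shorter the field, the more important. In the official documentation all boosts default to 1, which leaves the following formula:

norm(d) = 1 / √numTerms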
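Putting the factors together, the whole classic formula can be sketched in Python as below. This is an illustration under several assumptions, not Lucene's implementation: all boosts are 1, the query terms are distinct, documents are plain lists of terms, and the reduced precision at which Lucene stores norm is ignored. The function name and the sample data are hypothetical.

```python
import math


def classic_score(query_terms, doc_terms, num_docs, doc_freqs):
    """score(q,d) = queryNorm * coord * sum(tf * idf^2 * norm), boosts = 1."""
    idf = {t: 1.0 + math.log(num_docs / (doc_freqs[t] + 1)) for t in query_terms}
    query_norm = 1.0 / math.sqrt(sum(v * v for v in idf.values()))
    matched = [t for t in query_terms if t in doc_terms]
    coord = len(matched) / len(query_terms)
    norm = 1.0 / math.sqrt(len(doc_terms))  # norm(d) = 1 / sqrt(numTerms)
    total = sum(
        math.sqrt(doc_terms.count(t)) * idf[t] ** 2 * norm for t in matched
    )
    return query_norm * coord * total


# Hypothetical statistics: 100 documents, each query term found in 9 of them.
doc_freqs = {"a": 9, "b": 9}
short_doc = ["a", "b"]
long_doc = ["a", "b", "c", "d", "e", "f", "g", "h"]

short_score = classic_score(["a", "b"], short_doc, 100, doc_freqs)
long_score = classic_score(["a", "b"], long_doc, 100, doc_freqs)
print(short_score, long_score)
```

Both documents contain each query term once, yet the short one scores higher because its norm is larger, which is the behaviour the norm discussion above describes.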

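That is the theory. Now for a real example. Note that its explain output reports idf and tfNorm factors with the parameters k1 and b, that is, BM25 scoring rather than the classic formula above.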

GET act_shop-2018.01.12/shop/_search
{"size": 1, "query": {"match": {"name": "星巴克"}}, "explain": true}

The result is:

{
  "took": 25,
  "timed_out": false,
  "_shards": {"total": 150, "successful": 150, "failed": 0},
  "hits": {
    "total": 127667,
    "max_score": 15.511484,
    "hits": [
      {
        "_shard": "[act_shop-2018.01.12][80]",
        "_node": "6vfIeV95QOK1vAcLdx6CEA",
        "_index": "act_shop-2018.01.12",
        "_type": "shop",
        "_id": "187672",
        "_score": 15.511484,
        "_routing": "36341",
        "_parent": "36341",
        "_source": {
          "status": 1,
          "city": {"id": 2084, "name": "虹口區"},
          "update_time": "2017-10-23 15:23:00.329000",
          "tel": ["021-65200108"],
          "name": "星巴克(涼城店)",
          "tags": ["餐飲服務", "咖啡廳", "咖啡廳"],
          "tags_enrich": {"name": "美食", "id": 10},
          "id": 187672,
          "label": "have_act",
          "create_time": "2017-01-11 14:59:43.950000",
          "city_enrich": {"region": "華東地區", "name": "上海", "level": 1},
          "address": "車站南路330弄2號、6號第一、二層的4839F01059",
          "coordinate": {"lat": 31.29496, "lon": 121.475442},
          "brand": {"id": 490, "name": "星巴克"}
        },
        "_explanation": {
          "value": 15.511484,
          "description": "sum of:",
          "details": [
            {
              "value": 15.511484,
              "description": "sum of:",
              "details": [
                {
                  "value": 4.7601295,
                  "description": "weight(name:星 in 6914) [PerFieldSimilarity], result of:",
                  "details": [
                    {
                      "value": 4.7601295,
                      "description": "score(doc=6914,freq=1.0 = termFreq=1.0\n), product of:",
                      "details": [
                        {
                          "value": 4.314013,
                          "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                          "details": [
                            {"value": 159, "description": "docFreq", "details": []},
                            {"value": 11920, "description": "docCount", "details": []}
                          ]
                        },
                        {
                          "value": 1.103411,
                          "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                          "details": [
                            {"value": 1, "description": "termFreq=1.0", "details": []},
                            {"value": 1.2, "description": "parameter k1", "details": []},
                            {"value": 0.75, "description": "parameter b", "details": []},
                            {"value": 9.224329, "description": "avgFieldLength", "details": []},
                            {"value": 7.111111, "description": "fieldLength", "details": []}
                          ]
                        }
                      ]
                    }
                  ]
                },
                {
                  "value": 5.0423846,
                  "description": "weight(name:巴 in 6914) [PerFieldSimilarity], result of:",
                  "details": [
                    {
                      "value": 5.0423846,
                      "description": "score(doc=6914,freq=1.0 = termFreq=1.0\n), product of:",
                      "details": [
                        {
                          "value": 4.5698156,
                          "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                          "details": [
                            {"value": 123, "description": "docFreq", "details": []},
                            {"value": 11920, "description": "docCount", "details": []}
                          ]
                        },
                        {
                          "value": 1.103411,
                          "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                          "details": [
                            {"value": 1, "description": "termFreq=1.0", "details": []},
                            {"value": 1.2, "description": "parameter k1", "details": []},
                            {"value": 0.75, "description": "parameter b", "details": []},
                            {"value": 9.224329, "description": "avgFieldLength", "details": []},
                            {"value": 7.111111, "description": "fieldLength", "details": []}
                          ]
                        }
                      ]
                    }
                  ]
                },
                {
                  "value": 5.70897,
                  "description": "weight(name:克 in 6914) [PerFieldSimilarity], result of:",
                  "details": [
                    {
                      "value": 5.70897,
                      "description": "score(doc=6914,freq=1.0 = termFreq=1.0\n), product of:",
                      "details": [
                        {
                          "value": 5.173929,
                          "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                          "details": [
                            {"value": 67, "description": "docFreq", "details": []},
                            {"value": 11920, "description": "docCount", "details": []}
                          ]
                        },
                        {
                          "value": 1.103411,
                          "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                          "details": [
                            {"value": 1, "description": "termFreq=1.0", "details": []},
                            {"value": 1.2, "description": "parameter k1", "details": []},
                            {"value": 0.75, "description": "parameter b", "details": []},
                            {"value": 9.224329, "description": "avgFieldLength", "details": []},
                            {"value": 7.111111, "description": "fieldLength", "details": []}
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value": 0,
              "description": "match on required clause, product of:",
              "details": [
                {"value": 0, "description": "# clause", "details": []},
                {
                  "value": 1,
                  "description": "_type:shop, product of:",
                  "details": [
                    {"value": 1, "description": "boost", "details": []},
                    {"value": 1, "description": "queryNorm", "details": []}
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

In detail:

1. In the shard "_shard": "[act_shop-2018.01.12][80]", Elasticsearch's standard analyzer splits the match query '星巴克' into three terms: '星', '巴', and '克'. Each term's score is:

'星': 4.7601295

'巴': 5.0423846

'克': 5.70897

Total score: 4.7601295 + 5.0423846 + 5.70897 = 15.511484
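The per-term weights can also be pulled out of an explain tree programmatically. The sketch below walks a trimmed, hand-copied subset of the `_explanation` object from the response (only the three weight nodes, with their children omitted). The helper name `term_weights` and the regular expression are assumptions made for illustration, not part of any Elasticsearch client.

```python
import re

# Trimmed subset of the "_explanation" tree: weight nodes only.
explanation = {
    "value": 15.511484,
    "description": "sum of:",
    "details": [
        {"value": 4.7601295,
         "description": "weight(name:星 in 6914) [PerFieldSimilarity], result of:",
         "details": []},
        {"value": 5.0423846,
         "description": "weight(name:巴 in 6914) [PerFieldSimilarity], result of:",
         "details": []},
        {"value": 5.70897,
         "description": "weight(name:克 in 6914) [PerFieldSimilarity], result of:",
         "details": []},
    ],
}


def term_weights(node):
    """Recursively collect {field:term -> value} from weight(...) nodes."""
    found = {}
    match = re.match(r"weight\((\S+) in \d+\)", node["description"])
    if match:
        found[match.group(1)] = node["value"]
    for child in node.get("details", []):
        found.update(term_weights(child))
    return found


weights = term_weights(explanation)
print(weights, sum(weights.values()))
```

Summing the collected values gives the document's total score, up to floating-point rounding.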

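2. Next, how each term's score is computed, taking '星' as the example:

score('星') = idf · tfNorm (that is, inverse document frequency × normalized term frequency)

idf is computed as follows: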

{"value": 4.7601295,"description": "score(doc=6914,freq=1.0 = termFreq=1.0\n), product of:","details": [{"value": 4.314013,"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:","details": [{"value": 159,"description": "docFreq","details": []},{"value": 11920,"description": "docCount","details": []}]}

docFreq: the number of documents in this shard that contain '星': 159

docCount: the total number of documents in this shard: 11920

Formula: log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) = 4.314013, where log is the natural logarithm
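The idf values in the explain output can be checked with a few lines of Python. The function name `bm25_idf` is made up for this sketch; the formula and the docFreq and docCount figures are the ones reported in the response. Elasticsearch reports single-precision values, so the comparison should allow for small rounding differences.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """idf = ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# docFreq per term on this shard, taken from the explain output.
doc_freqs = {"星": 159, "巴": 123, "克": 67}
idfs = {term: bm25_idf(11920, df) for term, df in doc_freqs.items()}

for term, value in idfs.items():
    print(term, round(value, 6))
```

The rarest term, '克', gets the highest idf, which is why it contributes the most to the total score.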

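tfNorm is computed as follows:

tfNorm can be thought of as a measure of how much of a given document '星' accounts for: its term frequency adjusted for the length of the field.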

{"value": 1.103411,"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:","details": [{"value": 1,"description": "termFreq=1.0","details": []},{"value": 1.2,"description": "parameter k1","details": []},{"value": 0.75,"description": "parameter b","details": []},{"value": 9.224329,"description": "avgFieldLength","details": []},{"value": 7.111111,"description": "fieldLength","details": []}]}

tfNorm=(freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))=1.103411
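A short Python sketch of the tfNorm formula, using the parameter values from the explain output. The function name `bm25_tf_norm` is hypothetical; k1 = 1.2 and b = 0.75 are the values shown in the response.

```python
def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    """(freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))."""
    length_ratio = field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))


# Values reported for '星' in document 6914.
tf_norm = bm25_tf_norm(freq=1, field_length=7.111111, avg_field_length=9.224329)
print(round(tf_norm, 6))

# Same term frequency in a field twice the average length scores lower.
tf_norm_long = bm25_tf_norm(freq=1, field_length=18.448658, avg_field_length=9.224329)
print(round(tf_norm_long, 6))
```

The field here is shorter than average, so tfNorm comes out above 1; a longer-than-average field pushes it below 1.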

So score('星') = idf · tfNorm = 4.314013 × 1.103411 = 4.7601295
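The whole calculation can be reproduced end to end from the raw statistics in the explain output. The sketch below is a check of the arithmetic, not Lucene's implementation: the function name `bm25_term_score` is hypothetical, and the results agree with the reported values only up to single-precision rounding.

```python
import math


def bm25_term_score(doc_freq, doc_count, freq, field_length, avg_field_length,
                    k1=1.2, b=0.75):
    """Score of one term: idf * tfNorm, as in the explain output."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    tf_norm = (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )
    return idf * tf_norm


# docFreq per term; every other input is shared by the three terms.
doc_freqs = {"星": 159, "巴": 123, "克": 67}
scores = {
    term: bm25_term_score(df, 11920, 1, 7.111111, 9.224329)
    for term, df in doc_freqs.items()
}
total = sum(scores.values())

for term, value in scores.items():
    print(term, round(value, 5))
print("total", round(total, 5))
```

The three per-term scores and their sum land on the `_score` reported for the hit, to within rounding.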

Reposted from: https://my.oschina.net/u/3455048/blog/1606033
