CIKM 2021 | Deep Retrieval: a close reading of ByteDance's deep retrieval model paper

Published: 2024/10/8

Author | 杰尼小子

Affiliation | ByteDance

Research area | Recommendation algorithms

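Motivation

This paper from ByteDance was published at CIKM 2021. The work has been deployed in many ByteDance products with good results. Reading the paper end to end, though, I found quite a few places confusing, possibly because of the page limit. It is worth noting that the paper was previously submitted to ICLR and has an open review, so at the end I also summarise the reviewers' questions and add some thoughts of my own.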

Paper title:

Deep Retrieval: Learning A Retrievable Structure for Large-Scale Recommendations

Paper link:

https://arxiv.org/abs/2007.07203

To the best of our knowledge, DR is among the first non-ANN algorithms successfully deployed at the scale of hundreds of millions of items for industrial recommendation systems.

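As the paper says, the authors propose a retrieval method that is not ANN-based. Earlier methods are usually built on ANN (approximate nearest neighbors) or MIPS (maximum inner product search) algorithms, which have some well-known problems:

ANN-based retrieval models are usually two-tower models. The User Tower and the Item Tower are kept separate, so user-item interaction is shallow: it happens only in the final inner product.
ANN and MIPS algorithms such as LSH, IVF-PQ and HNSW are designed to approximate top-K search or maximum inner product search; they do not directly optimise for the user-item training samples.
The retrieval model and the index are built separately, so the model version and the index version can easily get out of sync, especially when new items arrive.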

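TDM is a tree-based retrieval model proposed by Alibaba, mainly to address this inconsistency: it turns the earlier inner-product-based two-stage pipeline into a one-stage model learned directly from training samples. The authors argue that it has problems of its own. Every item is mapped to a single leaf node, which makes the tree structure itself hard to learn. Data available at the leaf level can be scarce and may not provide enough signal to learn a fine-grained tree for a recommender system with hundreds of millions of items.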

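So the authors also propose an end-to-end method. Unlike TDM-style tree models, each item can be assigned to multiple paths (clusters), which eases the problem of a leaf node having too few samples.

To summarise the motivation: to address the problems of inner-product models, DR learns end-to-end directly from training samples; to address the problems of tree models, DR lets an item belong to multiple paths/clusters. But DR's architecture adds quite a lot of components, and the paper neither runs ablations to isolate the contribution of each part nor shows experimentally how these problems are solved. It only offers "intuitive" arguments for each design choice. With that experimental analysis I think this would be an outstanding piece of work, since its real-world results are solid.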

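Model framework

I was fairly lost after my first read, because the original paper's exposition is scattered. Here I reorganise the logic, hoping to give a clear and easy-to-follow introduction to DR's model framework.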

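2.1 Overview

First, the overall framework. As Figure 1 of the paper shows, DR has D layers, each an MLP with a K-way softmax output. The input to each layer is the original user embedding concatenated with the embeddings of the nodes chosen in all previous layers. This gives $K^D$ paths, and each path can be seen as a cluster.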

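DR's goal is: given a user, map all items the user interacted with in the training set (clicked items, for example) to a few of these clusters. Each cluster is represented by a D-dimensional vector such as [36, 27, 20]. A cluster can contain multiple items, and each item can belong to multiple clusters. All users share the network in the middle. At serving time, feeding in the user's information (for example the user id) finds the clusters relevant to that user, and the items in those clusters become the retrieved item candidates.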

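In addition, the candidate set produced by this process can be large, so DR adds a rerank module after it. This rerank is not the usual re-ranking/mixing stage that follows fine ranking; it is used to further filter the items retrieved by the DR model, as shown in the same figure. In other words, the candidates retrieved by DR are treated as a new training set and fed to the rerank model. In the offline experiments on public datasets the paper learns it with softmax, while the production ByteDance systems use LR. The softmax classification loss, written here as a log-likelihood to maximise, helps make this concrete:

$$Q_{\text{softmax}} = \sum_{i=1}^{N} \log p_{\text{softmax}}(y = y_i \mid x_i)$$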

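Here the softmax output has size V (the total number of items) and is optimised with sampled softmax. My reading is that the purpose is: given a user x, predict the confidence that the user has interacted with each item, and pass the top N on to the next stage.

The rerank subsection is titled "Multi-task Learning and Reranking with Softmax Models"; I honestly did not figure out why it counts as multi-task. Also, is the reason for controlling volume here in rerank, rather than filtering in the coarse-ranking stage, that DR usually returns so many items that they would crowd out the candidates of other retrieval channels and dominate coarse-ranking training?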

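From this overall framework we can see several characteristics:

DR's input uses only user information, not item information, similar to YouTube DNN.
DR's training set contains only positive samples, no negative samples.
DR has to learn not only the parameters of each layer but also, given a user, how to map an item to a cluster/path (the item-to-path mapping $\pi$).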

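The introduction says that inner-product retrieval models have too little user-item interaction, and then DR does not use item information at all...

Next we go through how each part of the model is learned. First, the notation: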

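There are D MLP layers, each with K nodes.
$\mathcal{V} = \{1, \dots, V\}$ is the set of labels of all items.
The item-to-path mapping $\pi: \mathcal{V} \to [K]^D$ describes how an item is mapped to some of the paths.
(x, y) is one positive sample between user x and item y (a click, conversion, like, etc.).
$c = \pi(y)$ is the path that item y is mapped to, where $c = (c_1, \dots, c_D)$ and $c_d \in \{1, \dots, K\}$.
The probability of a path is the product of the node probabilities over the D MLP layers: $p(c \mid x, \theta) = \prod_{d=1}^{D} p(c_d \mid x, c_1, \dots, c_{d-1}, \theta_d)$.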
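
To make the path probability concrete, here is a minimal sketch that multiplies per-layer softmax outputs along a path. The function `layer_logits` is a hypothetical stand-in for DR's MLPs (it just produces deterministic toy logits from the user vector and the nodes chosen so far); only the chain-rule factorisation is the point.

```python
import math
from itertools import product

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def layer_logits(user, prefix, K):
    """Hypothetical stand-in for one DR layer: toy logits that depend on
    the user vector and on the nodes already chosen (the prefix)."""
    depth = len(prefix)
    return [
        sum(u * math.sin((k + 1) * (i + 1) + depth + sum(prefix))
            for i, u in enumerate(user))
        for k in range(K)
    ]

def path_probability(user, path, K):
    """p(c | x) as the product over layers of p(c_d | x, c_1..c_{d-1})."""
    prob = 1.0
    for d, node in enumerate(path):
        probs = softmax(layer_logits(user, path[:d], K))
        prob *= probs[node]
    return prob

K, D = 3, 2
user = [0.5, -1.0, 2.0]
all_paths = list(product(range(K), repeat=D))  # K**D paths in total
total = sum(path_probability(user, c, K) for c in all_paths)
```

Because every layer is a proper softmax conditioned on the prefix, the probabilities of all $K^D$ paths sum to 1, which is what lets a path probability be read as a cluster membership probability for the user.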

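2.2 Learning the network parameters

As noted, we have to learn both the parameters of the D MLP layers and the item-to-path mapping. Here we first hold the path assignment fixed, that is, we assume we already know which paths each item belongs to ($\pi$ is given).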

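Given a training set of N samples, the log-likelihood to maximise when each item has a single path is:

$$Q_{\text{str}}(\theta, \pi) = \sum_{i=1}^{N} \log p(\pi(y_i) \mid x_i, \theta)$$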

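The authors argue that assigning each item to a single cluster is not enough: chocolate, for example, can be "food" and can also be "gift". So they map each item to J paths. The paths of item $y_i$ are then $\{c_{i,1}, \dots, c_{i,J}\}$ with $c_{i,j} = \pi_j(y_i)$, and the multi-path log-likelihood is:

$$Q_{\text{str}}(\theta, \pi) = \sum_{i=1}^{N} \log \left( \sum_{j=1}^{J} p(c_{i,j} = \pi_j(y_i) \mid x_i, \theta) \right)$$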

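The probability of belonging to multiple paths is the sum of the probabilities of belonging to each path. But optimising this objective directly has a serious problem: it suffices to assign every item to one single path, so that for every user the probability of that path is 1. All items then sit in one cluster and retrieval is useless. A penalty is therefore added:

$$Q_{\text{pen}}(\theta, \pi) = Q_{\text{str}}(\theta, \pi) - \alpha \sum_{c \in [K]^D} f(|c|)$$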

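Here $\alpha$ is the penalty factor and $|c|$ is the number of items assigned to path c; in ByteDance's experiments $f(|c|) = |c|^4 / 4$. Combined with the rerank learning described earlier, the final objective is $Q = Q_{\text{pen}} + Q_{\text{softmax}}$.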
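
A toy calculation shows why the penalty matters. The sketch below assumes the penalised objective has the form "log-likelihood minus $\alpha \sum_c f(|c|)$" with $f(|c|) = |c|^4/4$; the four items, the probabilities 1.0 and 0.7, and the value of `alpha` are made-up numbers for illustration only.

```python
import math
from collections import Counter

def f(size):
    """Penalty on a path holding `size` items: |c|**4 / 4."""
    return size ** 4 / 4

def q_pen(samples, pi, alpha):
    """samples: list of (item, {path: p(path | user)}) pairs.
    pi: dict item -> list of its J paths.
    Returns the log-likelihood minus alpha * sum over paths of f(|c|)."""
    q_str = sum(
        math.log(sum(probs.get(c, 0.0) for c in pi[item]))
        for item, probs in samples
    )
    # Empty paths contribute f(0) = 0, so only occupied paths are counted.
    sizes = Counter(c for paths in pi.values() for c in paths)
    return q_str - alpha * sum(f(n) for n in sizes.values())

items = ["i1", "i2", "i3", "i4"]

# Degenerate solution: every item on path (0, 0), which the model
# then predicts with probability 1 for every user.
pi_collapsed = {v: [(0, 0)] for v in items}
samples_collapsed = [(v, {(0, 0): 1.0}) for v in items]

# Spread solution: each item on its own path, predicted with p = 0.7.
paths = [(0, 0), (0, 1), (1, 0), (1, 1)]
pi_spread = {v: [c] for v, c in zip(items, paths)}
samples_spread = [(v, {c: 0.7}) for v, c in zip(items, paths)]

no_pen = (q_pen(samples_collapsed, pi_collapsed, 0.0),
          q_pen(samples_spread, pi_spread, 0.0))
with_pen = (q_pen(samples_collapsed, pi_collapsed, 0.1),
            q_pen(samples_spread, pi_spread, 0.1))
```

Without the penalty the collapsed assignment scores higher (log 1 = 0 per sample), while with `alpha = 0.1` the quartic growth of `f` makes the crowded path expensive and the spread assignment wins.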

At this point we know how to optimise the network parameters when the paths are given. The tricky part is deciding which paths each item is assigned to.

2.3 Beam Search for Inference

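Before covering how the mapping is learned, let's look at inference: given a user x, how do we find which clusters the user belongs to? The setting is similar to seq2seq inference in NLP, where the mainstream choices are Greedy Search and Beam Search; the paper uses Beam Search.

Let B be the beam width. Each layer has K*B candidates, so the per-layer complexity is O(KB log B) and the overall complexity is O(DKB log B). The procedure: start from the empty path; at each layer extend every kept path by all K nodes, score each extension by its path probability, and keep the top B.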
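
The layer-by-layer search can be sketched as follows. This is a minimal illustration, not the paper's implementation: `layer_probs` is a hypothetical toy replacement for the trained MLP layers, and `heapq.nlargest` stands in for whatever top-B selection a production system would use.

```python
import heapq
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def layer_probs(user, prefix, K):
    """Hypothetical stand-in for one DR layer: a K-way distribution that
    depends on the user vector and the nodes chosen so far."""
    depth = len(prefix)
    logits = [
        sum(u * math.cos((k + 1) * (i + 1) + depth + sum(prefix))
            for i, u in enumerate(user))
        for k in range(K)
    ]
    return softmax(logits)

def beam_search(user, D, K, B):
    """Return up to B (path, probability) pairs, most probable first."""
    beams = [((), 1.0)]  # start from the empty path
    for _ in range(D):
        candidates = []
        for prefix, p in beams:
            probs = layer_probs(user, prefix, K)
            for k in range(K):
                # Extend the prefix by node k; path probability multiplies.
                candidates.append((prefix + (k,), p * probs[k]))
        # Keep the B most probable of the (at most K * B) extensions.
        beams = heapq.nlargest(B, candidates, key=lambda t: t[1])
    return beams

user = [0.3, 1.2, -0.7]
top2 = beam_search(user, D=3, K=4, B=2)
everything = beam_search(user, D=2, K=3, B=9)  # B = K**D keeps all paths
```

With B as large as $K^D$ nothing is pruned and the search degenerates to full enumeration; smaller B trades recall of high-probability paths for speed, which is the concern one of the ICLR reviewers raises about very large B.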

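2.4 Item-to-path learning

Now the main event: how to decide which paths an item goes to. Unlike the network parameters, this mapping is discrete, so it cannot be optimised by gradient descent. The authors therefore alternate in an EM-style procedure. First randomly initialise $\pi$ and the other parameters $\theta$; then in epoch t:

E-step: fix $\pi$ and optimise $\theta$ by maximising the structure objective.
M-step: update $\pi$, also by maximising the structure objective.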

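First ignore the penalty term and try to optimise $Q_{\text{str}}$: given $\theta$, we need to choose $\pi$ so that $Q_{\text{str}}$ is maximised. Rearranging the formula by item:

$$Q_{\text{str}}(\theta, \pi) = \sum_{v=1}^{V} \left( \sum_{i:\, y_i = v} \log \left( \sum_{j=1}^{J} p(\pi_j(v) \mid x_i, \theta) \right) \right)$$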

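The outer sum runs over all items, and the inner sum runs over all samples in which item v appears. The most direct approach is to enumerate all $K^D$ paths and pick the top J, which is clearly infeasible. The authors therefore work with an upper bound of the objective:

$$Q_{\text{str}}(\theta, \pi) \le \bar{Q}_{\text{str}}(\theta, \pi) = \sum_{v=1}^{V} \left( N_v \log \left( \sum_{j=1}^{J} \sum_{i:\, y_i = v} p(\pi_j(v) \mid x_i, \theta) \right) - N_v \log N_v \right)$$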

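where $N_v$ is the number of times item v appears in the training set; the key move is that the sum over samples has been pushed inside the log. Define the score function as:

$$s[v, c] \triangleq \sum_{i:\, y_i = v} p(c \mid x_i, \theta)$$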

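So the score can be read as the total probability that item v accumulates on path c. Of course it is impossible to enumerate every c, so beam search is used to pick the top S paths (the paper does not say exactly how many), and the scores of all other paths are set to 0.

Normally one maximises a lower bound of a function, or minimises an upper bound, in order to optimise it. Here it is the other way round: the authors acknowledge that this is not guaranteed to improve the objective, but it works in their experiments, so they optimise it this way.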
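
The upper bound itself is an instance of Jensen's inequality for the concave log. A short derivation for a single item v, where $N_v$ is the number of training samples whose item is v and $b_i$ is shorthand introduced here for the sum of the probabilities of v's J paths under sample i:

```latex
\begin{aligned}
\sum_{i:\, y_i = v} \log b_i
  &= N_v \cdot \frac{1}{N_v} \sum_{i:\, y_i = v} \log b_i \\
  &\le N_v \log\!\left( \frac{1}{N_v} \sum_{i:\, y_i = v} b_i \right)
     && \text{(Jensen, } \log \text{ is concave)} \\
  &= N_v \log\!\left( \sum_{i:\, y_i = v} b_i \right) - N_v \log N_v ,
\qquad b_i = \sum_{j=1}^{J} p(\pi_j(v) \mid x_i, \theta).
\end{aligned}
```

Summing over v and swapping the order of the sums over i and j inside the log gives a bound that depends on the data only through per-item, per-path totals. Equality holds only when all $b_i$ for an item are equal, so the gap can be large when users who interact with the same item are spread over very different paths.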

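Now that $s[v, c]$ is defined, combining it with the final objective, the M-step objective is:

$$\arg\max_{\{\pi_j(v)\}_{j=1}^{J}} \sum_{v=1}^{V} \left( N_v \log \left( \sum_{j=1}^{J} s[v, \pi_j(v)] \right) - N_v \log N_v \right) - \alpha \sum_{c \in [K]^D} f(|c|)$$

The term $N_v \log N_v$ does not depend on the $\pi$ being optimised and can be dropped, so the final objective is

$$\arg\max_{\{\pi_j(v)\}_{j=1}^{J}} \sum_{v=1}^{V} N_v \log \left( \sum_{j=1}^{J} s[v, \pi_j(v)] \right) - \alpha \sum_{c \in [K]^D} f(|c|)$$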

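The objective above has no closed-form solution, so it is solved with a coordinate descent algorithm: hold the assignments of all other items fixed and optimise $\pi(v)$ for one item at a time. For each v, its paths $\pi_1(v), \dots, \pi_J(v)$ are chosen one step at a time.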

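At step i, the gain from choosing $c = \pi_i(v)$ is:

$$N_v \left( \log \left( \sum_{j=1}^{i-1} s[v, \pi_j(v)] + s[v, c] \right) - \log \left( \sum_{j=1}^{i-1} s[v, \pi_j(v)] \right) \right) - \alpha \left( f(|c| + 1) - f(|c|) \right)$$

So starting from i=1 we greedily pick the path with the largest gain as the current path. The authors state that 3 to 5 iterations are enough for the algorithm to converge. The time complexity grows linearly with the vocabulary size V, the path multiplicity J and the number of candidate paths S.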
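
The greedy step for a single item can be sketched as below. It is an illustration under stated assumptions, not the paper's code: the gain is taken to be "N_v times the increase in the log of the accumulated score, minus alpha times the increase in f", with $f(|c|) = |c|^4/4$; the function name `assign_paths`, the requirement that an item's J paths be distinct, and all numbers are choices made for this example.

```python
import math
from collections import Counter

def f(size):
    """Penalty on a path holding `size` items: |c|**4 / 4."""
    return size ** 4 / 4

def assign_paths(scores, n_v, path_sizes, J, alpha):
    """Greedily choose up to J paths for one item v.

    scores:     dict path -> s[v, c] over the candidate paths of v.
    n_v:        number of training samples of item v.
    path_sizes: Counter path -> |c| from the other items' assignments;
                updated in place as paths are chosen.
    """
    chosen, acc = [], 0.0
    for _ in range(J):
        best, best_gain = None, -math.inf
        for c, s in scores.items():
            if c in chosen or s <= 0.0:
                continue  # assumption: distinct paths, positive score
            # n_v * log(acc) is identical for every c at this step, so it
            # is dropped; this also avoids log(0) on the first step.
            gain = n_v * math.log(acc + s) - alpha * (
                f(path_sizes[c] + 1) - f(path_sizes[c]))
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:
            break
        chosen.append(best)
        acc += scores[best]
        path_sizes[best] += 1
    return chosen

scores = {(0, 0): 0.6, (0, 1): 0.3, (1, 1): 0.1}
# Path (0, 0) already holds 5 other items; the rest are empty.
plain = assign_paths(scores, 10, Counter({(0, 0): 5}), J=2, alpha=0.0)
penalised = assign_paths(scores, 10, Counter({(0, 0): 5}), J=2, alpha=0.1)
```

With `alpha = 0.0` the item simply takes its two highest-scoring paths; with `alpha = 0.1` the crowded path (0, 0) becomes too expensive and the item moves to the two emptier ones. Each step scans the S candidates once, which matches the linear dependence on J and S noted above.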

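2.5 Merging scores from two data streams

Because ByteDance trains on streaming data, scores from successive batches have to be merged. The basic idea is to keep tracking the top S scores: for item v we hold a score list $\mathcal{S}$; when a new batch arrives we get a new score list $\mathcal{S}'$, and the two are merged as follows:

$$s[v, c] \leftarrow \begin{cases} \eta\, s[v, c] + s'[v, c] & c \in \mathcal{S} \cap \mathcal{S}' \\ \eta\, s[v, c] & c \in \mathcal{S},\ c \notin \mathcal{S}' \\ \eta \min(\mathcal{S}) + s'[v, c] & c \notin \mathcal{S},\ c \in \mathcal{S}' \end{cases}$$

where $\eta$ is a decay factor and $\min(\mathcal{S})$ is the smallest score in $\mathcal{S}$; the top S paths of the merged list are then kept.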

The authors note that this method increases the chance of exploring new paths in the streaming setting.
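
A small sketch of such a merge for one item follows. The rule coded here is my reading of the paper and should be treated as an assumption: old scores are decayed by a factor `eta`, new scores are added, and a path seen only in the new batch starts from `eta` times the smallest old score rather than from zero. The function name `merge_scores` and the numbers are illustrative.

```python
def merge_scores(old, new, eta, S):
    """Merge two {path: score} dicts for one item and keep the top S.

    Assumed rule: decay old scores by eta, add the new batch's scores,
    and seed paths that are new to the list with eta * min(old scores).
    """
    floor = min(old.values()) if old else 0.0
    merged = {}
    for c in set(old) | set(new):
        if c in old and c in new:
            merged[c] = eta * old[c] + new[c]
        elif c in old:
            merged[c] = eta * old[c]
        else:
            merged[c] = eta * floor + new[c]
    top = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:S]
    return dict(top)

old = {(0, 0): 4.0, (0, 1): 2.0}
new = {(0, 1): 1.0, (1, 1): 0.5}
kept_all = merge_scores(old, new, eta=0.5, S=3)
kept_two = merge_scores(old, new, eta=0.5, S=2)
```

Seeding an unseen path with the decayed minimum gives it a head start over starting from zero, which is one way to read the remark about exploring new paths.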

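Experiments

The experiments are limited and there is not much to highlight, so briefly:

For the offline public datasets, because the training and test sets are kept separate, user ids are not used to look up embeddings. Instead the user's behaviour history is cut to length 69 (truncated if longer, padded with a default value if shorter) and passed through a GRU; the final embedding is used as the input.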
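
The fixed-length preprocessing might look like the sketch below. The text only says that histories are truncated or padded to 69; keeping the most recent items, padding on the left, and using id 0 as the default are assumptions of this example, as is the helper name `fix_length`.

```python
def fix_length(history, length=69, pad_id=0):
    """Return exactly `length` item ids.

    Assumptions: keep the most recent items when truncating, and
    left-pad with `pad_id` so the latest behaviour stays at the end.
    """
    recent = history[-length:]
    return [pad_id] * (length - len(recent)) + recent

long_history = list(range(1, 101))   # 100 behaviours, ids 1..100
short_history = [7, 8]

long_fixed = fix_length(long_history)
short_fixed = fix_length(short_history)
```

Left-padding keeps the most recent behaviour in the last position, which is convenient when the final hidden state of a recurrent encoder such as a GRU is used as the user embedding.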

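Experimental results:

Effect of hyperparameters; all of these were run on the Amazon dataset, so there is no need to analyse them here:

Online results:

In A/B tests the authors found DR to be friendlier to long-tail videos and long-tail creators. Their explanation is that within each path of the DR structure items are indistinguishable, which allows less popular items to be retrieved as long as they share some similar behaviour with popular items.

DR suits streaming training, and building the retrieval structure takes far less time than HNSW, because DR's M-step involves no computation of user or item embeddings. With a multi-threaded CPU implementation, processing all items takes about 10 minutes.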

ICLR review

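Here are some of the ICLR reviews (review link [1]) so we can think through the questions together. The authors did not submit a rebuttal, so their replies are not available. Many of the review comments really hit the nail on the head.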

The paper lacks a motivation for using the proposed scheme. It says that for tree-based models, the number of parameters is proportional to the number of clusters and hence it is a problem. This is not clear why this is such a problem. Successful application of tree-structure for large-scale problem has been demonstrated in [1,2]. Also, it is not clear how the proposed method addresses data scarcity, which according to the paper happens only in tree-based methods, and not in the proposed method as there are no leaves.
It is not clear how the proposed structure model (of using K \times D matrix) is different from the Chen etal 2018. The differences and similarities compared to this work should be clearly specified. Also, what seems to be missing is why such an architecture of using stacked multi-layer perceptrons should lead to better performance especially in positive data-scarcity situations where most of the users 'like' or 'buy' only few items.
The experimental comparison looks unclear and incomplete. The comparison should also be done with the approach proposed in Zhou etal 2020 ICML paper. At the end of Page 6 it is said that the results of JTM were only available for Amazon Books. How do you make sure that same training and test split (as in JTM) is used as the description says that test and validation set is done randomly. Also, it would also be good to see the code and be able to reproduce the results.
References of some key papers are based on arxiv versions, such as Chen etal 2018 and Zhou etal. 2020, where both the papers have been accepted in ICML conference of respective years.
The authors mainly claim that the objective of learning vector representation and good inner-product search are not well aligned, and the dependence on inner-products of user/item embeddings may not be sufficient to capture their interactions. This is an ongoing research discussion on this domain. I'd recommend the authors to refer to a recent paper, proposing the opposite direction from this submission: Neural Collaborative Filtering vs. Matrix Factorization Revisited (RecSys 2020) arxiv.org/abs/2005.0968
According to Figure 1, a user embedding is given as an input, and the proposed model outputs probability distribution over all possible item codes, which in turn interpreted as items. That being said, it seems the user embedding is highly important in this model. A user can be modeled in a various ways, e.g., as a sequence of items consumed, or using some meta-data. If the user embeddings are not representative enough, the proposed model may not work, and on the other hand, if the user embedding is strong, it will estimate the probs more precisely. We would like to see more discussion on this.
In the experiment, there are multiple points that can be addressed. (a) Related to the point #2, the quality of embeddings is not controlled. Thus, comparing DR against brute-force proves that the proposed method is effective on MIPS, but not on the end-to-end retrieval. Ideally, we'd like to see experiments with multiple SOTA embeddings to see if applying DR to those embeddings still improves end-to-end retrieval performance. See examples below. (b) The baselines used in the experiment are not representing the current SOTA. Item-based CF is quite an old method, and YouTube DNN is not fully reproducible due to the discrepancy on input features (which are not publicly available outside of YouTube). We recommend comparing against / using embeddings of LLORMA (JMLR'16), EASE^R (WWW'19), and RecVAE (WSDM'20). (c) Evaluation metrics are somewhat arbitrary. The authors used only one k for P@k, R@k, and F1@k, arbitrarily chosen for each dataset. This may look like a cherry-picking, so we recommend to report scores with multiple k, e.g., {1, 5, 10, 50, 100}. Taking a metric like MAP or NDCG is another option.
The main contribution of this paper seems faster retrieval on MIPS. Overall, the paper is well-written. We recommend adding more intuitive description why the proposed mathematical form guarantees / leads to the optimal / better alignment to the retrieval structure. That is, how/why the use of greedy search leads to the optimal selection of item codes.
The significant feature of the DR model (from the claim in the abstract) to encode all candidates in a discrete latent space. However, there are some previous attempts in this direction that are not discussed. For example VQ-VAE[1] also learns a discrete space. Another more related example is HashRec[2], which (end-to-end) learns binary codes for users and items for efficient hash table retrieval. It's not clear of the connections and why the proposed discrete structure is more suitable.
The experiments didn't show the superiority of the proposed method. As a retrieval method, the most common comparison method (e.g. github.com/erikbern/ann) is the plot of performance-retrieval time, which is absent in this paper. The paper didn't compare the efficiency against the baselines like TDM, JTM, or ANN-based models, which makes the experiments less convincing as the better performance may due to the longer retrieval time.
What's the performance of purely using softmax?
It seems only DR uses RNNs for sequential behavior modeling, while the baselines didn't. This'd be a unfair comparison, and sequential methods should be included if DR uses RNN and sequential actions for training.
I didn't understand the motivation of using the multi-path extension. As you already encode each item in D different clusters, this should be enough to express different aspects with a larger D. Why a multi-path variant is needed for making the model more expressive?
The Beam Search may not guarantee sub-linear time complexity due to the new hyper-parameter B. It's possible that a very large B is needed for retrieving enough candidates.

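Some open questions

1. The EM training relies on maximising an upper bound of the objective. How can convergence be guaranteed theoretically?

2. Would performance drop a lot without the rerank module?

3. For low-activity users the D-layer MLP may not be learned well. Could retrieval quality actually be worse for them? It would be worth breaking the results down by user segment.

4. The paper has no qualitative analysis. I am curious whether the items within each path are really related enough, because nothing in the M-step seems to push related items onto the same path.

Summary

This paper describes work that has already achieved strong results in many ByteDance scenarios, but it shares relatively little hands-on production experience, and the arguments and experiments backing some claims are thin. There are many hyperparameters, so deploying it for real would likely require substantial tuning for your own business.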
