On KL Distance (KL Divergence)

Published: 2023/12/10
Author: 覃含章
Link: https://www.zhihu.com/question/29980971/answer/103807952
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please cite the source.

KL divergence was originally introduced from information theory; since the question is about its use in ML, I won't go into much detail. Briefly: given a true probability distribution P and an approximating distribution Q, KL divergence expresses how many extra bits are needed, per sample drawn from P, if we encode it using an optimal compression scheme built for Q, compared with using an optimal compression scheme built for P itself. This follows from the Kraft–McMillan theorem.
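This "extra bits" reading can be checked numerically: KL(p||q) measured in bits equals the cross-entropy H(p, q) minus the entropy H(p). A minimal sketch with made-up discrete distributions:

```python
import math

def entropy_bits(p):
    """Optimal average code length (bits) for distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """Average code length when samples come from p but the code is optimal for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_bits(p, q):
    """KL(p||q) in bits: the extra bits per sample paid for coding with q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

# KL equals the coding overhead: cross-entropy minus entropy.
print(kl_bits(p, q))                                    # 0.25 bits
print(cross_entropy_bits(p, q) - entropy_bits(p))       # same value
```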

So it is natural to use it as a statistical distance, because of this intrinsic probabilistic meaning. However, for the same reason, the asymmetry the questioner mentions is unavoidable: D(P||Q) and D(Q||P) answer the "distance" question under different compression schemes.

As for general statistical distances, they have no essential difference. More broadly, KL divergence can be seen as a special case of the phi-divergence (with phi(t) = t log t). Note that the definition is stated for discrete probability distributions, but replacing the sum with an integral naturally gives the continuous version.
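As a sketch of the phi-divergence view (the distributions below are purely illustrative): the generic formula D_phi(P||Q) = sum_i q_i * phi(p_i / q_i) recovers KL with phi(t) = t ln t, and gives total variation with phi(t) = |t - 1|/2.

```python
import math

def phi_divergence(p, q, phi):
    """D_phi(P||Q) = sum_i q_i * phi(p_i / q_i) for discrete distributions."""
    return sum(qi * phi(pi / qi) for pi, qi in zip(p, q) if qi > 0)

kl_phi = lambda t: t * math.log(t) if t > 0 else 0.0   # phi for KL (nats)
tv_phi = lambda t: 0.5 * abs(t - 1)                    # phi for total variation

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

print(phi_divergence(p, q, kl_phi))  # KL(p||q) in nats
print(phi_divergence(p, q, tv_phi))  # total variation distance
```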

Using other divergences is essentially no different, as long as phi is convex and closed.
They all carry a similar probabilistic meaning. For example, Pinsker's inequality guarantees that KL divergence gives a tight bound on the total variation metric; other divergence metrics should admit similar bounds, differing at most in order and constants. Moreover, minimization problems defined with these divergences are all convex, though their computational performance may differ in practice, which is part of why KL is still the most widely used.

Reference: Bayraksan G, Love DK. Data-Driven Stochastic Programming Using Phi-Divergences.
Author: 知乎用戶 (Zhihu user)
Link: https://www.zhihu.com/question/29980971/answer/93489660
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please cite the source.

KL divergence KL(p||q), in the context of information theory, measures the amount of extra bits (or nats) that are necessary to describe samples from the distribution p with a code optimized for q instead of for p itself. From the Kraft–McMillan theorem, we know that a coding scheme for values from a set X corresponds to an implicit probability distribution q(x_i) = 2^(-l_i) over X, where l_i is the length of the code for x_i in bits.

We know that KL divergence is also the relative entropy between two distributions, which gives some intuition as to why it is used in variational methods. Variational methods use functionals in their objective functions (e.g., the entropy of a distribution takes in a distribution and returns a scalar quantity). KL divergence is interpreted as the "loss of information" incurred when using one distribution to approximate another, which is desirable in machine learning: in models that use dimensionality reduction, we would like to preserve as much information from the original input as possible. This is most obvious in VAEs, which use the KL divergence between the posterior q and the prior p over the latent variable z. Likewise, in EM, we decompose

ln p(X) = L(q) + KL(q||p)

Here we maximize the lower bound L(q) by minimizing the KL divergence, which becomes 0 when q(Z) = p(Z|X). However, in many cases we wish to restrict the family of distributions and parameterize q(Z) with a set of parameters w, so that we can optimize with respect to w.
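The decomposition ln p(X) = L(q) + KL(q||p) can be verified on a toy discrete latent-variable model (all numbers below are made up for illustration):

```python
import math

# Toy model: latent z in {0, 1}, one fixed observation x.
# Joint probabilities p(x, z) for the observed x:
p_joint = [0.1, 0.3]                      # p(x, z=0), p(x, z=1)
p_x = sum(p_joint)                        # evidence p(x)
p_post = [pj / p_x for pj in p_joint]     # true posterior p(z|x)

q = [0.5, 0.5]                            # an arbitrary variational distribution

# Evidence lower bound: L(q) = sum_z q(z) ln( p(x, z) / q(z) )
elbo = sum(qz * math.log(pj / qz) for qz, pj in zip(q, p_joint))
# KL(q || p(z|x))
kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, p_post))

# The decomposition holds exactly: ln p(x) = L(q) + KL(q || p(z|x))
print(math.log(p_x), elbo + kl)
```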

Note that KL(p||q) = - \sum_Z p(Z) ln (q(Z) / p(Z)), so KL(p||q) differs from KL(q||p). This asymmetry can be exploited: when we wish to learn a q that spreads its mass to cover all of p (over-compensates), we minimize KL(p||q); conversely, when we wish q to capture just the main components of p, we minimize KL(q||p). The example in Bishop's book (PRML) illustrates this well.
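The asymmetry itself is easy to see numerically on any pair of distinct discrete distributions (the values here are arbitrary):

```python
import math

def kl(p, q):
    """KL(p||q) in nats for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.8, 0.1, 0.1]
q = [0.4, 0.3, 0.3]

# The two directions give different values: KL is not a symmetric distance.
print(kl(p, q), kl(q, p))
```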

KL divergence belongs to the alpha family of divergences, in which the forward and reverse KL arise as separate limits of the parameter alpha. When alpha = 0, the divergence becomes symmetric and is linearly related to the Hellinger distance. There are other symmetric measures, such as the Cauchy–Schwarz divergence, but in machine learning settings, where the goal is to learn simpler, tractable parameterizations of distributions that approximate a target, they may not be as useful as KL.
