On KL Distance (KL Divergence)

Published: 2023/12/10
Author: 覃含章
Link: https://www.zhihu.com/question/29980971/answer/103807952
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please cite the source.

KL divergence was originally introduced from information theory; since the question is about its use in ML, I won't go into much detail. Briefly: given a true probability distribution P and an approximating distribution Q, KL divergence expresses how many extra bits are needed, per sample drawn from P, if we encode it using an optimal compression scheme built for Q, compared with using an optimal compression scheme built for P itself. This follows from the Kraft–McMillan theorem.
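This "extra bits" reading can be checked numerically: KL(p||q) measured in bits equals the cross-entropy H(p, q) minus the entropy H(p). A minimal sketch with made-up discrete distributions:

```python
import math

def entropy_bits(p):
    """Optimal average code length (bits) for distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """Average code length when samples come from p but the code is optimal for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_bits(p, q):
    """KL(p||q) in bits: the extra bits per sample paid for coding with q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

# KL equals the coding overhead: cross-entropy minus entropy.
print(kl_bits(p, q))                                    # 0.25 bits
print(cross_entropy_bits(p, q) - entropy_bits(p))       # same value
```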

So it is natural to use it as a statistical distance, because of this intrinsic probabilistic meaning. However, for the same reason, the asymmetry the questioner mentions is unavoidable: D(P||Q) and D(Q||P) answer the "distance" question under different compression schemes.

As for general statistical distances, they have no essential difference. More broadly, KL divergence can be seen as a special case of the phi-divergence (with phi(t) = t log t). Note that the definition is stated for discrete probability distributions, but replacing the sum with an integral naturally gives the continuous version.
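As a sketch of the phi-divergence view (the distributions below are purely illustrative): the generic formula D_phi(P||Q) = sum_i q_i * phi(p_i / q_i) recovers KL with phi(t) = t ln t, and gives total variation with phi(t) = |t - 1|/2.

```python
import math

def phi_divergence(p, q, phi):
    """D_phi(P||Q) = sum_i q_i * phi(p_i / q_i) for discrete distributions."""
    return sum(qi * phi(pi / qi) for pi, qi in zip(p, q) if qi > 0)

kl_phi = lambda t: t * math.log(t) if t > 0 else 0.0   # phi for KL (nats)
tv_phi = lambda t: 0.5 * abs(t - 1)                    # phi for total variation

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

print(phi_divergence(p, q, kl_phi))  # KL(p||q) in nats
print(phi_divergence(p, q, tv_phi))  # total variation distance
```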

Using other divergences is essentially no different, as long as phi is convex and closed.
They all carry a similar probabilistic meaning. For example, Pinsker's inequality guarantees that KL divergence gives a tight bound on the total variation metric; other divergence metrics should admit similar bounds, differing at most in order and constants. Moreover, minimization problems defined with these divergences are all convex, though their computational performance may differ in practice, which is part of why KL is still the most widely used.

Reference: Bayraksan G, Love DK. Data-Driven Stochastic Programming Using Phi-Divergences.
Author: 知乎用戶 (Zhihu user)
Link: https://www.zhihu.com/question/29980971/answer/93489660
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please cite the source.

KL divergence KL(p||q), in the context of information theory, measures the amount of extra bits (or nats) that are necessary to describe samples from the distribution p with a code optimized for q instead of for p itself. From the Kraft–McMillan theorem, we know that a coding scheme for values from a set X corresponds to an implicit probability distribution q(x_i) = 2^(-l_i) over X, where l_i is the length of the code for x_i in bits.

We know that KL divergence is also the relative entropy between two distributions, which gives some intuition as to why it is used in variational methods. Variational methods use functionals in their objective functions (e.g., the entropy of a distribution takes in a distribution and returns a scalar quantity). KL divergence is interpreted as the "loss of information" incurred when using one distribution to approximate another, which is desirable in machine learning: in models that use dimensionality reduction, we would like to preserve as much information from the original input as possible. This is most obvious in VAEs, which use the KL divergence between the posterior q and the prior p over the latent variable z. Likewise, in EM, we decompose

ln p(X) = L(q) + KL(q||p)

Here we maximize the lower bound L(q) by minimizing the KL divergence, which becomes 0 when q(Z) = p(Z|X). However, in many cases we wish to restrict the family of distributions and parameterize q(Z) with a set of parameters w, so that we can optimize with respect to w.
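The decomposition ln p(X) = L(q) + KL(q||p) can be verified on a toy discrete latent-variable model (all numbers below are made up for illustration):

```python
import math

# Toy model: latent z in {0, 1}, one fixed observation x.
# Joint probabilities p(x, z) for the observed x:
p_joint = [0.1, 0.3]                      # p(x, z=0), p(x, z=1)
p_x = sum(p_joint)                        # evidence p(x)
p_post = [pj / p_x for pj in p_joint]     # true posterior p(z|x)

q = [0.5, 0.5]                            # an arbitrary variational distribution

# Evidence lower bound: L(q) = sum_z q(z) ln( p(x, z) / q(z) )
elbo = sum(qz * math.log(pj / qz) for qz, pj in zip(q, p_joint))
# KL(q || p(z|x))
kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, p_post))

# The decomposition holds exactly: ln p(x) = L(q) + KL(q || p(z|x))
print(math.log(p_x), elbo + kl)
```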

Note that KL(p||q) = - \sum_Z p(Z) ln (q(Z) / p(Z)), so KL(p||q) differs from KL(q||p). This asymmetry can be exploited: when we wish to learn a q that spreads its mass to cover all of p (over-compensates), we minimize KL(p||q); conversely, when we wish q to capture just the main components of p, we minimize KL(q||p). The example in Bishop's book (PRML) illustrates this well.
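The asymmetry itself is easy to see numerically on any pair of distinct discrete distributions (the values here are arbitrary):

```python
import math

def kl(p, q):
    """KL(p||q) in nats for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.8, 0.1, 0.1]
q = [0.4, 0.3, 0.3]

# The two directions give different values: KL is not a symmetric distance.
print(kl(p, q), kl(q, p))
```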

KL divergence belongs to the alpha family of divergences, in which the forward and reverse KL arise as separate limits of the parameter alpha. When alpha = 0, the divergence becomes symmetric and is linearly related to the Hellinger distance. There are other symmetric measures, such as the Cauchy–Schwarz divergence, but in machine learning settings, where the goal is to learn simpler, tractable parameterizations of distributions that approximate a target, they may not be as useful as KL.
