

The Math Behind Deepfakes

Published: 2023/12/15


An Introductory Guide

Although many are familiar with the incredible results produced by deepfakes, most people find it hard to understand how the deepfakes actually work. Hopefully, this article will demystify some of the math that goes into creating a deepfake.


Deepfake generally refers to videos in which the face and/or voice of a person, usually a public figure, has been manipulated using artificial intelligence software in a way that makes the altered video look authentic. — Dictionary.com


It turns out that deepfake is a sort of umbrella term, with no single definitive way to create one. However, most deepfakes are created with a deep learning framework known as generative adversarial nets, or GANs, so that will be the main focus of this article.

What is a GAN?

Generative adversarial nets, or GANs for short, are a class of deep learning models first proposed in a 2014 paper by Ian Goodfellow and his colleagues. A GAN operates by simultaneously training two neural networks in an adversarial game.

Abstractly, we have a generative model G that tries to learn a distribution p_g replicating p_data, the distribution of the data set, while a discriminative model D tries to determine whether a piece of data came from the data set or from the generator. Although seeing this for the first time may be intimidating, the math becomes relatively straightforward when looking at an example.

Classically, GANs are explained using the analogy of producing counterfeit money. To set up the situation, there is an organization of counterfeiters who try to produce counterfeit money, while the police try to detect whether or not money is counterfeit. Here, our counterfeiters can be treated as the generative model G that produces fake money with the distribution p_g. A distribution is essentially a "map" of characteristics that describes the features of money; basically, the counterfeiters are producing money with some set of characteristics described by the distribution p_g. Furthermore, the role of the police is to discriminate between real and counterfeit money, so they play the part of the discriminative model D. In practice, these models are often multi-layer perceptrons, but there is no need to specify the type of neural network when only discussing theory.

this is a setup for our example scenario

Initially, the money produced by the counterfeiters might have many flaws, so the police can easily detect that the money was produced by the counterfeiters; in other words, the police know when money comes from the distribution p_g. As time progresses, both the police and the counterfeiters become more proficient in their work. For the counterfeiters, this means that the money they produce will better resemble real money; mathematically, this is shown when the distribution of counterfeit money, p_g, approaches the distribution of real money, p_data. On the other hand, the police become more accurate at detecting whether money comes from p_data or p_g. However, the counterfeiters will eventually reach a point where the counterfeit money can pass for real money and fool the police. This occurs when the distributions p_g and p_data are the same; simply put, the features of the counterfeit money match those of real money. It turns out that this "distance" between the two distributions can be measured in many ways, each working slightly differently. With this knowledge in hand, we can set a goal for the counterfeiters: learn the distribution p_g such that it equals the distribution of the data, p_data. Similarly, we set a goal for the police: maximize the accuracy of detecting counterfeit money.

this is a slightly more formal setup for the example

Up until now, we have largely neglected the specifics of how these models actually operate, so we will begin by describing the generator G. Going back to the counterfeit money example, our generator needs to take in some input that specifies what kind of money is being created. This means that the input corresponding to creating a one dollar bill will differ from the input corresponding to creating a ten dollar bill. For consistency, we will define this input using the variable z, which comes from the distribution p_z. The distribution p_z gives a rough idea of what kinds of money can be counterfeited. Furthermore, the outputs of the generator, expressed as G(z), can be described with the distribution p_g. Shifting our focus to the discriminator, we begin by examining the role it plays. Namely, our discriminator should tell us whether a piece of data came from our data set or from the generator. It turns out that probabilities are perfectly suited for this! Specifically, when our discriminator takes in some input x, D(x) should return a number between 0 and 1 representing the probability that x is from the data set. To see why our discriminator is allowed to return values between 0 and 1, we will examine the case where an input only somewhat resembles something from the data set. Revisiting our previous example, say we had a US dollar with small scuff marks in the corner and another US dollar with a portrait of Putin printed on it. Without a doubt, the second bill is much more suspicious than the first, so it is easily classified as fake (the discriminator returns 0). However, our first bill still has a chance of being genuine, and classifying it with a 0 would mean it looks just as bad as bill number two. Obviously, we would be losing some information regarding bill one, and it might be best to classify it with a number like 0.5, where our discriminator has some doubt that the bill is genuine but is not certain that it is a fake. Simply put, our discriminator returns a number that represents its confidence level that an input comes from the data set.
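To make this 0-to-1 confidence concrete, a discriminator typically squashes a real-valued "authenticity score" through a sigmoid function. Here is a minimal sketch; the scores below are invented purely to illustrate the two bills from the example:

```python
import math

def discriminator_confidence(score: float) -> float:
    """Squash an arbitrary 'authenticity score' into (0, 1) with a sigmoid.

    A value near 1 means "probably from the real data set";
    a value near 0 means "probably from the generator".
    """
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical authenticity scores for the two bills in the example:
scuffed_bill_score = 0.0   # slightly suspicious: the model is unsure
putin_bill_score = -6.0    # wildly off: almost certainly fake

d_scuffed = discriminator_confidence(scuffed_bill_score)  # exactly 0.5: genuine doubt
d_putin = discriminator_confidence(putin_bill_score)      # close to 0: confident fake
```

The scuffed bill lands at 0.5, expressing doubt rather than a hard verdict, while the obviously fake bill is pushed toward 0.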

Deriving an Error Function

Now that we have a rough understanding of what our models G and D should be doing, we still need a way to evaluate their performance; this is where error functions come into play. Basically, an error function, E, tells us how poorly our model is performing given its current set of parameters. For example, say we had a model being trained to recognize various objects. If we showed the model a bicycle and the model sees a tricycle, the error function would return a relatively small error, since the two are so similar. However, if the model saw the bicycle as a truck or a school building, the error function would return a much larger number, as there is little to no similarity between these. In other words, error is low if the predictions of our model closely match the actual data, and error is large when the predictions do not match the actual data at all.

Armed with this knowledge, we can begin laying out some desired characteristics that our error function should have. First of all, the error function should return a large number when our discriminator misclassifies data, and a small number when data is classified correctly. In order to understand what this means, we begin by defining classifications. Essentially, a classification is a label for some piece of data. For example, a red robin would be put under the classification of birds, while a tuna would be put under the classification of fish. In our case, an input to our discriminator can come from two places: the data set or the generator. For convenience (the reason will become clear later on), we classify data that comes from the generator by giving it a label of 0, while data that comes from the data set is given the label 1. Using this, we can further elaborate on our error function. For example, say we have some piece of data, x, with the label 1. If our discriminator predicts that x is from the data set (D(x) returns a number close to 1), then our discriminator will have correctly predicted the classification of x and the error will be low. However, if our discriminator predicts that x came from the generator (D(x) returns a number close to 0), then our discriminator will have incorrectly classified our data and the error will be high.

this represents how our error function should behave

As we look for an ideal function, we notice that the graph of y = log(x) on the interval [0,1] matches our specification after some manipulation.


the graph of y = log(x)

In particular, flipping the graph around the x-axis results in the error function for when our label is 1. Reflecting this new graph across the line x = 0.5 then reveals the error function for when our label is 0. The equations for these are y = -log(x) and y = -log(1-x) respectively, and can be seen below.

the error functions for label = 0 (left) and label = 1 (right)

Putting these two functions together, we can create the following “piece-wise” function.

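The equation image appears to be missing from this copy of the article. Based on the two curves above, the piecewise function would read (a reconstruction, not the original image):

```latex
E =
\begin{cases}
  -\log\big(D(x)\big), & \text{label} = 1 \quad (x \text{ from the data set}) \\[4pt]
  -\log\big(1 - D(G(z))\big), & \text{label} = 0 \quad (\text{input from the generator})
\end{cases}
```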

We can substitute x = D(G(z)) when label = 0 and x = D(x) when label = 1. When label = 0, we are evaluating the error of our discriminator when it takes an image from the generator as input. When label = 1, we are finding the error of our discriminator when it takes something from our data set as input.

Unfortunately, this formula is a little cumbersome to write out, so we want to find a way to reduce it down to one line. We begin by giving our error function a proper name, like E. Additionally, we will want a variable to represent our label, since writing out "label" is inefficient; we will call this new variable y. Here is where a little bit of genius comes into play. When we treat y not only as a label but also as a number, we can reduce the formula to the following:
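The one-line formula shown here as an image is missing from this copy; reconstructed from the discussion that follows, it is:

```latex
E = -(1 - y)\,\log\big(1 - D(x)\big) \;-\; y\,\log\big(D(x)\big)
```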

Notice that when y = 0 (label is 0), the (1 - y) coefficient turns into 1, while the y·log(D(x)) term turns into 0. When y = 1 (label is 1), something similar occurs: the first term reduces to 0, leaving us with -log(D(x)). These results exactly match our "piece-wise" function. As a side note, this error function is also known as binary cross entropy.
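This reduction is mechanical enough to verify in a few lines of code. A small sketch (d = 0.8 is an arbitrary example discriminator output):

```python
import math

def bce(y: float, d: float) -> float:
    """Binary cross entropy: E = -(1 - y)*log(1 - d) - y*log(d)."""
    return -(1.0 - y) * math.log(1.0 - d) - y * math.log(d)

d = 0.8  # some discriminator output D(x) in (0, 1)

# label 1: the (1 - y) term vanishes, leaving -log(D(x))
error_label_one = bce(1.0, d)
piecewise_one = -math.log(d)

# label 0: the y term vanishes, leaving -log(1 - D(x))
error_label_zero = bce(0.0, d)
piecewise_zero = -math.log(1.0 - d)
```

With d = 0.8, the label-1 error is small (the prediction is confident and correct for a real sample) while the label-0 error is large, exactly as the piecewise curves suggest.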

One quick thing to note is that the paper which introduces GANs uses the error function -E instead. Therefore, in order to stay consistent with the original paper, we will redefine our error function to -E.


this is the error function after minor adjustments to match what was originally presented in Ian Goodfellow's paper

This change in the formula means an incorrect prediction (i.e., y = 0 but D outputs 1) will result in an error of -∞ as opposed to ∞.

Applying the Error Function

After deriving a suitable error function for our GAN, the next reasonable step is to apply it to the current setup.


The first step in this process is to set some goals for our models. Essentially, our discriminator, D, should aim to classify all of its inputs correctly, while the generator, G, should try to trick the discriminator by making it misclassify as much data as possible. With these two goals in mind, we now begin to analyze the behavior of our error function. Right away, it is easy to see that the error function attains a maximum value of 0, which only occurs when the discriminator perfectly classifies everything with 100% confidence (this is especially easy to see using the definition of our error function). Additionally, our error function attains a minimum at -∞, which only occurs when the discriminator is 100% confident in its predictions, but is always wrong (this may occur if D(x) is 0 but y = 1).


Combining these two insights, we are able to mathematically formulate a competition between the two models G and D. Namely, G attempts to minimize our error function (G wants the error to be -∞), while D tries to maximize it (D wants the error to be 0). This sort of adversarial competition is also known as a minimax game, where the models G and D compete against each other like players. As a result, we find it more intuitive to call E a value function, V(G,D), where G's goal is to minimize the value of V(G,D), while D's goal is to maximize it. This can be described with the following expression:

unfortunately, this expression is not yet complete; do you see how to improve it?

However, the above formula has a critical flaw: it only takes in a single input at a time. In order to improve the utility of this function, it would be best for it to calculate the error over all of our data (this includes both the data set and everything generated by the generator). This is where it becomes more useful to find the aggregate or total error that the models have over the entire data set. In fact, we can find this total error by just summing up the error for each individual input. To see where this leads us, we must now examine the cases where an input to our discriminator comes from the data set and the cases where an input comes from the generator.

When an input to the discriminator comes from the data set, y will be equal to 1. This means that the value function for that single instance of data becomes log(D(x)). Consequently, if we were to find the error for every piece of data from our data set, the total error for these data entries would be the number of entries multiplied by the error for a single entry. Of course, this assumes that the error is roughly the same for each entry in the data set. Additionally, we can mathematically describe this sum over our data set using 𝔼_(x ∈ p_data), where 𝔼 represents the expected value. Essentially, this expression gives the expected value over entries drawn from the distribution p_data, which is the distribution describing our data set.

Similarly, when an input to the discriminator comes from the generator, y will be equal to 0, so the value function reduces to log(1-D(G(z))). As a result, the total error for everything produced by the generator is equal to the number of items produced by the generator multiplied by the error for a single item (again assuming the error is roughly the same for each item). Once again, we describe the contribution of the generator's items with 𝔼_(z ∈ p_z). We use z here because we are trying to find the error when the input to the discriminator comes from the generator, and items produced by the generator are defined by the input z. Essentially, 𝔼_(z ∈ p_z) gives us the expected value over the items produced by the generator.

Putting our last two insights together, we can achieve a suitable value function:

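The equation image is missing here; as stated in the caption below, this is the value function from Goodfellow et al. (2014), which reads:

```latex
\min_G \max_D V(D, G) =
\mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```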

this is also the value function that appears in Goodfellow's original paper

Training the GAN

Recall our end goal for training: the generator must be able to fool the discriminator. This means that the generator's distribution of outputs, p_g, must equal the distribution of the data set, p_data. However, we may not want p_g to exactly equal p_data. To see why, think about the case where there are outliers in the data set. If we trained our generator to produce outputs with the exact distribution p_data, our generator would inevitably produce some faulty outputs due to these outliers. This is why we want our distributions to be approximately equal.

this is the end goal for our training

Distances between Distributions

Now that we know what we are aiming for in our training procedure, we still lack a way to rigorously define what it means for two distributions to approximate each other. This is where math comes up with the notion of distance between distributions. Essentially, the distance between distributions gives us a measure of how similar two distributions are to each other. This is easily visualized in the figure below.

It turns out that, depending on how our distance function is defined, the results of training will vary. This will be covered in further reading.


Sketching the Algorithm

With this rough understanding of distances, we now have sufficient knowledge to build the framework for an algorithm that trains these models (it turns out that different ways of defining distance lead to different results; this is covered in the further reading). At its core, our training algorithm relies on stochastic gradient descent to update the model parameters (gradient descent will not be covered in this article, as there are many other resources covering the topic). However, training a GAN is unique in that the algorithm must alternate between the models G and D. This is because if all the focus is put on training the discriminator, D will become too good, preventing our generator from learning. Additionally, if we only focus on training the generator, D will be unable to learn properly and will provide useless feedback to the generator. Consequently, our algorithm continues to repeat the following cycle until our end goal is met:

  • Update the discriminator's parameters k times (k is an arbitrary constant)

  • Update the generator's parameters one time

Unfortunately, at the beginning of training, our value function may not provide a large enough gradient; this prevents G from learning effectively. Notice that changes to G only affect the term log(1-D(G(z))), so this is what G wants to minimize. Plotting this out, we see that minimizing this expression is equivalent to maximizing log(D(G(z))). Training the generator this way is much more efficient, as the gradients are larger in the early stages of learning.

Eventually, this method of training is guaranteed to converge to the optimal discriminator, denoted D*. The proof that this optimal discriminator exists is shown in the further reading.
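The alternating cycle above can be sketched on a toy problem. The following is a deliberately simplified, framework-free illustration, not the paper's algorithm verbatim: the "data set" is the single real value 1.0, the generator is a single scalar it can shift, the discriminator is a logistic classifier, all gradients are written out by hand, and the generator uses the non-saturating log(D(G(z))) trick described above.

```python
import math

def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))

# Toy setup (illustrative assumptions): the "data set" is the single real
# value 1.0, the generator just emits a scalar g, and the discriminator is
# logistic: D(x) = sigmoid(w*x + b).
x_real = 1.0
g = -1.0              # generator output, initially far from the data
w, b = 0.0, 0.0       # discriminator parameters
lr_d, lr_g, k = 0.1, 0.02, 5
g_start = g

for step in range(300):
    # 1) update the discriminator k times (gradient ascent on V)
    for _ in range(k):
        d_real = sigmoid(w * x_real + b)
        d_fake = sigmoid(w * g + b)
        w += lr_d * ((1.0 - d_real) * x_real - d_fake * g)
        b += lr_d * ((1.0 - d_real) - d_fake)
    # 2) update the generator once, maximizing log(D(G(z)))
    #    (the non-saturating trick: gradient of log(sigmoid(w*g + b)) in g)
    d_fake = sigmoid(w * g + b)
    g += lr_g * (1.0 - d_fake) * w
```

After training, the generator's output has moved toward the real data value, while the discriminator is pushed back toward uncertainty as the two become indistinguishable.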

Further Reading

Distances

Although distance is easy enough to eyeball, we need a concrete formula if we are to incorporate distance into our training process. As a result, we must find a suitable distance function.

We begin our search with Kullback-Leibler divergence and Jensen-Shannon divergence, the same place where Goodfellow and his colleagues started.

Kullback-Leibler Divergence (KL Divergence)

This article will only aim to give a general grasp of what KL divergence accomplishes. To start off, it is important to note that KL divergence is not a distance metric, because it is asymmetrical and does not satisfy the triangle inequality. This means that, given two probability distributions P and Q, the KL divergence from P to Q is different from the KL divergence from Q to P. Below, we see the mathematical formula that gives the KL divergence from the distribution P to Q.

these are the mathematical formulae for calculating KL divergence between distributions

Notice that there are two ways to calculate KL divergence. The first is used when P and Q are discrete distributions. The second formula is used when P and Q are continuous distributions, where p(x) and q(x) are the probability densities of P and Q respectively. With these basic definitions, we can further "classify" KL divergence into two categories: forward KL divergence and reverse KL divergence. For two distributions P and Q, forward KL is defined as KL(P||Q), while reverse KL is defined as KL(Q||P).
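For discrete distributions, the first formula takes only a few lines to implement, and running it on an example distribution pair shows the asymmetry just described (the specific P and Q below are arbitrary):

```python
import math

def kl(p, q):
    """Discrete KL divergence: KL(P||Q) = sum_i p_i * log(p_i / q_i).

    Terms with p_i = 0 contribute nothing, by the usual convention."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]

forward_kl = kl(P, Q)   # KL(P || Q)
reverse_kl = kl(Q, P)   # KL(Q || P): generally a different number
```

Both values are positive, but they differ, confirming that KL divergence is not symmetric; the divergence of a distribution from itself is 0.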

As a result, when we try to minimize the distance between p_g and p_data while training our GAN, we are essentially minimizing the KL divergence between the distributions; mathematically, this is expressed as min(KL(p_g||p_data)).

Now, we can begin to analyze what happens when we use forward and reverse KL to train our GAN. When using forward KL, we aim to minimize KL(p_data||p_g), so the distribution p_g will essentially spread out across p_data in order to minimize the KL divergence. This can be seen below, where p_g is plotted in red and p_data is plotted in blue.

using forward KL to train a GAN

When p_g is initialized as seen on the left, there are certain places along the plot where p_g is near 0 while p_data is not. Plugging these into our formula for forward KL divergence, we see that there will be terms where log(p_data(x) / p_g(x)) approaches infinity. In order to prevent this from occurring, p_g is essentially stretched out so that the forward KL divergence no longer blows up. This is known as mean-seeking or zero-avoidance behavior.

Applying reverse KL, our goal becomes minimizing KL(p_g||p_data). Here, p_g will end up covering a single mode of the distribution p_data. This can be visualized below:

using reverse KL to train a GAN

In order to minimize the reverse KL divergence, we want to maximize the number of terms in the summation that go to 0. This means that we want p_g to have many points where p_g is near 0 but p_data is not (places like this contribute a KL divergence of 0, which can be verified by plugging numbers into our formula). Additionally, the mode under which p_g lies will also have a KL divergence near 0, because the expression log(p_g / p_data) evaluates near log(1), which is 0. This sort of "shrinking" behavior is known as mode-seeking behavior.

Unfortunately, when we look at the case of two discrete distributions that do not overlap, the KL divergence blows up to infinity. This is undesirable and leads to issues in training. This is where Jensen-Shannon divergence comes into play.

Jensen-Shannon Divergence (JSD)

Jensen-Shannon divergence, or JSD, is an alternative method of measuring distance. It uses elements of KL divergence but handles the case where the distributions do not overlap. The formula for calculating JSD is shown below:

the formula for calculating JSD from distribution P to Q

It turns out that when our distributions do not overlap, the JSD actually converges to log(2). This means we now have a way to effectively measure the distance between distributions without having to worry about the divergence going to infinity; consequently, JSD is superior to KL divergence here.
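This log(2) ceiling is easy to verify numerically. A small sketch using the midpoint definition JSD(P,Q) = ½ KL(P||M) + ½ KL(Q||M) with M = (P+Q)/2, applied to two completely disjoint discrete distributions:

```python
import math

def kl(p, q):
    # discrete KL divergence; 0 * log(0/x) is taken to be 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL from P and Q to their midpoint M."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two completely non-overlapping discrete distributions:
P = [1.0, 0.0]
Q = [0.0, 1.0]
disjoint_jsd = jsd(P, Q)   # finite: equals log(2) rather than blowing up
```

Unlike KL divergence, JSD is also symmetric, which is easy to confirm by swapping the arguments.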

this video gives a deeper dive into KL divergence and JSD (video credits: Ahlad Kumar)

This concludes an introductory glimpse into distance functions and how they can be used to train GANs. However, even JSD is not without its flaws. As a result, researchers may choose to use a variation on the GAN, such as the Wasserstein GAN (which uses Wasserstein distance) or InfoGAN.

Proof that the Optimal Discriminator Exists

Once our algorithm has been sketched out, we still need to prove that it accomplishes what it sets out to do. Namely, we want to prove that our algorithm produces the optimal discriminator, denoted D*.

I will begin by making a proposition: when G is fixed, the optimal discriminator D is given by the following.

the optimal discriminator for a fixed G
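The image of the proposition is missing here; as given in Goodfellow et al. (2014), the optimal discriminator for a fixed G is:

```latex
D^{*}_{G}(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{g}(x)}
```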

Proof: The goal of the discriminator is to maximize the value function V(G,D). Notice that the expected value over the data set can instead be written as an integral over the distribution of the data; similar rules apply for the generator. This leaves us with:

the value function written using integrals
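The equation image is missing from this copy; the integral form of the value function, reconstructed from the paper, is:

```latex
V(G, D) = \int_{x} p_{\text{data}}(x)\,\log\big(D(x)\big)\,dx
        + \int_{z} p_{z}(z)\,\log\big(1 - D(G(z))\big)\,dz
```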

From here, we can make a "change of variable." Namely, we replace G(z) with x and change our distribution from p_z to p_g. This essentially rewrites the second term in terms of the output produced by G.

simplifying the value function
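The simplified equation shown here as an image is missing; after the change of variable, both terms combine into a single integral over x:

```latex
V(G, D) = \int_{x} \Big[\, p_{\text{data}}(x)\,\log\big(D(x)\big)
        + p_{g}(x)\,\log\big(1 - D(x)\big) \,\Big]\,dx
```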

Now, maximizing V becomes a matter of maximizing the integrand. From basic calculus, we know that the expression a·log(y) + b·log(1-y) attains its maximum on the interval [0,1] at y = a/(a+b). Plugging this into our value function, we find that the discriminator which maximizes the integrand is exactly the one we proposed above. This concludes our proof.
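The calculus step is easy to check numerically. A quick sketch with arbitrary positive stand-ins for a and b (playing the roles of p_data(x) and p_g(x) at one fixed x):

```python
import math

a, b = 2.0, 3.0  # stand-ins for p_data(x) and p_g(x) at one fixed x

def integrand(y: float) -> float:
    # f(y) = a*log(y) + b*log(1 - y), the integrand at a fixed x
    return a * math.log(y) + b * math.log(1.0 - y)

# Scan (0, 1) on a fine grid and locate the maximum.
grid = [i / 10000.0 for i in range(1, 10000)]
best_y = max(grid, key=integrand)
predicted = a / (a + b)   # the maximizer claimed by the calculus argument
```

The grid search lands on a/(a+b) = 0.4, matching the proposed optimal discriminator value.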

Conclusion

In summary, this article aimed to cover the math that goes into creating deepfakes, particularly GANs. This introductory look into GANs should provide you with enough information to gain a general understanding of how a deepfake might be created using artificial intelligence. All images except the one in the title were created by the author, and equations were rendered using LaTeX in TexStudio. If you have any questions or just want to talk, feel free to reach out to me at felix.liu.2005@gmail.com.

Translated from: https://towardsdatascience.com/the-math-behind-deepfakes-b9ef8621541f
