Machine Learning, Visualized

Introduction to Machine Learning
In the traditional hard-coded approach, we program a computer to perform a certain task: we tell it exactly what to do when it receives a certain input. In mathematical terms, this is like saying that we write a function f(x) such that when users feed an input x into f(x), it gives the correct output y.
In machine learning, however, we have a large set of inputs x and corresponding outputs y, but not the function f(x). The goal is to find the f(x) that transforms the input x into the output y. That is not an easy job, and in this article we will see how it is done.
Dataset
To visualize the dataset, let's make a synthetic dataset where each data point (input x) is 3-dimensional, making it suitable for plotting on a 3D chart. We will generate 250 points (cluster 0) in a cluster centered at the origin (0, 0, 0). A similar cluster of 250 points (cluster 1) is generated, but not centered at the origin. The two clusters are relatively close, but there is a clear separation between them, as seen in the image below. These two clusters are the two classes of data points. The big green dot represents the centroid of the whole dataset.
Image by Author

After generating the dataset, we will normalize it by subtracting the mean and dividing by the standard deviation. This zero-centers the data and maps the values in each dimension of the dataset to a common scale, which speeds up learning.
The data will be saved in an array X containing the 3D coordinates of the normalized points. We will also generate an array Y whose value at each index is either 0 or 1, depending on which cluster the corresponding 3D point belongs to.
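The dataset construction described above can be sketched with NumPy. The center of cluster 1, the spread, and the random seed are illustrative choices, not the article's exact values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two 3D Gaussian clusters of 250 points each: cluster 0 centered at the
# origin, cluster 1 shifted so the clusters are close but still separable.
cluster0 = rng.normal(loc=(0.0, 0.0, 0.0), scale=0.5, size=(250, 3))
cluster1 = rng.normal(loc=(1.5, 1.5, 1.5), scale=0.5, size=(250, 3))

X = np.vstack([cluster0, cluster1])                 # shape (500, 3)
Y = np.concatenate([np.zeros(250), np.ones(250)])   # cluster label per point

# Normalize: zero-center each dimension and scale it to unit variance.
X = (X - X.mean(axis=0)) / X.std(axis=0)
```

After normalization, every dimension of X has mean 0 and standard deviation 1, which keeps all three coordinates on a common scale.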
Learnable Function
Now that we have our data ready, we can say that we have the x and y. We know that the dataset is linearly separable, meaning there is a plane that can divide the dataset into the two clusters, but we don't know the equation of such an optimal plane. For now, let's just take a random plane.
The function f(x) should take a 3D coordinate as input and output a number between 0 and 1. If this number is less than 0.5, the point belongs to cluster 0; otherwise, it belongs to cluster 1. Let's define a simple function for this task.
- x: input tensor of shape (num_points, 3)
- W: weight (parameter) of shape (3, 1), chosen randomly
- B: bias (parameter) of shape (1, 1), chosen randomly
- Sigmoid: a function that maps values into the range (0, 1)
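A minimal sketch of this function in PyTorch, with the shapes from the list above (the article's own code is not shown, so the names and initialization here are illustrative):

```python
import torch

torch.manual_seed(0)

# Randomly initialized parameters of the linear map.
W = torch.randn(3, 1, requires_grad=True)  # weight, shape (3, 1)
B = torch.randn(1, 1, requires_grad=True)  # bias, shape (1, 1)

def f(x):
    """Map a batch of 3D points, shape (num_points, 3), to values in (0, 1)."""
    return torch.sigmoid(x @ W + B)
```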
Let's take a moment to understand what this function means. Before applying the sigmoid function, we simply create a linear mapping from the 3D coordinate (input) to a 1D output. This function therefore squishes the whole 3D space onto a line, meaning that each point in the original 3D space now lies somewhere on this line. Since this line extends to infinity in both directions, we map it into (0, 1) using the sigmoid function. As a result, for each given input, f(x) outputs a value between 0 and 1.
Remember that W and B are chosen randomly, so the 3D space is squished onto a random line. The decision boundary for this transformation is the set of points that make f(x) = 0.5. Think about why! As the 3D space is squished onto a 1D line, a whole plane is mapped to the value 0.5 on that line. This plane is the decision boundary for f(x). Ideally, it should divide the dataset into the two clusters, but since W and B are randomly chosen, the plane is randomly oriented, as shown below.
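To see why the boundary is a plane: sigmoid(t) = 0.5 exactly when t = 0, so f(x) = 0.5 exactly on the set of points where Wx + B = 0, which is the equation of a plane in 3D. A quick numeric check, with made-up values for W and B:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Illustrative (randomly picked) parameters for the linear map.
W = np.array([0.4, -1.2, 0.7])
B = 0.3

# Any point satisfying W·x + B = 0 lies on the decision boundary.
x_on_plane = np.array([0.0, 0.25, 0.0])  # W·x = -0.3, which cancels B
print(W @ x_on_plane + B)                # 0.0
print(sigmoid(W @ x_on_plane + B))       # 0.5: exactly on the boundary
```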
Image by Author

Our goal is to find the right values of W and B that orient this plane (the decision boundary) in such a way that it divides the dataset into the two clusters. When this is done, we get a plane as shown below.
Image by Author

Loss
So, we are now at the starting point (a random decision boundary), and we have defined the goal. We need a metric that tells us how far we are from the goal. The output of the classifier is a tensor of shape (num_points, 1) where each value lies between 0 and 1. If you think carefully, these values are just the probabilities of the points belonging to cluster 1. So, we can say that:
- f(x) = P(x belongs to cluster 1)
- 1 - f(x) = P(x belongs to cluster 0)
It wouldn't be wrong to say that [1 - f(x), f(x)] forms a probability distribution over cluster 0 and cluster 1 respectively. This is the predicted probability distribution. We know for sure which cluster every point in the dataset belongs to (from y). So, we also have the true probability distribution:
- [0, 1] when x belongs to cluster 1
- [1, 0] when x belongs to cluster 0
A good metric for the incongruity between two probability distributions is the cross-entropy function. As we are dealing with just 2 classes, we can use Binary Cross-Entropy (BCE), which is available in PyTorch's torch.nn module. If the predicted probability distribution is very similar to the true probability distribution, this function returns a small value, and vice versa. We can average this value over all data points and use it as a measure of how well the classifier is performing.
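As a small sketch of how BCE behaves (the predictions and labels below are made up):

```python
import torch

loss_fn = torch.nn.BCELoss()  # averages over all points by default

# Hypothetical predicted probabilities and true labels for 4 points.
pred = torch.tensor([[0.9], [0.2], [0.8], [0.1]])
true = torch.tensor([[1.0], [0.0], [1.0], [0.0]])
good_loss = loss_fn(pred, true)       # predictions agree with labels: small

# Flipping the labels makes the two distributions disagree: larger loss.
bad_loss = loss_fn(pred, 1.0 - true)
print(good_loss < bad_loss)           # tensor(True)
```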
This value is called the loss, and mathematically, our goal now is to minimize this loss.
Training
Now that we have defined our goal mathematically, how do we reach it in practice? In other words, how do we find optimal values for W and B? To understand this, we will use some basic calculus. Recall that we currently have random values for W and B. The process of learning (training, reaching the goal, minimizing the loss) can be divided into two steps:
Forward-propagation: we feed the dataset through the classifier f(x) and use BCE to compute the loss.
Backpropagation: using the loss, we adjust the values of W and B so that the loss decreases.
These two steps are repeated over and over again until the loss stops decreasing. At that point, we say that we have reached the goal!
Backpropagation
Forward propagation is simple and was already discussed above. However, it is worth taking a moment to understand backpropagation, as it is the key to machine learning. Recall that we have 3 parameters (variables) in W and 1 in B, so in total we have 4 values to optimize.
Once we have the loss from forward-propagation, we calculate the gradients of the loss function with respect to each variable in the classifier. If we plot the loss against different values of each parameter, we can see that the loss reaches its minimum at a particular value of each parameter. I have plotted loss vs. parameter for each of the four parameters.
Image by Author

An important observation here is that the loss is minimized at a particular value of each of these parameters, as shown by the red dot.
Let's consider the first plot and discuss how w1 will be optimized. The process is the same for the other parameters. Initially, the values of W and B are chosen randomly, so (w1, loss) starts at a random place on this curve, as shown by the green dot.
Image by Author

Now, the goal is to reach the red dot, starting from the green dot. In other words, we need to move downhill. Looking at the slope of the curve at the green dot, we can tell that increasing w1 (moving right) will lower the loss and therefore move the green dot closer to the red one. In mathematical terms, if the gradient of the loss with respect to w1 is negative, we increase w1 to move downhill, and vice versa. Therefore, w1 should be updated as:
w1 = w1 - learning_rate * (∂loss/∂w1)

This equation is known as the gradient descent update. Here, the learning_rate controls how much we increase or decrease w1. If the learning_rate is large, the update is large; this can make w1 overshoot the red dot and miss the optimal value. If it is too small, w1 will take forever to reach the red dot. You can experiment with different learning rates to see which works best. In general, small values like 0.01 work well for most cases.
In most cases, a single update is not enough to optimize these parameters, so forward-propagation and backpropagation are repeated in a loop until the loss stops decreasing. Let's see this in action:
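The full loop can be sketched as follows, assuming PyTorch's autograd computes the gradients. X and Y here are random stand-ins for the normalized dataset and its labels, and the hyperparameters are illustrative:

```python
import torch

torch.manual_seed(0)
X = torch.randn(500, 3)                       # stand-in for the normalized dataset
Y = (X.sum(dim=1, keepdim=True) > 0).float()  # stand-in, linearly separable labels

W = torch.randn(3, 1, requires_grad=True)
B = torch.randn(1, 1, requires_grad=True)
loss_fn = torch.nn.BCELoss()
learning_rate = 0.01

for epoch in range(1000):
    # Forward-propagation: predictions and loss.
    pred = torch.sigmoid(X @ W + B)
    loss = loss_fn(pred, Y)

    # Backpropagation: gradients of the loss w.r.t. W and B.
    loss.backward()

    # Gradient descent update, then clear the gradients for the next epoch.
    with torch.no_grad():
        W -= learning_rate * W.grad
        B -= learning_rate * B.grad
        W.grad.zero_()
        B.grad.zero_()

# Fraction of points on the correct side of the learned decision boundary.
accuracy = ((torch.sigmoid(X @ W + B) > 0.5).float() == Y).float().mean()
```

In practice one would stop when the loss plateaus rather than after a fixed number of epochs; the fixed count keeps the sketch simple.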
Gif by Author

An important observation is that the green dot initially moves quickly and slows down as it gradually approaches the minimum. The large slope (gradient) during the first few epochs (when the green dot is far from the minimum) is responsible for the large updates to the parameters. The gradient decreases as the green dot approaches the minimum, so the updates become small. The other three parameters are trained in parallel in exactly the same way. Another important observation is that the shape of the curve changes from epoch to epoch. This is because the other three parameters (w2, w3, b) are also being updated in parallel, and each parameter contributes to the shape of the loss curve.
Visualize
Let's see how the decision boundary updates in real time as the parameters are being updated.
Gif by Author

That's all folks!
If you made it till here, hats off to you! In this article, we took a visual approach to understanding how machine learning works. We saw how a simple 3D-to-1D mapping, f(x), can be used to fit a decision boundary (a 2D plane) to a linearly separable 3D dataset. We discussed how forward propagation is used to calculate the loss, followed by backpropagation, where the gradients of the loss with respect to the parameters are calculated and the parameters are updated repeatedly in a training loop.
If you have any suggestions, please leave a comment. I write articles regularly, so consider following me to get more of them in your feed.
If you liked this article, you might as well love these:
Visit my website to learn more about me and my work.
Translated from: https://towardsdatascience.com/machine-learning-visualized-11965ecc645c