當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

使用mnist数据集_使用MNIST数据集上的t分布随机邻居嵌入（t-SNE）进行降维

發布時間：2023/12/15 编程问答 52 豆豆

生活随笔收集整理的這篇文章主要介紹了使用mnist数据集_使用MNIST数据集上的t分布随机邻居嵌入（t-SNE）进行降维小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

使用mnist數據集

It is easy for us to visualize two or three dimensional data, but once it goes beyond three dimensions, it becomes much harder to see what high dimensional data looks like.

對我們來說，可視化二維或三維數據很容易，但是一旦它超出了三維，就很難看到高維數據的外觀。

Today we are often in a situation that we need to analyze and find patterns on datasets with thousands or even millions of dimensions, which makes visualization a bit of a challenge. However, a tool that can definitely help us better understand the data is dimensionality reduction.

如今，我們經常處于需要分析和查找具有數千甚至上百萬個維度的數據集的模式的情況，這使可視化成為一個挑戰。但是，絕對可以幫助我們更好地理解數據的工具是降維。

In this post, I will discuss t-SNE, a popular non-linear dimensionality reduction technique and how to implement it in Python using sklearn. The dataset I have chosen here is the popular MNIST dataset.

在本文中，我將討論t-SNE(一種流行的非線性降維技術)以及如何使用sklearn在Python中實現該技術。我在這里選擇的數據集是流行的MNIST數據集。

好奇心表 (Table of Curiosities)

What is t-SNE and how does it work?

什么是t-SNE，它如何工作？

How is t-SNE different with PCA?

t-SNE與PCA有何不同？

How can we improve upon t-SNE?

我們如何改善t-SNE？

What are the limitations?

有什么限制？

What can we do next?

接下來我們該怎么辦？

總覽 (Overview)

T-Distributed Stochastic Neighbor Embedding, or t-SNE, is a machine learning algorithm and it is often used to embedding high dimensional data in a low dimensional space [1].

T分布隨機鄰居嵌入(t-SNE)是一種機器學習算法，通常用于在低維空間中嵌入高維數據[1]。

In simple terms, the approach of t-SNE can be broken down into two steps. The first step is to represent the high dimensional data by constructing a probability distribution P, where the probability of similar points being picked is high, whereas the probability of dissimilar points being picked is low. The second step is to create a low dimensional space with another probability distribution Q that preserves the property of P as close as possible.

簡單來說，t-SNE的方法可以分為兩個步驟。第一步是通過構造概率分布P來表示高維數據，其中相似點被拾取的概率較高，而相異點被拾取的概率較低。第二步是創建具有另一個概率分布Q的低維空間，該概率分布Q保持P的屬性盡可能接近。

In step 1, we compute the similarity between two data points using a conditional probability p. For example, the conditional probability of j given i represents that x_j would be picked by x_i as its neighbor assuming neighbors are picked in proportion to their probability density under a Gaussian distribution centered at x_i [1]. In step 2, we let y_i and y_j to be the low dimensional counterparts of x_i and x_j, respectively. Then we consider q to be a similar conditional probability for y_j being picked by y_i and we employ a student t-distribution in the low dimension map. The locations of the low dimensional data points are determined by minimizing the Kullback–Leibler divergence of probability distribution P from Q.

在步驟1中，我們使用條件概率p計算兩個數據點之間的相似度。例如，給定i的條件概率j表示x_j將被x_i作為其鄰居，并假設在以x_i [1]為中心的高斯分布下，按其概率密度成比例地選擇了鄰居。在步驟2中，我們讓y_i和y_j分別為x_i和x_j的低維對應物。然后我們認為q是y_i選擇y_j的相似條件概率，并且在低維圖中使用學生t分布 。通過最小化概率分布P與Q的Kullback-Leibler散度來確定低維數據點的位置。

For more technical details of t-SNE, check out this paper.

有關t-SNE的更多技術細節，請查閱本文。

I have chosen the MNIST dataset from Kaggle (link) as the example here because it is a simple computer vision dataset, with 28x28 pixel images of handwritten digits (0–9). We can think of each instance as a data point embedded in a 784-dimensional space.

我選擇了Kaggle( link )中的MNIST數據集作為示例，因為它是一個簡單的計算機視覺數據集，具有28x28像素數字(0–9)的像素圖像。我們可以將每個實例視為嵌入784維空間的數據點。

To see the full Python code, check out my Kaggle kernel.

要查看完整的Python代碼，請查看我的Kaggle內核。

Without further ado, let’s get to the details!

事不宜遲，讓我們來談談細節！

勘探 (Exploration)

Note that in the original Kaggle competition, the goal is to build a ML model using the training images with true labels that can accurately predict the labels on the test set. For our purposes here we will only use the training set.

請注意，在原始的Kaggle競賽中，目標是使用帶有真實標簽的訓練圖像構建ML模型，該標簽可以準確預測測試集上的標簽。為了我們的目的，我們將僅使用訓練集。

As usual, we check its shape first:

與往常一樣，我們首先檢查其形狀：

train.shape
--------------------------------------------------------------------
(42000, 785)

There are 42K training instances. The 785 columns are the 784 pixel values, as well as the ‘label’ column.

有42K個訓練實例。 785列是784像素值，以及“標簽”列。

We can check the label distribution as well:

我們也可以檢查標簽分布：

label = train["label"]
label.value_counts()
--------------------------------------------------------------------
1 4684
7 4401
3 4351
9 4188
2 4177
6 4137
0 4132
4 4072
8 4063
5 3795
Name: label, dtype: int64

Principal Component Analysis (PCA)

主成分分析(PCA)

Before we implement t-SNE, let’s try PCA, a popular linear method for dimensionality reduction.

在實施t-SNE之前，讓我們嘗試PCA，一種流行的線性降維方法。

After we standardize the data, we can transform our data using PCA (specify ‘n_components’ to be 2):

在對數據進行標準化之后，我們可以使用PCA轉換數據(將'n_components'指定為2)：

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCAtrain = StandardScaler().fit_transform(train)
pca = PCA(n_components=2)
pca_res = pca.fit_transform(train)

Let’s make a scatter plot to visualize the result:

讓我們繪制一個散點圖以可視化結果：

sns.scatterplot(x = pca_res[:,0], y = pca_res[:,1], hue = label, palette = sns.hls_palette(10), legend = 'full');2D Scatter plot of MNIST data after applying PCA應用PCA后MNIST數據的2D散點圖

As shown in the scatter plot, PCA with two components does not sufficiently provide meaningful insights and patterns about the different labels. We know one drawback of PCA is that the linear projection can’t capture non-linear dependencies. Let’s try t-SNE now.

如散點圖所示，具有兩個組件的PCA不足以提供有關不同標簽的有意義的見解和模式。我們知道PCA的一個缺點是線性投影無法捕獲非線性依賴性。讓我們現在嘗試t-SNE。

T-SNE with sklearn

帶Sklearn的T-SNE

We will implement t-SNE using sklearn.manifold (documentation):

我們將使用sklearn.manifold ( 文檔 )實現t-SNE：

from sklearn.manifold import TSNEtsne = TSNE(n_components = 2, random_state=0)
tsne_res = tsne.fit_transform(train)
sns.scatterplot(x = tsne_res[:,0], y = tsne_res[:,1], hue = label, palette = sns.hls_palette(10), legend = 'full');2D Scatter plot of MNIST data after applying t-SNE應用t-SNE后MNIST數據的二維散點圖

Now we can see that the different clusters are more separable compared with the result from PCA. Here are a few observations on this plot:

現在我們可以看到，與PCA的結果相比，不同的聚類更可分離。以下是該圖的一些觀察結果：

The “5” data points seem to be more spread out compared with the other clusters such as “2” and “4”.

與“ 2”和“ 4”之類的其他群集相比，“ 5”數據點似乎更分散。

There are a few “5” and “8” data points that are similar to “3”s.

有一些“ 5”和“ 8”數據點與“ 3”相似。

There are two clusters of “7” and “9” where they are next to each other.

有兩個簇“ 7”和“ 9”彼此相鄰。

An Approach that Combines Both

結合兩者的方法

It is generally recommended to use PCA or TruncatedSVD to reduce the number of dimension to a reasonable amount (e.g. 50) before applying t-SNE [2].

通常建議在應用t-SNE之前使用PCA或TruncatedSVD將尺寸數量減少到合理的數量(例如50)[2]。

Doing so can reduce the level of noise as well as speed up the computations.

這樣做可以降低噪聲水平并加快計算速度。

Let’s try PCA (50 components) first and then apply t-SNE. Here is the scatter plot:

讓我們先嘗試PCA(50個組件)，然后再應用t-SNE。這是散點圖：

2D Scatter plot of MNIST data after applying PCA(50 components) and then t-SNE先應用PCA(50個分量)再進行t-SNE后的MNIST數據的二維散點圖

Compared with the previous scatter plot, wecan now separate out the 10 clusters better. here are a few observations:

與以前的散點圖相比，我們現在可以更好地分離出10個群集。以下是一些觀察結果：

Most of the “5” data points are not as spread out as before, despite a few that still look like “3”.

盡管很少有5個數據點看起來仍然像“ 3”個數據點，但大多數“ 5”個數據點的分布都沒有像以前那樣分散。

There is one cluster of “7” and one cluster of “9” now.

現在有一個集群“ 7”和一個集群“ 9”。

Besides, the runtime in this approach decreased by over 60%.

此外，這種方法的運行時間減少了60％以上。

For more interactive 3D scatter plots, check out this post.

有關更多交互式3D散點圖，請查看此文章。

局限性 (Limitations)

Here are a few limitations of t-SNE:

這是t-SNE的一些限制：

Unlike PCA, the cost function of t-SNE is non-convex, meaning there is a possibility that we would be stuck in a local minima.

與PCA不同，t-SNE的成本函數是非凸的，這意味著我們有可能陷入局部最小值。

Similar to other dimensionality reduction techniques, the meaning of the compressed dimensions as well as the transformed features becomes less interpretable.

與其他降維技術類似，壓縮尺寸以及變換后的特征的含義變得難以解釋。

下一步 (Next Steps)

Here are a few things that we can try as next steps:

以下是一些我們可以嘗試做的下一步操作：

Hyperparameter tuning — Try tune ‘perplexity’ and see its effect on the visualized output.

超參數調整—嘗試調整“困惑”，并查看其對可視化輸出的影響。

Try some of the other non-linear techniques such as Uniform Manifold Approximation and Projection (UMAP), which is the generalization of t-SNE and it is based on Riemannian geometry.

嘗試其他一些非線性技術，例如統一流形逼近和投影 (UMAP)，它是t-SNE的推廣，它基于黎曼幾何。

Train ML models on the transformed data and compare its performance with those from models without dimensionality reduction.

在轉換后的數據上訓練ML模型，并將其性能與未降維的模型的性能進行比較。

摘要 (Summary)

Let’s quickly recap.

讓我們快速回顧一下。

We implemented t-SNE using sklearn on the MNIST dataset. We compared the visualized output with that from using PCA, and lastly, we tried a mixed approach which applies PCA first and then t-SNE.

我們在MNIST數據集上使用sklearn實現了t-SNE。我們將可視化輸出與使用PCA的可視化輸出進行了比較，最后，我們嘗試了一種混合方法，該方法首先應用PCA，然后再應用t-SNE。

I hope you enjoyed this blog post and please share any thoughts that you may have :)

我希望您喜歡這篇博客文章，并請分享您可能有的任何想法:)

Check out my other post on Chi-square test for independence:

看看我關于卡方檢驗的其他文章是否具有獨立性：

翻譯自: https://towardsdatascience.com/dimensionality-reduction-using-t-distributed-stochastic-neighbor-embedding-t-sne-on-the-mnist-9d36a3dd4521