
UMAP Dimensionality Reduction: How the Algorithm Works, with an Application Example

Published: 2023/12/20


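Dimensionality reduction is not only about data visualization. It can also identify key structure in a high-dimensional space and preserve it in a low-dimensional embedding, which helps to overcome the "curse of dimensionality".

This article explains the inner workings of a popular dimensionality reduction technique, Uniform Manifold Approximation and Projection (UMAP), and provides a Python example.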

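How does UMAP work?

Dissecting the UMAP name

Let us start by dissecting the UMAP name, which gives a general idea of what the algorithm is supposed to do.

The descriptions below are not official definitions; they are my own summary of the key points that help in understanding UMAP.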

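Projection: the process or technique of reproducing a spatial object on a plane, a curved surface, or a line by projecting its points. It can also be seen as a mapping of an object from a high-dimensional space to a low-dimensional one.

Approximation: the algorithm assumes that we only have a finite set of data samples (points), not the entire set that makes up the manifold. We therefore need to approximate the manifold from the data available.

Manifold: a manifold is a topological space that locally resembles Euclidean space near each point. One-dimensional manifolds include lines and circles, but not figure-eight shapes (the crossing point has no line-like neighborhood). Two-dimensional manifolds (also known as surfaces) include planes, spheres, tori, and so on.

Uniform: the uniformity assumption says that our data samples are uniformly distributed across the manifold. In the real world this is rarely the case. The assumption therefore leads to the notion that distance varies across the manifold: space itself is warped, stretching where the data appear sparser and shrinking where they appear denser.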

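Putting it all together, we can describe UMAP as:

A dimensionality reduction technique that assumes the available data samples are uniformly (Uniform) distributed across a topological space (Manifold), which can be approximated (Approximation) from these finite data samples and mapped (Projection) to a lower-dimensional space.

This description helps a little in understanding the idea, but it still does not say how UMAP achieves it. To answer the "how" question, let us go through the individual steps UMAP performs.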

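The steps UMAP performs

We can split UMAP into two major steps:

Learn the manifold structure in the high-dimensional space.

Find a low-dimensional representation of that manifold.

Below we break these down into smaller parts to deepen our understanding of the algorithm. The map below shows where we are in the workflow as we analyze each part.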

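1. Learning the manifold structure

Before we can map the data to lower dimensions, we first need to figure out what it looks like in the high-dimensional space.

1.1. Finding nearest neighbors

UMAP starts by finding the nearest neighbors with the Nearest-Neighbor-Descent algorithm. We specify how many neighbors to use through UMAP's n_neighbors hyperparameter.

It is important to experiment with n_neighbors because it controls how UMAP balances local against global structure in the data. It does so by constraining the size of the local neighborhood used when learning the manifold structure.

In essence, a small n_neighbors value asks for a very local interpretation that accurately captures the fine detail of the structure, whereas a large n_neighbors value bases the estimates on larger regions, making them more broadly accurate across the manifold as a whole.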
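To make the role of n_neighbors concrete, here is a minimal sketch of a neighbor search. It is an exact brute-force search used as a stand-in for Nearest-Neighbor-Descent (which is approximate and much faster on large data); the function name k_nearest_neighbors and the toy points are made up for illustration, and unlike some implementations it does not count a point as its own neighbor.

```python
import math

def k_nearest_neighbors(points, k):
    """Return, for each point, (neighbor indices, neighbor distances).

    Exact O(n^2) search, used here only as a stand-in for NN-Descent.
    """
    neighbors = []
    for i, p in enumerate(points):
        # Distance from point i to every other point (the point itself is skipped).
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        nearest = dists[:k]
        neighbors.append(([j for _, j in nearest], [d for d, _ in nearest]))
    return neighbors

# Two well-separated groups of 2D points (made-up toy data).
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

small_k = k_nearest_neighbors(toy_points, k=2)
large_k = k_nearest_neighbors(toy_points, k=4)

print(small_k[0][0])  # neighbors of point 0 with a small neighborhood
print(large_k[0][0])  # neighbors of point 0 with a larger neighborhood
```

With k=2 the neighborhood of point 0 stays inside its own group; with k=4 it has to reach into the distant group, which is the local-versus-global trade-off described above.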

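1.2. Constructing a graph

Next, UMAP builds a graph by connecting the nearest neighbors identified earlier. To understand this process, we break it into a few sub-steps that explain how the neighborhood graph is formed.

1.2.1 Varying distance

As noted in the analysis of the UMAP name, we assume that points are uniformly distributed across the manifold, which implies that the space between them stretches or shrinks depending on where the data appear sparser or denser.

In essence, this means that the distance metric is not universal across the whole space but varies from region to region. We can visualize it by drawing a circle/sphere around each data point; the circles appear to differ in size because the distance metric differs (see the figure below).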
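One way to picture the varying distance is as a per-point scale. The sketch below follows the formulation in the UMAP paper as I understand it: rho is the distance to the nearest neighbor, and sigma is found by bisection so that the exponentially decayed neighbor weights sum to log2(k). The function name smooth_knn_scale and the toy distances are assumptions for illustration, and umap-learn's own routine differs in details.

```python
import math

def smooth_knn_scale(neighbor_dists, n_iter=64):
    """Return (rho, sigma) for one point, given its sorted neighbor distances."""
    k = len(neighbor_dists)
    rho = neighbor_dists[0]      # distance to the nearest neighbor
    target = math.log2(k)        # total weight the k neighbors should add up to

    def total_weight(sigma):
        return sum(math.exp(-max(0.0, d - rho) / sigma) for d in neighbor_dists)

    lo, hi = 1e-6, 1.0
    # Grow the upper bracket until the total weight reaches the target.
    while total_weight(hi) < target:
        hi *= 2.0
    # Bisection: total_weight increases with sigma.
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        if total_weight(mid) < target:
            lo = mid
        else:
            hi = mid
    return rho, (lo + hi) / 2.0

# The same neighbor pattern in a dense region and in a 10x sparser region.
dense = [0.1, 0.2, 0.3, 0.4]
sparse = [1.0, 2.0, 3.0, 4.0]

rho_d, sigma_d = smooth_knn_scale(dense)
rho_s, sigma_s = smooth_knn_scale(sparse)
print(rho_d, sigma_d)
print(rho_s, sigma_s)
```

The sparse point gets a scale ten times larger than the dense one, so after rescaling both neighborhoods look alike. That is the sense in which each point carries its own notion of distance.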

1.2.2 local_connectivity

Next, we want to make sure that the manifold structure we are trying to learn does not leave many points disconnected. Another hyperparameter, local_connectivity (default = 1), addresses this potential problem.

Setting local_connectivity=1 tells the algorithm that every point in the high-dimensional space is connected to at least one other point.

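1.2.3 Fuzzy area

You will have noticed that the figure above also contains fuzzy circles extending beyond the nearest neighbor. They tell us that the further we move from the point of interest, the less certain its connection to other points becomes.

The simplest way to think about the two hyperparameters (local_connectivity and n_neighbors) is as a lower and an upper bound:

local_connectivity (default 1): there is 100% certainty that each point is connected to at least one other point (the lower bound on the number of connections).

n_neighbors (default 15): there is a 0% chance that a point is directly connected to its 16th or more distant neighbor, because that neighbor falls outside the local area UMAP uses when building the graph.

Neighbors 2 to 15: there is some level of certainty (>0% but <100%) that a point is connected to its 2nd to 15th neighbors.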
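The lower and upper bounds above can be sketched as a weight function. This is an illustration under assumptions: the helper name membership_weights is hypothetical, rho is taken to be the nearest-neighbor distance (the local_connectivity=1 case), and sigma is simply fixed at 1.0 here rather than solved for as UMAP would.

```python
import math

def membership_weights(sorted_dists, n_neighbors, rho, sigma):
    """Fuzzy connection strengths from one point to each candidate neighbor."""
    weights = []
    for rank, d in enumerate(sorted_dists, start=1):
        if rank > n_neighbors:
            # Outside the local neighborhood: never connected (upper bound).
            weights.append(0.0)
        else:
            # The nearest neighbor (d == rho) gets exactly 1.0 (lower bound);
            # more distant neighbors decay towards 0.
            weights.append(math.exp(-max(0.0, d - rho) / sigma))
    return weights

# Made-up sorted distances from one point to its five closest points.
dists = [1.0, 1.5, 2.0, 4.0, 9.0]
weights = membership_weights(dists, n_neighbors=4, rho=dists[0], sigma=1.0)
print([round(w, 3) for w in weights])  # [1.0, 0.607, 0.368, 0.05, 0.0]
```

The first neighbor is connected with certainty, the fifth is never connected because it lies beyond n_neighbors=4, and everything in between receives a weight strictly between 0 and 1.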

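1.2.4 Merging of edges

Finally, note that the connection certainty discussed above is expressed through edge weights (w).

Because we use varying distances, the edge weights inevitably disagree when viewed from the perspective of each point. For example, the weight of the edge A→B differs from that of B→A.

UMAP resolves this disagreement by taking the union of the two edges. The UMAP documentation explains it as follows:

If we want to merge together two disparate edges with weights a and b, then we should have a single edge with combined weight a + b − a·b. The way to think of this is that the weights are effectively the probabilities that an edge (1-simplex) exists. The combined weight is then the probability that at least one of the edges exists.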
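The merging rule is small enough to write out directly. The sketch below applies a + b − a·b to a made-up set of directed edge weights; the names fuzzy_union, directed and merged are illustrative and not part of the UMAP API.

```python
def fuzzy_union(w_ab, w_ba):
    """Probability that at least one of the two directed edges exists."""
    return w_ab + w_ba - w_ab * w_ba

# Made-up directed weights: (source, target) -> weight.
directed = {
    ("A", "B"): 0.8,
    ("B", "A"): 0.2,
    ("A", "C"): 0.5,
    ("C", "A"): 0.5,
    ("B", "C"): 0.4,   # C does not list B as a neighbor, so C -> B is missing
}

merged = {}
for (src, dst), w in directed.items():
    key = tuple(sorted((src, dst)))           # one undirected edge per pair
    if key not in merged:
        reverse = directed.get((dst, src), 0.0)
        merged[key] = fuzzy_union(w, reverse)

for edge, w in sorted(merged.items()):
    print(edge, round(w, 2))
```

An edge that exists in only one direction keeps its weight (0.4 for B and C), while two disagreeing weights combine into something at least as large as either of them (0.84 for A and B).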

Finally, we end up with a connected neighborhood graph, as shown below:

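2. Finding a low-dimensional representation

After learning the approximate manifold from the high-dimensional space, the next step for UMAP is to project (map) it to a low-dimensional space.

2.1. Minimum distance

Unlike in the first step, we do not want varying distances in the low-dimensional representation. Instead, we want distance on the manifold to be standard Euclidean distance with respect to the global coordinate system.

The switch from varying to standard distances also affects the distance to the nearest neighbor. We therefore pass another hyperparameter, min_dist (default = 0.1), to define the minimum distance between embedded points.

In essence, this lets us control the minimum spread of points and avoid a situation where many points sit on top of each other in the low-dimensional embedding.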
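As a rough sketch of how min_dist enters the picture: to my understanding, umap-learn turns min_dist and spread into a target similarity curve that is flat up to min_dist and decays exponentially beyond it, and then fits a smooth curve 1 / (1 + a·d^(2b)) to that target. The function names below are made up, and a = b = 1.0 are placeholder values rather than fitted ones.

```python
import math

def target_similarity(d, min_dist=0.1, spread=1.0):
    """Target curve: fully similar up to min_dist, exponential decay after."""
    if d <= min_dist:
        return 1.0
    return math.exp(-(d - min_dist) / spread)

def low_dim_similarity(d, a=1.0, b=1.0):
    """Smooth curve used in the embedding; a and b are normally fitted
    to the target curve (placeholder values here)."""
    return 1.0 / (1.0 + a * d ** (2 * b))

for d in (0.05, 0.3, 1.0, 3.0):
    print(d,
          round(target_similarity(d, min_dist=0.1), 3),
          round(target_similarity(d, min_dist=0.5), 3))
```

With min_dist=0.5, two points 0.3 apart already count as "as close as they need to be", so the optimizer has no reason to pull them closer. This is why larger min_dist values give looser, more evenly spread embeddings.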

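2.2. Minimizing the cost function (Cross-Entropy)

With the minimum distance specified, the algorithm can start looking for a good low-dimensional representation of the manifold. UMAP does this by minimizing the following cost function, also known as cross-entropy (CE):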
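For reference, the fuzzy-set cross-entropy from the UMAP paper is usually written as below. The notation is an assumption on my part: E is the set of candidate edges, w_h(e) is the weight of edge e in the high-dimensional graph, and w_l(e) is its weight in the low-dimensional embedding.

```latex
CE = \sum_{e \in E} \left[
    w_h(e) \log\frac{w_h(e)}{w_l(e)}
    + \bigl(1 - w_h(e)\bigr) \log\frac{1 - w_h(e)}{1 - w_l(e)}
\right]
```

The first term pulls together points that are strongly connected in high dimensions; the second term pushes apart points that are not.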

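The ultimate goal is to find the optimal edge weights in the low-dimensional representation. These optimal weights emerge as the cross-entropy function above is minimized, a process that is carried out with stochastic gradient descent.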
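To see what "minimizing the cross-entropy" rewards, the sketch below evaluates the fuzzy-set cross-entropy for two candidate sets of low-dimensional edge weights. It only evaluates the cost and does not perform the gradient descent itself; the function name and the example weights are made up for illustration.

```python
import math

def fuzzy_cross_entropy(high_weights, low_weights, eps=1e-12):
    """Cross-entropy between high- and low-dimensional edge weights."""
    total = 0.0
    for w_h, w_l in zip(high_weights, low_weights):
        # Keep the low-dimensional weight strictly inside (0, 1) to avoid log(0).
        w_l = min(max(w_l, eps), 1.0 - eps)
        if w_h > 0.0:
            total += w_h * math.log(w_h / w_l)                        # attraction
        if w_h < 1.0:
            total += (1.0 - w_h) * math.log((1.0 - w_h) / (1.0 - w_l))  # repulsion
    return total

high = [0.9, 0.1]         # edge 1 is strong, edge 2 is weak in high dimensions
good_layout = [0.8, 0.2]  # low-dimensional weights that roughly agree
bad_layout = [0.2, 0.8]   # low-dimensional weights that disagree

print(fuzzy_cross_entropy(high, good_layout))
print(fuzzy_cross_entropy(high, bad_layout))
print(fuzzy_cross_entropy(high, high))
```

The layout whose edge weights agree with the high-dimensional graph has the lower cost, and a perfect match has a cost of zero, so driving the cost down moves the embedding towards the learned graph.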

That is it! UMAP's work is now done, and we are left with an array containing the coordinates of each data point in the specified low-dimensional space.

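Using UMAP in Python

Now that we have covered how UMAP works, let us put it into practice in Python.

We will apply UMAP to the MNIST digits dataset (a collection of handwritten digits) to show how well it separates the digits and displays them in a low-dimensional space. Strictly speaking, the data loaded below is scikit-learn's small bundled digits set (8x8-pixel images) rather than the full MNIST set.
We will use the following data and libraries:

1. Scikit-learn: the digits data (load_digits), and train_test_split for splitting the data into training and test samples;

2. The UMAP library for performing dimensionality reduction;

3. Plotly and Matplotlib for data visualization;

4. Pandas and Numpy for data manipulation.

The first step is to import the libraries listed above.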

# Data manipulation
import pandas as pd  # for data manipulation
import numpy as np  # for data manipulation

# Visualization
import plotly.express as px  # for data visualization
import matplotlib.pyplot as plt  # for showing handwritten digits

# Sklearn
from sklearn.datasets import load_digits  # for the digits data
from sklearn.model_selection import train_test_split  # for splitting data into train and test samples

# UMAP dimensionality reduction
from umap import UMAP

Next, we load the digits data and display images of the first 10 handwritten digits.

# Load digits data
digits = load_digits()

# Load arrays containing digit data (64 pixels per image) and their true labels
X, y = load_digits(return_X_y=True)

# Some stats
print('Shape of digit images: ', digits.images.shape)
print('Shape of X (main data): ', X.shape)
print('Shape of y (true labels): ', y.shape)

# Display images of the first 10 digits
fig, axs = plt.subplots(2, 5, sharey=False, tight_layout=True, figsize=(12,6), facecolor='white')
n = 0
plt.gray()
for i in range(0, 2):
    for j in range(0, 5):
        axs[i, j].matshow(digits.images[n])
        axs[i, j].set(title=y[n])
        n = n + 1
plt.show()

Next, we create a function for drawing a 3D scatter plot, which we can reuse to display the results of the UMAP dimensionality reduction.

def chart(X, y):
    #--------------------------------------------------------------------------#
    # This section is not mandatory as its purpose is to sort the data by label
    # so, we can maintain consistent colors for digits across multiple graphs

    # Concatenate X and y arrays
    arr_concat = np.concatenate((X, y.reshape(y.shape[0], 1)), axis=1)
    # Create a Pandas dataframe using the above array
    df = pd.DataFrame(arr_concat, columns=['x', 'y', 'z', 'label'])
    # Convert label data type from float to integer
    df['label'] = df['label'].astype(int)
    # Finally, sort the dataframe by label
    df.sort_values(by='label', axis=0, ascending=True, inplace=True)
    #--------------------------------------------------------------------------#

    # Create a 3D graph
    fig = px.scatter_3d(df, x='x', y='y', z='z', color=df['label'].astype(str), height=900, width=950)

    # Update chart looks
    fig.update_layout(
        title_text='UMAP',
        showlegend=True,
        legend=dict(orientation="h", yanchor="top", y=0, xanchor="center", x=0.5),
        scene_camera=dict(up=dict(x=0, y=0, z=1),
                          center=dict(x=0, y=0, z=-0.1),
                          eye=dict(x=1.5, y=-1.4, z=0.5)),
        margin=dict(l=0, r=0, b=0, t=0),
        scene=dict(
            xaxis=dict(backgroundcolor='white', color='black', gridcolor='#f0f0f0',
                       title_font=dict(size=10), tickfont=dict(size=10)),
            yaxis=dict(backgroundcolor='white', color='black', gridcolor='#f0f0f0',
                       title_font=dict(size=10), tickfont=dict(size=10)),
            zaxis=dict(backgroundcolor='lightgrey', color='black', gridcolor='#f0f0f0',
                       title_font=dict(size=10), tickfont=dict(size=10)),
        ),
    )

    # Update marker size
    fig.update_traces(marker=dict(size=3, line=dict(color='black', width=0.1)))

    fig.show()

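Applying UMAP to our data

Now we take the digits data that we loaded into X earlier. The shape of X, (1797, 64), tells us that we have 1,797 digits, each made up of 64 dimensions.

We will use UMAP to reduce the number of dimensions from 64 to 3. I have listed every hyperparameter available in UMAP with a brief explanation of what it does.

Although I leave most hyperparameters at their default values in this example, I encourage you to experiment with them and see how they affect the results.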

# Configure UMAP hyperparameters
reducer = UMAP(
    n_neighbors=100,  # default 15, The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation.
    n_components=3,  # default 2, The dimension of the space to embed into.
    metric='euclidean',  # default 'euclidean', The metric to use to compute distances in high dimensional space.
    n_epochs=1000,  # default None, The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings.
    learning_rate=1.0,  # default 1.0, The initial learning rate for the embedding optimization.
    init='spectral',  # default 'spectral', How to initialize the low dimensional embedding. Options are: {'spectral', 'random', A numpy array of initial embedding positions}.
    min_dist=0.1,  # default 0.1, The effective minimum distance between embedded points.
    spread=1.0,  # default 1.0, The effective scale of embedded points. In combination with ``min_dist`` this determines how clustered/clumped the embedded points are.
    low_memory=False,  # the default has differed between umap-learn releases. For some datasets the nearest neighbor computation can consume a lot of memory. If you find that UMAP is failing due to memory constraints consider setting this option to True.
    set_op_mix_ratio=1.0,  # default 1.0, The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.
    local_connectivity=1,  # default 1, The local connectivity required -- i.e. the number of nearest neighbors that should be assumed to be connected at a local level.
    repulsion_strength=1.0,  # default 1.0, Weighting applied to negative samples in low dimensional embedding optimization.
    negative_sample_rate=5,  # default 5, Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.
    transform_queue_size=4.0,  # default 4.0, Larger values will result in slower performance but more accurate nearest neighbor evaluation.
    a=None,  # default None, More specific parameters controlling the embedding. If None these values are set automatically as determined by ``min_dist`` and ``spread``.
    b=None,  # default None, More specific parameters controlling the embedding. If None these values are set automatically as determined by ``min_dist`` and ``spread``.
    random_state=42,  # default None, If int, random_state is the seed used by the random number generator.
    metric_kwds=None,  # default None, Arguments to pass on to the metric, such as the ``p`` value for Minkowski distance.
    angular_rp_forest=False,  # default False, Whether to use an angular random projection forest to initialise the approximate nearest neighbor search.
    target_n_neighbors=-1,  # default -1, The number of nearest neighbors to use to construct the target simplicial set. If set to -1 use the ``n_neighbors`` value.
    #target_metric='categorical',  # default 'categorical', The metric used to measure distance for a target array when using supervised dimension reduction. By default this is 'categorical' which will measure distance in terms of whether categories match or are different.
    #target_metric_kwds=None,  # dict, default None, Keyword argument to pass to the target metric when performing supervised dimension reduction. If None then no arguments are passed on.
    #target_weight=0.5,  # default 0.5, weighting factor between data topology and target topology.
    transform_seed=42,  # default 42, Random seed used for the stochastic aspects of the transform operation.
    verbose=False,  # default False, Controls verbosity of logging.
    unique=False,  # default False, Controls if the rows of your data should be uniqued before being embedded.
)

# Fit and transform the data
X_trans = reducer.fit_transform(X)

# Check the shape of the new data
print('Shape of X_trans: ', X_trans.shape)

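The code above applies UMAP to our digits data and prints the shape of the transformed array to confirm that we have reduced the number of dimensions from 64 to 3.

Now we can use the chart function created earlier to visualize our three-dimensional digits data. We call the function with one simple line of code, passing the arrays we want to visualize.

chart(X_trans, y)

You can view the result here: https://chart-studio.plotly.com/create/?fid=SolClover:166#/

The result looks very good, with clear separation between the digit clusters. Interestingly, the digit 1 forms three distinct clusters, which can be explained by the different ways people write it.

Note that a 1 written with a base stroke looks a lot like the digit 2. We can find these cases in a small cluster of red 1s that sits very close to the green 2s.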

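Supervised UMAP

UMAP can also be used in a supervised manner to help reduce the dimensionality of the data.

When performing supervised dimensionality reduction, we pass the label data (the y_train array) to the fit_transform method in addition to the image data (the X_train array); see the code below.

I have also made a few other small changes to the hyperparameters, setting min_dist=0.5 and local_connectivity=2, for better visualization and better results on the test sample.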

# Split data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=False)

# Configure UMAP hyperparameters
reducer2 = UMAP(n_neighbors=100,
                n_components=3,
                n_epochs=1000,
                min_dist=0.5,
                local_connectivity=2,
                random_state=42,
                )

# Training on the digits data - this time we also pass the true labels to the fit_transform method
X_train_res = reducer2.fit_transform(X_train, y_train)

# Apply on a test set
X_test_res = reducer2.transform(X_test)

# Print the shape of new arrays
print('Shape of X_train_res: ', X_train_res.shape)
print('Shape of X_test_res: ', X_test_res.shape)

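Now that we have successfully reduced the dimensions with the supervised UMAP approach, we can draw a 3D scatter plot to show the results.

chart(X_train_res, y_train)

https://chart-studio.plotly.com/create/?fid=SolClover:169

We can see that UMAP has formed very tight clusters, with considerable distance between the digits.

Now we create the same 3D plot for the test data to see whether UMAP has managed to place the new data points into these clusters.

chart(X_test_res, y_test)

https://chart-studio.plotly.com/create/?fid=SolClover:172

The result is very good, with only a few digits placed in the wrong cluster. In particular, the algorithm seems to have struggled with the digit 3: a few examples sit next to the 7s, 8s, and 5s.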
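The visual impression can be backed by a number. The sketch below is one simple way to do that and is not part of the original walkthrough: it assigns each test point to the nearest training-cluster centroid in the embedding and reports the share that land on their true label. The function name nearest_centroid_accuracy and the toy coordinates are made up for illustration.

```python
import math

def nearest_centroid_accuracy(train_points, train_labels, test_points, test_labels):
    """Share of test points whose nearest training centroid has their true label."""
    # Accumulate coordinate sums and counts per label.
    sums, counts = {}, {}
    for p, label in zip(train_points, train_labels):
        if label not in sums:
            sums[label] = [0.0] * len(p)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], p)]
        counts[label] += 1
    centroids = {label: [s / counts[label] for s in sums[label]] for label in sums}

    # Assign every test point to its closest centroid and count the hits.
    hits = 0
    for p, label in zip(test_points, test_labels):
        predicted = min(centroids, key=lambda c: math.dist(p, centroids[c]))
        hits += int(predicted == label)
    return hits / len(test_points)

# Made-up 3D embedding coordinates: two tight clusters and three test points,
# the last of which sits in the wrong cluster.
train_pts = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0), (5.0, 6.0, 5.0)]
train_lbl = [0, 0, 1, 1]
test_pts = [(0.0, 0.4, 0.0), (5.0, 5.2, 5.0), (4.5, 5.0, 5.0)]
test_lbl = [0, 1, 0]

accuracy = nearest_centroid_accuracy(train_pts, train_lbl, test_pts, test_lbl)
print(accuracy)
```

With the arrays from the supervised example, the call would look like nearest_centroid_accuracy(X_train_res.tolist(), y_train.tolist(), X_test_res.tolist(), y_test.tolist()).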

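Summary

Thank you for reading this long article. I hope each part of it has given you a deeper understanding of how this great algorithm works.

In general, UMAP has solid mathematical foundations, and it often does a better job than similar dimensionality reduction algorithms such as t-SNE.

The secret of UMAP lies in its ability to infer both local and global structure while preserving relative global distances in the low-dimensional space. This ability lets us uncover specific insights, such as the similarity between the handwritten forms of the digits 1 and 2.

Author: Saul Dobilas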
