Dimensionality Reduction: Novice to Ninja (Part 1)
What is high dimensional data?
When the number of features exceeds the number of observations, the data is known as high dimensional data, and it increases the computational complexity. Generally, any dataset with more than 10 features is considered a high dimensional dataset. One of the most widely used high dimensional datasets is the MNIST dataset.
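As a quick, concrete check (a sketch using scikit-learn's bundled digits dataset, a small 8×8-pixel stand-in for full MNIST), each flattened image already carries 64 features:

```python
# Load a small MNIST-like dataset and inspect its dimensionality.
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
print(X.shape)  # (1797, 64): 1797 images, 64 pixel features each
```

With 64 features per observation, this already counts as high dimensional by the rule of thumb above.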
Why do we need to reduce the dimensionality of a dataset?
The greater the dimension, the more complex the computation, so it requires more powerful hardware. Reducing the dimensionality can give better and more efficient results even on a computer with low computational power. Some benefits of dimensionality reduction are:
i>It makes computation less complex. ii>It requires less disk space. iii>It lowers the chance of model overfitting.
Why do we avoid model overfitting?
When we build a model, it should not be overly accurate on the training data. An overly accurate model will no longer generalize to new datasets, and as a result it will not give efficient and effective forecasts.
fig:1圖。1Dimensionality Reduction Techniques;
降維技術(shù);
i>t-SNE:- t-distributed Stochastic Neighbourhood Embedding
ii>PCA:- Principal Component Analysis
1. t-SNE
Let’s understand each and every piece of t-SNE in detail. It is the most commonly used technique for high dimensional data visualization and gives a clear and precise picture of high dimensional data. It uses a feature extraction technique.
fig:-t-SNE圖:-t-SNEt:- t-Distribution ;
t:-t分布
The t-Distribution, or Student's t-Distribution, is a bell shaped curve quite similar to the Gaussian distribution, but it has a heavier tail, which gives more weight to extreme values.
N:- Neighbourhood;
fig:-2圖:-2Consider Rectangular box as a high dimensional space.consider spherical portion and take a point as “xi” and remaining point “xj” in the spherical space will be neighbourhood point of “xi”.How we can determine whether “xi” and “xj” are geometrically close or not?
將矩形框視為高維空間。考慮球形部分,將一個點作為“ xi”,而球形空間中的剩余點“ xj”將成為“ xi”的鄰點。如何確定“ xi”和“ xj”幾何上是否接近?
If “xi” and “xj” satisfy the formula above, then both will be considered neighbourhood points; but points outside the spherical region will not be neighbourhood points of “xi”. Here we will use the embedding concept.
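The formula referred to above did not survive extraction. In the standard t-SNE formulation (a reconstruction, not recovered from the original text), the closeness of “xj” to “xi” in the high dimensional space is measured by a Gaussian-based conditional probability:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$

Points far outside the sphere get a vanishingly small probability, which matches the intuition above: only nearby points effectively count as neighbours of “xi”.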
E:- Embedding;
It takes the points from the high dimensional space and puts them into a low dimensional space. With the help of embedding, we will resolve the neighbourhood issue mentioned above.
fig:-3圖:-3S:- Stochastic;
S:-隨機(jī);
Most commonly, stochastic means probabilistic. How does it affect the data points in a t-SNE model?
If we run the t-SNE model multiple times with the same parameter values, it can give different visual results each time. So when doing high dimensional visualization we cannot reach a conclusion from a single plot; we have to do multiple visualizations with multiple parameter settings before drawing a conclusion.
For these multiple runs we have to change the perplexity every time, and its value should not be more than the number of data points.
Perplexity < number of datapoints.
What is perplexity?
Let’s take perplexity = 5: it means there will be five neighbourhood points of “xi”.
Coding implementation of t-SNE;
from sklearn.manifold import TSNE
df = TSNE(n_components=2, random_state=0, perplexity=6, n_iter=500)
n_components converts the data into the required number of dimensions; if we take n_components=2, the n-dimensional data will be converted into a 2-D dataset. random_state=0 fixes the random seed so repeated runs with the same parameters give the same result; without a fixed seed, each run can give a different result. n_iter=500 sets the number of optimization iterations; more iterations give the embedding more time to converge, at a higher computational cost.
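A minimal runnable sketch of the snippet above, run twice with different perplexity values to make the multiple-visualization point concrete (the subsample size of 200 is an arbitrary choice to keep the example fast):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X = X[:200]  # subsample so the example runs quickly

embeddings = {}
for perplexity in (5, 30):
    # perplexity must stay below the number of data points (here 200)
    tsne = TSNE(n_components=2, random_state=0, perplexity=perplexity)
    embeddings[perplexity] = tsne.fit_transform(X)

print(embeddings[5].shape, embeddings[30].shape)  # (200, 2) (200, 2)
```

Each resulting embedding should then be plotted and compared before drawing any conclusion about the structure of the data.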
2. PCA
PCA is one of the most commonly used dimensionality reduction techniques. It is used for machine learning modeling and high dimensional data visualization. It uses a feature selection technique: it keeps the most important features and removes all the less important ones. Variance plays a major role here; data with high spread is considered more important than features with less spread.
In PCA, features are converted into components, i.e., once all the duplicate data and missing data have been removed, the features are known as components, which give a precise model of the data for further use.
fig:-PCA圖:-PCALet’s take an example of converting two dimensional data into one dimensional data having features f1 and f2 Respectively.
讓我們以將二維數(shù)據(jù)轉(zhuǎn)換為分別具有特征f1和f2的一維數(shù)據(jù)為例。
fig:-4圖:-4According to fig:-4 spreadness is high along y-Axis and low along x-Axis and as we know data with low variance will be removed from PCA.Hence in a process of converting 2D into 1D we will loose feature “F1”.now our new dataset will look like;
根據(jù)圖:-4,y軸上的擴(kuò)散度較高,x軸上的擴(kuò)散度較低,并且眾所周知,方差較小的數(shù)據(jù)將從PCA中刪除。因此,在將2D轉(zhuǎn)換為1D的過程中,我們將失去特征“ F1”。現(xiàn)在我們的新數(shù)據(jù)集將看起來像;
Let’s Analyse PCA in Detail:
fig:-5圖:-5As we see fig:-5 , 1st image is implicating that both the features(f1&f2) have same variance then how would we convert 2-D Dataset into 1-D Dataset ?
如圖5所示,第一張圖片暗示兩個特征(f1&f2)具有相同的方差,那么我們?nèi)绾螌?-D數(shù)據(jù)集轉(zhuǎn)換為1-D數(shù)據(jù)集?
So if you observe the second image, you will see that I have rotated f1 in such a way that the points xi have maximum variance along it. The rotated feature f1 is now known as f1', and we take f2' perpendicular to f1'.
Why have we taken f1 ⊥ f2?
We have been rotating f1 towards the direction of maximum variance, and we cannot rotate f1 alone and spare f2; both must be rotated. We want the rotation of “f1” not to change the relationship between “f1 & f2”, and for that we take f1 ⊥ f2. Now when we rotate f1 by angle “Θ”, f2 will also be rotated by angle “Θ”, which results in no change in the relationship between f1' and f2'.
Now, after observing the above process, we can define PCA as:
We want to find f1' in such a way that the variance of the points “xi” projected onto f1' is maximum.
Understanding projection in detail;
fig:-6圖:-6Here U1=unit vector,||U1||=1;
這里U1 =單位向量,|| U1 || = 1;
Now considering the definition of PCA and making a mathematical formulation from it;
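Written out explicitly (a standard reconstruction; the original formula image is missing), PCA looks for the unit vector u along f1' that maximizes the variance of the projected points, assuming the xi are mean-centred:

$$u^{*} = \arg\max_{\lVert u \rVert = 1} \frac{1}{n} \sum_{i=1}^{n} \left(u^{T} x_i\right)^2$$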
PCA Using Distance Minimization:
Before using the distance minimization technique, we have to consider the following;
All the features should be in the same units, so that when we calculate the distance it gives a meaningful variation. Suppose we are measuring weight and we have two features, weight in kilograms and weight in pounds; the raw distance would then mix two different measurement scales.
So before applying the distance minimization technique, we standardize the features to mean 0 and variance 1, which makes all the features unit independent; this process is known as feature standardization.
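A sketch of that standardization step (the kg/pound values below are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same quantity recorded twice: weight in kg and in pounds.
X = np.array([[70.0, 154.0],
              [80.0, 176.0],
              [60.0, 132.0]])

X_std = StandardScaler().fit_transform(X)  # z-score each column
print(X_std.mean(axis=0))  # each feature now has mean ~0
print(X_std.std(axis=0))   # ...and unit variance
```

After this, distances no longer depend on the units a feature happened to be measured in.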
According to the figure;
d = perpendicular, ||Xi|| = hypotenuse, Proj = base; now we apply the Pythagorean theorem;
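Applying the theorem (a standard reconstruction of the missing derivation), with u the unit vector along the new axis:

$$d_i^2 = \lVert x_i \rVert^2 - \left(u^{T} x_i\right)^2$$

Since $\lVert x_i \rVert^2$ is fixed by the data, minimizing $\sum_i d_i^2$ over unit vectors u is equivalent to maximizing $\sum_i (u^{T} x_i)^2$, i.e. the variance-maximization view of PCA described earlier.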
Solution For the Optimization Problem (λ, V):-
This section gives us an approach to solve any dimensionality reduction question using PCA. Here;
λ = eigenvalue & V = eigenvector
Let us have a matrix dataset X;
STEP I:-
Take the matrix dataset “X” and standardize it.
STEP II:-
Take the covariance of the standardized matrix “X”. The covariance matrix is denoted by S.
STEP III:-
Compute the eigenvalues and eigenvectors of S.
STEP IV:-
This step decides how many dimensions you want to extract from the n-dimensional data. In my case I am extracting 3-dimensional data, hence;
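Steps I–IV can be sketched end to end with NumPy (the toy 100×5 random dataset is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # toy dataset: 100 samples, 5 features

# STEP I: standardize X
X = (X - X.mean(axis=0)) / X.std(axis=0)

# STEP II: covariance matrix S of the standardized data
S = np.cov(X, rowvar=False)

# STEP III: eigenvalues (λ) and eigenvectors (V) of S
eigvals, eigvecs = np.linalg.eigh(S)     # eigh: S is symmetric

# STEP IV: keep the eigenvectors of the 3 largest eigenvalues
top3 = eigvecs[:, np.argsort(eigvals)[::-1][:3]]
X_reduced = X @ top3
print(X_reduced.shape)  # (100, 3)
```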
CODE:
from sklearn.decomposition import PCA
df2 = PCA(n_components=3)
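A usage sketch for the snippet above (the toy data is an assumption; in practice df2 would be fit on your standardized dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # stand-in for a real standardized dataset

df2 = PCA(n_components=3)
X_reduced = df2.fit_transform(X)
print(X_reduced.shape)                      # (100, 3)
print(df2.explained_variance_ratio_.sum())  # fraction of variance retained
```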
References;
Applied AI; Josh Starmer.
Translated from: https://medium.com/swlh/dimensionality-reduction-novice-to-ninja-part1-fcbcb7f59d8c