Beyond Ordinary PCA: Nonlinear Principal Component Analysis
TL;DR: PCA cannot handle categorical variables because it makes linear assumptions about them. Nonlinear PCA addresses this issue by warping the feature space to optimize explained variance. (Key points at bottom.)
Principal Component Analysis (PCA) has been one of the most powerful unsupervised learning techniques in machine learning. Given multi-dimensional data, PCA will find a reduced number of n uncorrelated (orthogonal) dimensions, attempting to retain as much variance in the original dataset as possible. It does this by constructing new features (principal components) as linear combinations of existing columns.
However, PCA cannot handle nominal columns (categorical, like state) or ordinal columns (categorical and sequential, like letter grades: A+, B-, C, …). This is because a metric like variance, which PCA explicitly attempts to model, is an inherently numerical measure. If one were to use PCA on data with nominal and ordinal columns, it would end up making silly assumptions like ‘California is one-half New Jersey’ or ‘A+ minus four equals D’, since it must impose those kinds of numerical relationships to operate.
Rephrased from a mathematical perspective, PCA relies on linear relationships: the assumption that the distance between “strongly disagree” and “disagree” is the same as the distance from “disagree” to “neutral”. In almost every real-world dataset, such linear relationships do not hold for all columns.
Additionally, using one-hot encoding (converting categorical data into vectors of ones and zeroes) results in an extremely sparse, information-parched multidimensional space on which PCA cannot perform well, since many of the resulting features contain only two unique values.
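To see the sparsity concretely, here is a minimal sketch with pandas (the state values are invented for illustration):

```python
import pandas as pd

# Toy nominal column (the states here are purely illustrative).
df = pd.DataFrame({"state": ["CA", "NJ", "WA", "ID", "CA", "WA"]})

# One-hot encoding explodes the single column into one 0/1 column per
# category, so every new feature holds only two unique values.
onehot = pd.get_dummies(df["state"])
print(onehot.shape)  # (6, 4): six rows, four sparse binary columns
```

With hundreds of categories, the encoded matrix is almost entirely zeros, which is exactly the regime where PCA struggles.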
Nonlinear PCA rectifies this aspect of PCA by generalizing its methods to perform dimensionality reduction not only on numerical features, but on categorical and ordinal variables as well. This is done through categorical quantification.
Categorical quantification (CQ) is exactly what its name suggests: it attaches a numerical representation to each category, converting categorical columns into numerical ones, such that the performance of the PCA model (like explained variance) is maximized. CQ optimally places categories on a numerical dimension instead of making assumptions about them.
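The idea can be sketched with a tiny brute-force search (data and grid are invented; real implementations such as CATPCA use alternating least squares rather than grid search):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
numeric = rng.normal(size=(30, 2))     # two ordinary numeric features
labels = rng.integers(0, 3, size=30)   # one nominal feature with 3 categories

def explained_variance_pc1(codes):
    """Variance ratio of the first PC after quantifying the nominal column."""
    quantified = np.asarray(codes, dtype=float)[labels]
    if quantified.std() == 0:          # degenerate quantification: skip it
        return 0.0
    X = np.column_stack([numeric, quantified])
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each column
    eigvals = np.linalg.eigvalsh(np.cov(X.T))  # ascending eigenvalues
    return eigvals[-1] / eigvals.sum()

# Try every assignment of grid values to the 3 categories and keep the one
# that lets the first principal component explain the most variance.
grid = [-1.0, 0.0, 1.0, 2.0]
best = max(itertools.product(grid, repeat=3), key=explained_variance_pc1)
print("best quantification:", best)
print("explained variance:", round(explained_variance_pc1(best), 3))
```

The winning codes are the categorical quantification: category placements chosen purely because they help the PCA model, not assumed in advance.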
This information can be very enriching. For instance, we might be able to say that Washington and Idaho have very similar structures in other parts of the data because they are placed so closely, or that California and Virginia are nothing alike because they are placed far apart. In this sense, CQ not only enriches the PCA model with categorical data but also gives us a look into the structure of the data by state.
An alternative view of CQ is through a line plot. Although for nominal data the order of categories is arbitrary and connecting lines are not strictly necessary, visualizing it this way demonstrates the nominal level of analysis: if a feature’s level is specified as nominal, each category can take on any numerical value.
On the other hand, if a feature’s level is specified as ordinal, the restriction is that the order must be preserved. For instance, the relation that ‘A’ is better than ‘B’ must be kept; this can be represented with A=0 and B=5 (assuming 0 is the best) or with A=25 and B=26, as long as B is never less than A. This helps retain the structure of ordinal data.
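The order restriction can be sketched in a couple of lines (the grades and raw codes are invented; real optimal-scaling software uses isotonic regression for the least-squares monotone projection, while a running maximum shown here is just the simplest way to enforce the constraint):

```python
import numpy as np

grades = ["A", "B", "C", "D", "F"]         # best to worst
raw = np.array([0.0, 0.2, 1.5, 1.3, 4.0])  # unconstrained codes: D < C!

# A running maximum forces each code to be >= the previous one,
# so a worse grade never receives a smaller value than a better one.
monotone = np.maximum.accumulate(raw)
print(dict(zip(grades, monotone.tolist())))
# {'A': 0.0, 'B': 0.2, 'C': 1.5, 'D': 1.5, 'F': 4.0}
```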
Note that each successive quantified value is greater than or equal to the previous one; this is the restriction on ordinal-level features.

Like CQ for nominal data, this is tremendously insightful. For instance, we notice that within the pluses and minuses of a letter grade (A+, A, A-) there is not much difference, but the gap between X- and Y+ (X and Y being consecutive letters) always produces a large jump, particularly between D and F. To reiterate the point above: this chart is generated by finding optimal values for the categories such that the PCA model performs best (explained variance is highest).
Note that because CQ determines the space between data points (e.g. that the difference between A and A- is much less than that of D and F), it warps the space in which these points lie. Instead of assuming a linear relationship (A and A- are as close as D and F), CQ distorts the distances between common intervals — hence, nonlinear PCA.
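The warping can be made concrete by comparing an equal-step coding of grades with a quantified coding (all numeric values below are invented for illustration):

```python
# Linear coding assumes equal steps between adjacent grades.
linear = {"A": 0, "A-": 1, "D": 9, "F": 10}

# A quantification can shrink some gaps and stretch others,
# warping the space nonlinearly.
quantified = {"A": 0.0, "A-": 0.05, "D": 6.0, "F": 9.5}

print(linear["A-"] - linear["A"], linear["F"] - linear["D"])                  # 1 1
print(quantified["A-"] - quantified["A"], quantified["F"] - quantified["D"])  # 0.05 3.5
```

Under the linear coding every adjacent pair is equidistant; under the quantified coding, A and A- nearly coincide while D and F sit far apart.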
To give an idea of the nonlinearities that can arise when the distances between sequential intervals are altered, here is a 3 by 3 square in distorted space:
By using categorical quantification, the feature space is distorted (in a good way!): intervals are selectively chosen such that the performance of PCA is maximized. Nonlinear PCA, in this sense, can be thought of not only as an encoding method for ordinal and nominal variables but also as a way to increase the overall strength of the PCA model.
Although the mathematics behind nonlinear PCA is very rich, NPCA generally uses the same machinery as PCA (eigenvalue decomposition and so on), but applies CQ first to draw the most information and benefit out of the model.
Key Points
- PCA cannot handle nominal (categorical) or ordinal (sequential) columns because it is an inherently numerical algorithm and makes silly linear assumptions about these types of data.
- Nonlinear PCA uses categorical quantification, which finds the best numerical representation of unique column values such that the performance (explained variance) of the PCA model using the transformed columns is optimized.
- Categorical quantification is a very insightful data mining method, and can reveal a great deal about the structures of the data through the lens of a categorical variable. Unfortunately, using nonlinear PCA means the coefficients of principal components are less interpretable (still understandable, just with less statistical rigor).
All images created by author.
Translated from: https://medium.com/analytics-vidhya/beyond-ordinary-pca-nonlinear-principal-component-analysis-54a93915a702