

Data Normalization: MATLAB and Python Implementations

Published: 2023/12/13

For more details, see:
https://www.zhihu.com/question/20455227

In short, the purpose of normalization is to make otherwise incomparable data comparable, while preserving the relative relationships between the values being compared.
When the data has many dimensions, normalization prevents one dimension (or a few) from having an outsized influence; it can also make programs run faster.

Normalization should be applied per attribute (feature), not per sample. Normalizing each sample individually is meaningless: it is merely a proportional rescaling of that sample and contributes nothing to subsequent classification.

Normalization methods: for details, see
http://blog.csdn.net/yudf2010/article/details/40779953
http://blog.csdn.net/facingthesuncn/article/details/17258415
http://blog.csdn.net/acdreamers/article/details/44664205
http://blog.csdn.net/junmuzi/article/details/48917361
http://blog.csdn.net/lkj345/article/details/50352385

Three common methods are used to perform feature normalization in machine learning algorithms.

Rescaling
The simplest method is rescaling the range of features with a linear function. The common formulas are:

x' = (x - min(x)) / (max(x) - min(x))        (1)

x' = 2 * (x - min(x)) / (max(x) - min(x)) - 1        (2)

where x is the original value and x' is the normalized value.

Equation (1) rescales data into [0, 1], and equation (2) rescales data into [-1, 1].

Note: the parameters max(x) and min(x) must be computed on the training data only, and then applied to the training, validation, and test data.
We must scale the training and test data with the same mapping. For example, suppose scaling maps the first attribute of the training data from [-20, +20] to [-1, +1]; if the first attribute of the test data lies in [-30, +35], the same mapping sends it to [-1.5, +1.75].
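This rule can be sketched in plain Python (illustrative names, not from the original post): fit the mapping on the training data once, then apply it unchanged to any later data, even if that sends values outside [-1, +1]:

```python
def fit_minmax(values, lo=-1.0, hi=1.0):
    """Record the training-set min/max together with the target range."""
    return {"xmin": min(values), "xmax": max(values), "lo": lo, "hi": hi}

def apply_minmax(values, params):
    """Apply a previously fitted mapping; test values may land outside [lo, hi]."""
    span = params["xmax"] - params["xmin"]
    scale = (params["hi"] - params["lo"]) / span
    return [params["lo"] + (v - params["xmin"]) * scale for v in values]

params = fit_minmax([-20.0, 20.0])          # fitted on training data only
print(apply_minmax([-30.0, 35.0], params))  # → [-1.5, 1.75]
```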

The main advantage of scaling is to avoid attributes in larger numeric ranges dominating those in smaller ranges. Another advantage is to avoid numerical difficulties during the calculation.

A linear transform maps a data set into a chosen interval. For example, to map all the data into [1, 100], use
y = ((x - min) / (max - min)) * (100 - 1) + 1
where min is the minimum of the data set and max its maximum.
Likewise, to map all the data into [a, b], use
y = ((x - min) / (max - min)) * (b - a) + a
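The [a, b] formula translates directly into Python (a minimal sketch; the function name is ours):

```python
def rescale(xs, a, b):
    """Linearly map the values in xs into the interval [a, b]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) * (b - a) + a for x in xs]

print(rescale([0, 50, 100], 1, 100))  # → [1.0, 50.5, 100.0]
```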

There are also methods that normalize features with a non-linear function, such as the

logarithmic function: y = log10(x) / log10(max(x))

inverse tangent function: y = (2 / pi) * arctan(x)

sigmoid function: y = 1 / (1 + e^(-x))
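Assuming the common formulations above (for the logarithmic map the data must be positive, with a maximum greater than 1), the three non-linear maps can be sketched as:

```python
import math

def log_norm(xs):
    """Logarithmic normalization: the maximum of the data maps to 1."""
    m = math.log10(max(xs))
    return [math.log10(x) / m for x in xs]

def atan_norm(xs):
    """Inverse-tangent normalization: maps the real line into (-1, 1)."""
    return [math.atan(x) * 2 / math.pi for x in xs]

def sigmoid_norm(xs):
    """Sigmoid normalization: maps the real line into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in xs]

print(log_norm([1, 10, 100]))  # → [0.0, 0.5, 1.0]
print(sigmoid_norm([0.0]))     # → [0.5]
```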

Standardization
Feature standardization makes the values of each feature in the data have zero mean and unit variance. This method is widely used for normalization in many machine learning algorithms (e.g., support vector machines, logistic regression, and neural networks). The general formula is:

x' = (x - mu) / sigma

where mu is the mean and sigma is the standard deviation of the feature x.

A worked example (MATLAB):

% X is the training data (rows = samples, columns = features)
X = [ 790 3977  849 1294 1927 1105 204 1329;
      768 5037 1135 1330 1925 1459 275 1487;
      942 2793  820  814 1617  942 155  976;
      916 2798  901  932 1599  910 182 1135;
     1006 2864 1052 1005 1618  839 196 1081];

% Method 1
[Z, mu, sigma] = zscore(X)

% Method 2
[s, t] = size(X);
Y = (X - repmat(mean(X), s, 1)) ./ repmat(std(X), s, 1);

% Xnew is the test data: standardize it with the training-set mu and sigma
[m, n] = size(Xnew);
Ynew = (Xnew - repmat(mu, m, 1)) ./ repmat(sigma, m, 1)
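A Python counterpart of the MATLAB standardization, without third-party libraries (a minimal sketch; like MATLAB's std, sigma here uses the sample formula with N - 1):

```python
import math

def zscore_fit(column):
    """Return (mu, sigma) for one feature; sigma uses N-1, as MATLAB's std does."""
    n = len(column)
    mu = sum(column) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in column) / (n - 1))
    return mu, sigma

def zscore_apply(column, mu, sigma):
    """Standardize feature values with a previously fitted mu and sigma."""
    return [(v - mu) / sigma for v in column]

train = [790, 768, 942, 916, 1006]       # e.g. one feature column of training data
mu, sigma = zscore_fit(train)            # fit on the training data only
z_new = zscore_apply([900], mu, sigma)   # reuse mu, sigma for new data
```

As in the MATLAB snippet, test data is standardized with the training set's mu and sigma, never with its own.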

Scaling to unit length
Another option widely used in machine learning is to scale the components of a feature vector such that the complete vector has length one:

x' = x / ||x||

This is especially important if a scalar metric such as the Euclidean distance is used as the distance measure in the subsequent learning steps.
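In Python, unit-length scaling is one line per component (a minimal sketch using the Euclidean norm):

```python
import math

def unit_scale(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

print(unit_scale([3.0, 4.0]))  # → [0.6, 0.8]
```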

3. Some cases where you don't need data normalization

3.1 Using a similarity function instead of a distance function

You can propose a similarity function rather than a distance function and plug it into a kernel (technically, this function must generate positive-definite matrices).

3.2 Random Forest

For random forests, on the other hand, the ranges don't matter, because one feature is never compared in magnitude to another: each split considers the range of a single feature only.

Random Forest is invariant to monotonic transformations of individual features: translations or per-feature scalings change nothing for it. SVM, on the other hand, will probably do better if the features have roughly the same magnitude, unless you know a priori that some feature is much more important than the others, in which case it is fine for that feature to have a larger magnitude.

Notes:
(1) In classification and clustering algorithms, when distances are used to measure similarity, or when PCA is used for dimensionality reduction, the second method (Z-score standardization) performs better.
(2) When no distance metric or covariance computation is involved, or the data does not follow a normal distribution, the first method or another normalization can be used. For example, in image processing, after converting an RGB image to grayscale its values are confined to the range [0, 255].

When data of different types must be fused, the different types likewise have to be normalized before being combined.

For Python implementations of normalization, see:
莫烦python - Sklearn
Using sklearn for data preprocessing: normalization / standardization / regularization
Data preprocessing in machine learning (sklearn preprocessing)


Why do neural networks need normalization?
See:
http://nnetinfo.com/nninfo/showText.jsp?id=37

Do random forests need normalization?
See:
https://www.quora.com/Should-inputs-to-random-forests-be-normalized?srid=3EJy&st=ns
http://stackoverflow.com/questions/8961586/do-i-need-to-normalize-or-scale-data-for-randomforest-r-package
http://stats.stackexchange.com/questions/41820/random-forests-with-bagging-and-range-of-feature-values

Notes:
See:
https://www.zhihu.com/question/20455227
https://www.zhihu.com/people/maigo
Question 1: When analyzing data, in which situations does the data need to be standardized?
Answer: It mainly depends on whether the model is invariant to rescaling.

For some models, the optimal solution after a non-uniform rescaling of the dimensions is not equivalent to the original one, e.g., SVM. For such models, unless the ranges of the dimensions are already close, standardization is required, so that the model parameters are not dominated by dimensions with particularly large or small ranges.

For other models, the optimal solution after a non-uniform rescaling is equivalent to the original one, e.g., logistic regression. For such models, standardization in theory does not change the optimum. In practice, however, the solution is found by an iterative algorithm, and if the objective function is too "elongated" the iteration may converge slowly or not at all. So even for rescaling-invariant models it is best to standardize the data.

See:
https://www.zhihu.com/question/30038463/answer/50491149

Question 2: What normalization methods are used in machine learning, and what data is each suited to?
Answer:
1. Min-max normalization, e.g., mapping the maximum to 1 and the minimum to -1, or the maximum to 1 and the minimum to 0. Suitable for data that is already confined to a bounded range.
2. Mean-variance normalization (standardization), usually mapping the mean to 0 and the variance to 1. Suitable for distributions without clear bounds; it is also less sensitive to outliers.
See:
https://www.zhihu.com/question/26546711/answer/62085061

Question 3: Why does feature scaling make gradient descent converge better?
Answer: Without normalization, features with very different ranges make the objective function "elongated":

[Figure: the ellipses are contour lines of the objective function; the two axes are the two features.]

During gradient descent the gradient direction then deviates from the direction of the minimum, so the iteration takes a winding path.

With normalization, the objective function becomes "round":

[Figure: the contour lines after normalization.]

Now almost every gradient step points toward the minimum, so large strides can be taken.
See:
https://www.zhihu.com/question/37129350/answer/70592743
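The effect can be checked numerically with a small pure-Python experiment (illustrative, not from the original answer): run gradient descent on f(w1, w2) = (a*w1)^2 + (b*w2)^2, where a and b play the role of feature scales. With b = 10 the stable learning rate is limited by the steep direction, so progress along the flat direction crawls; with a = b the same tolerance is reached in a handful of steps:

```python
def steps_to_converge(a, b, lr, tol=1e-6, max_iter=100000):
    """Gradient descent on f(w1, w2) = (a*w1)^2 + (b*w2)^2 from (1, 1)."""
    w1, w2 = 1.0, 1.0
    for i in range(max_iter):
        if (a * w1) ** 2 + (b * w2) ** 2 < tol:
            return i
        w1 -= lr * 2 * a * a * w1   # df/dw1
        w2 -= lr * 2 * b * b * w2   # df/dw2
    return max_iter

# unscaled: one "feature" 10x larger forces a tiny stable learning rate
print(steps_to_converge(1.0, 10.0, lr=0.004))
# scaled: equal magnitudes allow a much larger step, far fewer iterations
print(steps_to_converge(1.0, 1.0, lr=0.4))
```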

Question 4: Should feature normalization be applied to the whole matrix, or to each feature dimension?
Answer: Normalizing the matrix as a whole is an isotropic rescaling, which accomplishes nothing.
Normalizing each dimension separately loses the per-dimension variance information, but preserves the correlation coefficients between dimensions.
If all dimensions already share the same unit, it is best not to normalize, so as to retain as much information as possible.
If the dimensions have different units, applying PCA directly is meaningless, so each dimension must first be normalized separately.
See:
https://www.zhihu.com/question/31186681/answer/50929278

MATLAB implementation

http://blog.csdn.net/breeze5428/article/details/27308995
http://blog.csdn.net/yb536/article/details/41050181

function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes the features in X
%   FEATURENORMALIZE(X) returns a normalized version of X where the mean
%   value of each feature is 0 and the standard deviation is 1. This is
%   often a good preprocessing step when working with learning algorithms.
%   X is a matrix where each column is a feature and each row is an
%   example, so the normalization is performed separately for each feature.

n = size(X, 2);        % number of features
mu = zeros(1, n);
sigma = zeros(1, n);

for j = 1:n
    mu(1, j) = mean(X(:, j));
    X(:, j) = X(:, j) - mu(1, j);      % subtract the feature mean
    sigma(1, j) = std(X(:, j));
    X(:, j) = X(:, j) ./ sigma(1, j);  % divide by the feature std
end

X_norm = X;
end

In MATLAB one can also use

[Z, mu, sigma] = zscore(X)

to standardize each dimension, producing Z. Once mu and sigma are known, a new feature vector x can be standardized with z = (x - mu) ./ sigma.

Normalizing training and test sets in MATLAB

Whether the training and test sets should be normalized together is debatable. Doing so lets the test set be influenced by the training set, so the two sets are no longer independent. The correct approach is to record the normalization mapping fitted on the training set and use it to normalize the test set separately; MATLAB's mapminmax function provides exactly this mechanism.

A new sample should likewise first be normalized with the training set's mapping, and only then classified. For example:

inst = [1 2 3 4; 2 3 4 5; 3 4 5 6];
[inst_norm, settings] = mapminmax(inst);
test = [1 3 5]';
test_norm = mapminmax('apply', test, settings)

test_norm =

   -1.0000
   -0.3333
    0.3333

Explanation: inst has 3 rows and 4 columns; each column is one sample and each row one attribute, i.e., there are 4 samples and 3 attributes. Since normalization is performed per attribute, each attribute gets its own normalization function:
y = (ymax - ymin) * (x - xmin) / (xmax - xmin) + ymin, here with ymax = 1 and ymin = -1.

Attribute 1: y = 2*(x-1)/(4-1) + (-1);  Attribute 2: y = 2*(x-2)/(5-2) + (-1);  Attribute 3: y = 2*(x-3)/(6-3) + (-1)

So normalizing test = [1 3 5]' gives the attribute values

Attribute 1: y = 2*(1-1)/(4-1) + (-1) = -1;  Attribute 2: y = 2*(3-2)/(5-2) + (-1) = -0.333;  Attribute 3: y = 2*(5-3)/(6-3) + (-1) = 0.333
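The same fit-then-apply logic per attribute is easy to mimic in Python (an illustrative sketch of what mapminmax does, not a MATLAB binding; as above, rows are attributes and columns are samples):

```python
def mapminmax_fit(rows, ymin=-1.0, ymax=1.0):
    """Record per-row (per-attribute) min/max together with the target range."""
    return [(min(r), max(r), ymin, ymax) for r in rows]

def mapminmax_apply(sample, settings):
    """Normalize one sample (one value per attribute) with fitted settings."""
    return [(ymax - ymin) * (x - lo) / (hi - lo) + ymin
            for x, (lo, hi, ymin, ymax) in zip(sample, settings)]

inst = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]]  # 3 attributes, 4 samples
settings = mapminmax_fit(inst)
print(mapminmax_apply([1, 3, 5], settings))  # ≈ [-1.0, -0.333, 0.333]
```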

Here the data is always arranged with samples as columns and attributes as rows. The structure settings records the training set's normalization mapping. Note that mapminmax skips NaN entries; the best practice is to set NaNs to 0 after normalizing:

inst_norm(find(isnan(inst_norm))) = 0;

For details see
http://blog.csdn.net/lkj345/article/details/50352385

Below, the usage of mapminmax is examined in detail.

[Y, PS] = mapminmax(X)
[Y, PS] = mapminmax(X, FP)
Y = mapminmax('apply', X, PS)
X = mapminmax('reverse', Y, PS)

Let us work through an example with the test data x1 = [1 2 4] and x2 = [5 2 3]:

[y, ps] = mapminmax(x1)

y =

   -1.0000   -0.3333    1.0000

ps =

         name: 'mapminmax'
        xrows: 1
         xmax: 4
         xmin: 1
       xrange: 3
        yrows: 1
         ymax: 1
         ymin: -1
       yrange: 2
    no_change: 0
         gain: 0.6667
      xoffset: 1

Here y is the data obtained by normalizing x1, and the normalization mapping is recorded in the structure ps. What exactly does this mapping look like?

Algorithm
It is assumed that X has only finite real values, and that the elements of each row are not all equal.

y = (ymax-ymin)*(x-xmin)/(xmax-xmin) + ymin;

[A question about this algorithm: it assumes that the elements of each row are not all equal, so what if they are? If some row has all-equal elements, e.g. xt = [1 1 1], then xmax = xmin = 1, and MATLAB internally degenerates the transform to y = ymin; otherwise it would divide by zero, which is meaningless.]

In other words, applying the mapping f: 2*(x - xmin)/(xmax - xmin) + (-1) to x1 = [1 2 4] yields y = [-1.0000 -0.3333 1.0000].
Let us verify. For x1, xmin = 1 and xmax = 4, so
y(1) = 2*(1-1)/(4-1) + (-1) = -1;
y(2) = 2*(2-1)/(4-1) + (-1) = -1/3 = -0.3333;
y(3) = 2*(4-1)/(4-1) + (-1) = 1;
so this is indeed the mapping being applied.
In the mapping above, ymin and ymax are parameters that can be set by the user; they default to -1 and 1.
For example:

[y, ps] = mapminmax(x1);
ps.ymin = 0;
[y, ps] = mapminmax(x1, ps)

y =

        0    0.3333    1.0000

ps =

         name: 'mapminmax'
        xrows: 1
         xmax: 4
         xmin: 1
       xrange: 3
        yrows: 1
         ymax: 1
         ymin: 0
       yrange: 1
    no_change: 0
         gain: 0.3333
      xoffset: 1

The mapping is now f: 1*(x - xmin)/(xmax - xmin) + 0.

If x1 = [1 2 4] has been normalized in some way and we now want to normalize x2 = [5 2 3] with the same mapping, this can be done as follows:

[y1, ps] = mapminmax(x1);
y2 = mapminmax('apply', x2, ps)

y2 =

    1.6667   -0.3333    0.3333

That is, the normalization mapping fitted on x1, f: 2*(x-1)/(4-1) + (-1) (recorded in ps), is also applied to x2.
For x2 = [5 2 3], working through the mapping:
y2(1) = 2*(5-1)/(4-1) + (-1) = 5/3 = 1.6667
y2(2) = 2*(2-1)/(4-1) + (-1) = -1/3 = -0.3333
y2(3) = 2*(3-1)/(4-1) + (-1) = 1/3 = 0.3333

X = mapminmax('reverse', Y, PS) performs the inverse operation: it de-normalizes Y back into the original data:

[y1, ps] = mapminmax(x1);
xtt = mapminmax('reverse', y1, ps)

xtt =

     1     2     4

For details see
http://www.ilovematlab.cn/thread-47224-1-1.html


總結(jié)

以上是生活随笔為你收集整理的数据归一化matlab及python 实现的全部?jī)?nèi)容,希望文章能夠幫你解決所遇到的問(wèn)題。

如果覺(jué)得生活随笔網(wǎng)站內(nèi)容還不錯(cuò),歡迎將生活随笔推薦給好友。