Log Transform for Positivity
In simple terms, a log transform squashes or compresses the range of large numbers and expands the range of small numbers. So the larger x is, the more slowly log(x) grows.
Figure: log transform on the range [1, 1000]; the x-axis shows the raw value and the y-axis the log-transformed value.

If you look closely at the plot above, which shows the log transformation of values ranging from 1 to 1000, you can see that the log has mapped values from [1, 1000] into the range [0, ~7].
Note how the x values from 200 to 1000 get compressed into just ~5 to 7. So the larger x is, the slower log(x) increments.
Log is only defined for x > 0; log(0) is undefined. It is not a real number: suppose log₁₀(0) = x, so that 10^x = 0. If you try to solve this, you will find that no value of x satisfies it, because 10 raised to any real power is positive (10^0 = 1).
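In NumPy this surfaces as runtime warnings. A common workaround (my addition, not mentioned in the original post) is the log(1 + x) transform, np.log1p, which stays defined at zero:

```python
import numpy as np

print(np.log(0.0))    # -inf, with a 'divide by zero' RuntimeWarning
print(np.log(-1.0))   # nan, with an 'invalid value' RuntimeWarning

# log1p(x) = log(1 + x) is defined for all x >= 0
print(np.log1p(0.0))  # 0.0
```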
The log transform is also known as a variance-stabilizing transform, which is useful when dealing with heavy-tailed distributions. It can make highly skewed distributions much less skewed, reducing or even removing skewness in the data.
Figure: the log transform reduces or removes skewness and pulls the distribution toward normal.

Using log transform as a feature engineering technique:
To reduce or remove skewness in our data distribution and make it more normal (a.k.a. Gaussian), we can apply a log transformation to our input features (X).
Real-world data often has heavy-tailed distributions, where values are right-skewed (a long tail of larger values) or left-skewed (a long tail of smaller values). Algorithms can be sensitive to such distributions and may underperform if the range is not properly normalized.
Figures: a skewed distribution and its log-transformed counterpart.

It is common practice to apply a logarithmic transformation to the data so that very large and very small values do not negatively affect the performance of a learning algorithm. The log transform shrinks the range of values caused by outliers.
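To illustrate (a sketch of my own, assuming SciPy is available), a log transform pulls a right-skewed lognormal sample back toward normal:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # heavily right-skewed sample

print(skew(x))          # large positive skewness (roughly 5-6 here)
print(skew(np.log(x)))  # ~0: log(x) is normally distributed by construction
```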
However, it is important to remember that once the data has been log transformed, observing it in that form no longer carries the same meaning as the raw values.
The next question is: when we run a linear regression and obtain a coefficient for X (the independent variable), how do we interpret the coefficient of a log-transformed independent variable (i.e., its feature importance)?
For a log-transformed independent variable (X), divide the coefficient by 100. This tells us that a 1% increase in the independent variable increases (or decreases) the dependent variable by coefficient/100 units.
Example: the coefficient is 0.198. 0.198 / 100 = 0.00198, so for every 1% increase in the independent variable, the dependent variable increases by about 0.002 units.
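A minimal sketch of this interpretation with synthetic data (hypothetical numbers, assuming scikit-learn is available):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(1, 1000, size=500)                     # strictly positive feature
y = 0.198 * np.log(x) + rng.normal(0, 0.01, size=500)  # true log-linear relationship

model = LinearRegression().fit(np.log(x).reshape(-1, 1), y)
coef = model.coef_[0]
print(coef, coef / 100)  # ~0.198 and ~0.002: a 1% increase in x moves y by ~0.002
```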
Note: I'm also attaching a link below (reference 1) that dives deep into interpreting log-transformed features.
在目標(biāo)變量上使用對(duì)數(shù)變換: (Using Log transform on target variable:)
For example, consider a machine learning problem where you want to predict the price of a house from input features such as area, number of bedrooms, etc.
In this problem, if you fit a linear regression model of price (y) on X (area, number of bedrooms, ...) and optimize it with gradient descent, the dataset will contain some extreme prices (high-valued properties). Gradient descent will then focus disproportionately on those high-valued properties, because they produce large errors, and will yield a poor model. So log transforming the target variable makes sense when performing linear regression.

More importantly, linear regression can predict any real number, including negative values. If the model is far off, it can produce negative predictions, especially for some of the cheaper houses. Real-world quantities like price, income, and stock prices are positive, so it is good to log transform them before fitting a linear regression; otherwise the model may output negative predictions, which make no sense.
Figure: predicting house prices. In the example above, if you choose RMSE as the cost function, the model focuses on the high-valued properties and performs badly. If you instead optimize on log(actual) - log(predicted), the errors balance out intuitively and you get a good model.
Otherwise the model is under more pressure to correct the large errors on high-valued properties, so using the log here makes sense, as the sketch below shows.
Converting log predictions back to actual values: you ultimately need actual prices, not the log of prices, so you can always convert back by exponentiating the predicted value with np.exp, since exp(log(price)) = price.
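Here is a hedged end-to-end sketch of that round trip (hypothetical features and toy prices, assuming scikit-learn; the original post showed this step as a screenshot):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# X: [area, bedrooms]; y: strictly positive house prices (toy numbers)
X = np.array([[120, 3], [80, 2], [200, 4], [60, 1], [150, 3]], dtype=float)
y = np.array([300_000, 180_000, 650_000, 120_000, 420_000], dtype=float)

model = LinearRegression().fit(X, np.log(y))  # fit on log(price)
log_pred = model.predict(X)                   # predictions live in log space
price_pred = np.exp(log_pred)                 # convert back; always positive
print(price_pred.round())
```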
Log loss to improve models
Logarithmic loss (related to cross-entropy) measures the performance of a classification model whose output is a probability between 0 and 1. The goal of our machine learning models is to minimize this value; a perfect model has a log loss of 0. Log loss increases as the predicted probability diverges from the actual label, so predicting a probability of 0.012 when the actual label is 1 would be bad and result in a high log loss.
Figure: log loss in the binary classification setting. As the plot above shows, when the true value is 1 and the predicted probability is 0.1, the log loss is high; when the true value is 1 and the predicted probability is 0.9, the log loss is low.
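Concretely, the binary log loss for a single example is -(y*log(p) + (1-y)*log(1-p)). A small sketch of my own (sklearn.metrics.log_loss computes the averaged version over a dataset):

```python
import numpy as np

def binary_log_loss(y_true: int, p: float) -> float:
    """Log loss for one example: -(y*log(p) + (1-y)*log(1-p))."""
    return float(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

print(binary_log_loss(1, 0.1))  # ~2.303: confidently wrong -> high loss
print(binary_log_loss(1, 0.9))  # ~0.105: confidently right -> low loss
```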
Log transformation in text classification (natural language processing)
We use the tf-idf method to encode text data for machine learning models. Tf-idf applies a log transform to the inverse document frequency, so a word that appears in every single document is effectively zeroed out, while a word that appears in very few documents gets an even larger weight than its raw count.
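A rough sketch of the idf term (my own illustration; real implementations such as scikit-learn's TfidfVectorizer use a smoothed variant, log((1 + N) / (1 + df)) + 1):

```python
import numpy as np

def idf(n_docs: int, doc_freq: int) -> float:
    """Classic inverse document frequency: log(N / df)."""
    return float(np.log(n_docs / doc_freq))

print(idf(1000, 1000))  # 0.0: a word in every document is zeroed out
print(idf(1000, 2))     # ~6.21: a rare word gets a large weight
```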
Figure: the TF-IDF weighting formula.

Please share this article if it helped you understand how important the log is to machine learning. Do comment if you have any questions.
GOOD DAY!
References:
1. https://data.library.virginia.edu/interpreting-log-transformations-in-a-linear-model/
2. http://wiki.fast.ai/index.php/Log_Loss
Original article: https://medium.com/analytics-vidhya/log-transform-for-positivity-d3e1f183c804