Are Outliers Ruining Your Machine Learning Predictions? Search for an Optimal Solution
Inside AI
In the world of data, we all love the Gaussian distribution (also known as the normal distribution). In real life, we seldom have normally distributed data: it is skewed, has missing data points, or contains outliers.
As I mentioned in my earlier article, the strength of Scikit-learn inadvertently works to its disadvantage. Machine learning developers, especially those with relatively little experience, implement an inappropriate algorithm for prediction without grasping the particular algorithm's salient features and limitations. We saw earlier why we should not use the decision tree regression algorithm for predictions that involve extrapolating the data.
The success of any machine learning modelling always starts with understanding the existing dataset on which the model will be trained. It is imperative to understand the data well before starting any modelling. I will even go so far as to say that the prediction accuracy of a model is directly proportional to the extent to which we know the data.
Objective
In this article, we will see the effect of outliers on various regression algorithms available in Scikit-learn and learn which regression algorithm is most appropriate to apply in such a situation. We will start with a few techniques for understanding the data and then train a few of the Sklearn algorithms on it. Finally, we will compare the training results of the algorithms and identify the algorithms potentially best suited to data with outliers.
Training Dataset
The training data consists of 200,000 records with 3 features (independent variables) and 1 target value (dependent variable). The true coefficients of feature 1, feature 2, and feature 3 are 77.74, 23.34, and 7.63, respectively.
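The article does not show how this dataset was generated, but for readers who want to reproduce the experiment, here is a minimal sketch of how a comparable dataset could be simulated. The noise scales, outlier fraction, and column names are my assumptions, not the author's original values.

# Hypothetical data-generation sketch: 3 features with known true
# coefficients, plus a small fraction of extreme target outliers.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200_000
features = rng.normal(size=(n, 3))
true_coef = np.array([77.74, 23.34, 7.63])
target = features @ true_coef + rng.normal(scale=5, size=n)

# Corrupt ~0.5% of targets with large deviations to act as outliers (assumed fraction).
outlier_idx = rng.choice(n, size=n // 200, replace=False)
target[outlier_idx] += rng.normal(scale=5_000, size=outlier_idx.size)

sim = pd.DataFrame(features, columns=["Feature 1", "Feature 2", "Feature 3"])
sim["Target"] = target
sim.to_excel("Outlier Regression.xlsx", index=False)  # file name taken from the article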
Training Data — 3 Independent and 1 Dependent Variable

Step 1 - First, we will import the packages required for data analysis and regression.
We will be comparing HuberRegressor, LinearRegression, Ridge, SGDRegressor, ElasticNet, PassiveAggressiveRegressor, and Linear Support Vector Regression (SVR), hence we import the respective packages.
Most of the time, a few data points are missing from the training data. If a particular feature has a high proportion of null values, it may be better not to consider that feature at all. Otherwise, if a feature is missing only a few data points, we can either drop those particular records from the training data or replace the missing values with the mean, the median, or a constant value. We will import SimpleImputer to fill in the missing values.
We will import the variance inflation factor to find the severity of multicollinearity among the features. We will need Matplotlib and seaborn to draw various plots for analysis.
from sklearn.linear_model import HuberRegressor, LinearRegression, Ridge, SGDRegressor, ElasticNet, PassiveAggressiveRegressor
from sklearn.svm import LinearSVR
import pandas as pd
from sklearn.impute import SimpleImputer
from statsmodels.stats.outliers_influence import variance_inflation_factor
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Step 2 - In the code below, the training data containing 200,000 records is read from an Excel file into a pandas DataFrame called "RawData". The independent variables are saved into a new DataFrame.
RawData = pd.read_excel("Outlier Regression.xlsx")
Data = RawData.drop(["Target"], axis=1)
Step 3 - Now we will start by getting a sense of the training data and understanding it. In my opinion, a heatmap is a good option for understanding the relationships between the different features.
sns.heatmap(Data.corr(), cmap="YlGnBu", annot=True)
plt.show()
It shows that none of the independent variables (features) is closely related to another. In case you would like to learn more about the approach and selection criteria for independent variables in regression algorithms, please read my earlier article on it:
How to identify the right independent variables for Machine Learning Supervised Algorithms?
Step 4 - After getting a sense of the correlation among the features in the training data, we will next look into the minimum, maximum, median, etc. of each feature's value range. This will help us ascertain whether there are any outliers in the training data, and to what extent. The code below draws boxplots for all the features.
sns.boxplot(data=Data, orient="h", palette="Set2")
plt.show()
In case you don't know how to read a box plot, please refer to Wikipedia to learn more about it. The feature values are spread across a wide range, with large deviations from the median value. This confirms the presence of outlier values in the training dataset.
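To quantify what the boxplots show, one could count the points falling outside the 1.5 × IQR whiskers per feature, which is the same rule the boxplot itself uses to flag outliers. This is a minimal sketch of my own, not part of the original article:

# Sketch: count values outside the 1.5*IQR whiskers for each feature,
# mirroring the outlier rule used by the boxplot.
Q1 = Data.quantile(0.25)
Q3 = Data.quantile(0.75)
IQR = Q3 - Q1
outside = (Data < Q1 - 1.5 * IQR) | (Data > Q3 + 1.5 * IQR)
print(outside.sum())  # outlier count per feature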
Step 5 - We will check whether there are any null values in the training data and take any required action before going anywhere near modelling.
print(Data.info())

Here we can see that there are in total 200,000 records in the training data and that all three features have a few values missing. For example, feature 1 has 60 values (200,000 − 199,940) missing.
Step 6 - We use SimpleImputer to fill the missing values of a feature with the mean of that feature's other records. In the code below, we use strategy="mean" for this. Scikit-learn provides different strategies, viz. mean, median, most frequent, and a constant value, for replacing missing values. I suggest you explore the effect of each strategy on the trained model as a learning exercise; a sketch for doing so follows below.
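As a starting point for that exercise, here is a minimal sketch (my own, not from the article) that applies each strategy and prints the resulting column means:

# Sketch: apply each imputation strategy and compare the resulting feature means.
# With numeric data, strategy="constant" defaults to fill_value=0.
for strat in ["mean", "median", "most_frequent", "constant"]:
    imp = SimpleImputer(strategy=strat)
    filled = pd.DataFrame(imp.fit_transform(Data), columns=Data.columns)
    print(strat, filled.mean().round(2).tolist())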
In the code below, we create an instance of SimpleImputer with the "mean" strategy and then fit the training data to it to calculate the mean of each feature. The transform method is then used to fill the missing values with those means.
imputer = SimpleImputer(strategy="mean")
imputer.fit(Data)
TransformData = imputer.transform(Data)
X = pd.DataFrame(TransformData, columns=Data.columns)
Step 7 - It is good practice to check the features once more after replacing the missing values, to ensure we do not have any null (blank) values remaining in our training dataset.
print(X.info())

We can see that all the features now have non-null, i.e. non-blank, values for all 200,000 records.
Step 8 - Before we start training the algorithms, let us check the variance inflation factor (VIF) among the independent variables. The VIF quantifies the severity of multicollinearity in an ordinary least squares regression analysis. It provides an index that measures how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is increased because of collinearity. I encourage you all to read the Wikipedia page on the variance inflation factor to gain a good understanding of it.
vif = pd.DataFrame()
vif["features"] = X.columns
vif["vif_Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
In the above code, we calculate the VIF of each independent variable and print it. In general, we should aim for a VIF of less than 10 for the independent variables. We saw earlier in the heatmap that none of the variables is highly correlated, and the same is reflected in the VIF index of the features.
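For intuition, the VIF of a feature equals 1 / (1 − R²), where R² comes from regressing that feature on all the others. A hand-rolled sketch of that definition follows; note that statsmodels' variance_inflation_factor regresses on the other columns exactly as given (without adding an intercept), so its values can differ slightly from this intercept-fitting version:

# Sketch: VIF computed directly from its definition, VIF = 1 / (1 - R^2).
from sklearn.linear_model import LinearRegression

def vif_by_hand(df, col):
    others = df.drop(columns=[col])
    r2 = LinearRegression().fit(others, df[col]).score(others, df[col])
    return 1.0 / (1.0 - r2)

print({c: round(vif_by_hand(X, c), 2) for c in X.columns})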
Step 9 - We will extract the target, i.e. the dependent variable values, from the RawData DataFrame and save it in a data series.
y = RawData["Target"].copy()

Step 10 - We will evaluate the performance of various regressors, viz. HuberRegressor, LinearRegression, Ridge, and others, on the outlier dataset. In the code below, we create instances of the various regressors.
Huber = HuberRegressor()
Linear = LinearRegression()
SGD = SGDRegressor()
Ridge = Ridge()
SVR = LinearSVR()
Elastic = ElasticNet(random_state=0)
PassiveAggressiveRegressor = PassiveAggressiveRegressor()
Step 11 - We declare a list containing the regressor instances so that we can pass them in sequence through a for loop later.
estimators = [Linear, SGD, SVR, Huber, Ridge, Elastic, PassiveAggressiveRegressor]

Step 12 - Finally, we will train the models in sequence on the training dataset and print the feature coefficients calculated by each model.
for i in estimators:
    reg = i.fit(X, y)
    print(str(i) + " Coefficients:", np.round(i.coef_, 2))
    print("**************************")
We can observe a wide range of coefficients calculated by the different models, depending on their optimisation and regularisation factors. The calculated coefficient for feature 1 alone varies from 29.31 to 76.88.
Due to the few outliers in the training dataset, some models, like linear and ridge regression, predicted coefficients nowhere near the true coefficients. The Huber regressor is quite robust to outliers: it ensures the loss function is not heavily influenced by the outliers, while not completely ignoring their effect the way TheilSenRegressor and RANSACRegressor do. Linear SVR also offers more options in the selection of penalty and loss functions, and it performed better than the other models.
A learning action for you - We trained different models on a training dataset containing outliers and then compared the predicted coefficients with the actual coefficients. I encourage you all to follow the same approach and compare the prediction metrics, viz. the R2 score, mean squared error (MSE), and RMSE, of the different models trained on the outlier dataset; a starting sketch follows below.
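A minimal sketch of that exercise, evaluated on the training data itself since the article uses no separate test set (my assumption):

# Sketch: compare R2, MSE and RMSE for each fitted model.
from sklearn.metrics import r2_score, mean_squared_error

for est in estimators:
    pred = est.predict(X)
    mse = mean_squared_error(y, pred)
    print(type(est).__name__,
          "R2:", round(r2_score(y, pred), 4),
          "MSE:", round(mse, 2),
          "RMSE:", round(mse ** 0.5, 2))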
Hint — You may be surprised by the R2 (coefficient of determination) regression scores of the models compared to the coefficient prediction accuracy we have seen in this article. In case you stumble at any point, feel free to reach out to me.
Key Takeaway
As mentioned in my earlier article, and as I keep stressing, the main focus for us machine learning practitioners is to consider the data, the prediction objective, and the algorithms' strengths and limitations before starting the modelling. Every additional minute we spend on understanding the training data translates directly into prediction accuracy with the right algorithm. We don't want to use a hammer to unscrew, or a screwdriver to drive a nail into the wall.
If you want to learn more about a structured approach to identifying the right independent variables for machine learning supervised algorithms, please refer to my article on this topic.
"""Full Code"""from sklearn.linear_model import HuberRegressor, LinearRegression ,Ridge ,SGDRegressor, ElasticNet, PassiveAggressiveRegressorfrom sklearn.svm import LinearSVR
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor
import numpy as npRawData=pd.read_excel("Outlier Regression.xlsx")
Data=RawData.drop(["Target"], axis=1)sns.heatmap(Data.corr(), cmap="YlGnBu", annot=True)
plt.show()sns.boxplot(data=Data, orient="h",palette="Set2")
plt.show()print (Data.info())print(Data.describe())imputer = SimpleImputer(strategy="mean")
imputer.fit(Data)
TransformData = imputer.transform(Data)
X=pd.DataFrame(TransformData, columns=Data.columns)
print (X.info())vif = pd.DataFrame()
vif["features"] = X.columns
vif["vif_Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
y=RawData["Target"].copy()Huber = HuberRegressor()
Linear = LinearRegression()
SGD= SGDRegressor()
Ridge=Ridge()
SVR=LinearSVR()
Elastic=ElasticNet(random_state=0)
PassiveAggressiveRegressor= PassiveAggressiveRegressor()estimators = [Linear,SGD,SVR,Huber, Ridge, Elastic,PassiveAggressiveRegressor]for i in estimators:
reg= i.fit(X,y)
print(str(i)+" Coefficients:", np.round(i.coef_,2))
print("**************************")
Translated from: https://towardsdatascience.com/are-outliers-ruining-your-machine-learning-predictions-search-for-an-optimal-solution-c81313e994ca