Data Preprocessing on the Titanic: Understanding Data Preprocessing with the Titanic Dataset
What is Data Pre-Processing?
We know from my last blog that data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues; it prepares raw data for further processing.
So in this blog we will learn how to implement data pre-processing on a data set. I have decided to do my implementation using the Titanic data set, which I downloaded from Kaggle. Here is the link to get this dataset: https://www.kaggle.com/c/titanic-gettingStarted/data
Note: Kaggle provides 2 datasets, the train and the test dataset, so we will use both of them in this process.
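For instance, a minimal sketch of loading both files (this assumes the two CSVs sit in your working directory; adjust the paths to wherever you saved the Kaggle downloads):

import pandas as pd

# Hypothetical paths - point these at your local copies of the Kaggle files
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')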
What is the expected outcome?
The Titanic shipwreck was a massive disaster, so we will implement data pre-processing on this data set to learn the number of survivors and their details.
I will show you how to apply data preprocessing techniques to the Titanic dataset, with a tinge of my own ideas mixed in.
So let’s get started…
Importing all the important libraries
Firstly, after loading the data sets into our system, we will import the libraries needed to perform the operations. In my case I imported the NumPy, Pandas, and Matplotlib libraries.
# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Importing the dataset using Pandas
To work on the data, you can load the CSV either in Excel or in pandas; I will load the CSV data in pandas. Then we will also use a function to view that data in the Jupyter notebook.
# Importing the dataset using pandas
df = pd.read_csv(r'C:\Users\KIIT\Desktop\Internity Internship\Day 4 task\train.csv')
df.shape
df.head()
# Taking a look at the data format below
df.info()
Let's take a look at the output that we get from the above code snippets:
If you carefully observe the pandas summary above, there are 891 rows in total, but Age shows only 714 non-null values (meaning 177 are missing), Embarked has 2 missing, and Cabin is missing a lot of values as well. Object data types are non-numeric, so we have to find a way to encode them as numerical values.
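A quick way to see those missing counts directly (my own addition, not part of the original walkthrough) is pandas' isnull():

# Count missing values per column; expect Age=177, Cabin=687, Embarked=2
print(df.isnull().sum())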
Viewing the columns in the particular dataset
We use a function to view all the columns used in this dataset, for a better sense of the kind of data we are working with.
# Taking a look at all the columns in the data set
print(df.columns)
Defining values for independent and dependent data
Here we will declare the values of X and y for our independent and dependent data.
# Independent data: every column after PassengerId and Survived
X = df.iloc[:, 2:].values
# Dependent data: 'Survived' is the second column in Kaggle's train.csv,
# so we take it as y (the original snippet's df.iloc[:, -1] would grab Embarked)
y = df.iloc[:, 1].values
Dropping Columns which are not useful
Let's try to drop some of the columns which may not contribute much to our machine learning model, such as Name, Ticket, and Cabin.
So we will drop 3 columns and then we will take a look at the newly generated data.
# Dropping columns which are not useful; we drop 3 of them here according to our convenience
cols = ['Name', 'Ticket', 'Cabin']
df = df.drop(cols, axis=1)
# Taking a look at the newly formed data format below
df.info()
Dropping rows having missing values
Next, if we want, we can drop all rows in the data that have missing values (NaN). You can do it as the code shows:
# Dropping the rows that have missing values
df = df.dropna()
df.info()
Problem with dropping rows having missing values
After dropping rows with missing values we find that the dataset is reduced from 891 to 712 rows, which means we are wasting data. Machine learning models need data to train on to perform well, so we should preserve the data and make use of as much of it as we can. We will see how later.
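As a middle ground (a sketch of an alternative, not what we do next), dropna can target specific columns, so we could drop only the 2 rows missing Embarked while keeping the 177 rows that are merely missing Age:

# Drop only rows where 'Embarked' is missing; rows missing 'Age' survive
df_partial = df.dropna(subset=['Embarked'])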
Creating Dummy Variables
Now we convert Pclass, Sex, and Embarked into dummy columns in pandas, and drop the original columns after conversion.
# Creating dummy variables
dummies = []
cols = ['Pclass', 'Sex', 'Embarked']
for col in cols:
    dummies.append(pd.get_dummies(df[col]))
titanic_dummies = pd.concat(dummies, axis=1)
Looking at the result, we see the three original columns transformed into 8 dummy columns, where 1, 2, 3 represent the passenger class.
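One caveat with calling get_dummies on raw values: the Pclass dummies come out literally named 1, 2, 3. Passing a prefix (my variation on the snippet above) yields self-describing names such as Pclass_1, Sex_male, Embarked_S:

# Same dummies, but with prefixed column names for readability
titanic_dummies = pd.concat(
    [pd.get_dummies(df[col], prefix=col) for col in ['Pclass', 'Sex', 'Embarked']],
    axis=1)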
And finally we concatenate the dummies to the original data frame column-wise.
# Combining with the original dataset
df = pd.concat((df, titanic_dummies), axis=1)
Now that we have converted the Pclass, Sex, and Embarked values into dummy columns, we drop the redundant original columns from the data frame and take a look at the new data set.
df = df.drop(['Pclass', 'Sex', 'Embarked'], axis=1)
df.info()
Taking Care of Missing Data
All is good, except Age, which has lots of missing values. Let's either compute a median or interpolate() across all the ages to fill in those missing values. Pandas has an interpolate() function that replaces all the missing NaNs with interpolated values.
# Taking care of the missing data with the interpolate function
df['Age'] = df['Age'].interpolate()
df.info()
Now let's observe the data columns. Notice that Age is now complete, with the missing values filled in by interpolation.
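If you prefer the median option mentioned above over interpolation, it is a one-liner as well (an alternative sketch, not the route this post takes):

# Alternative: fill missing ages with the median age instead of interpolating
df['Age'] = df['Age'].fillna(df['Age'].median())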
Converting the data frame to NumPy
Now that we have converted all the data to numeric, it's time to prepare the data for machine learning models. This is where scikit-learn and NumPy come into play:
X = input set with 14 attributes
y = output (small y), in this case 'Survived'
Now we convert our dataframe from pandas to NumPy and assign the input and output.
# Using the 'Survived' values as the target, we convert the data frame to NumPy
X = df.values
y = df['Survived'].values
# Remove the 'Survived' column from X so the target is not among the inputs
X = np.delete(X, 1, axis=1)
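An equivalent and arguably more readable way to build X (my preference, not the post's code) is to drop the target by name before converting, which avoids hard-coding the column index:

# Alternative: drop 'Survived' by name, then convert to NumPy
X = df.drop('Survived', axis=1).values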
Dividing the data set into a training set and a test set
Now that we are ready with X and y, let's split the dataset into a 70% training set and a 30% test set using scikit-learn's model_selection, as in the code below and the 4 print functions after it:
# Dividing the data set into a training set and a test set (most important step)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
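The four print functions mentioned above did not survive extraction; presumably they just checked the shapes of the splits, along these lines (a hypothetical reconstruction):

# Hypothetical reconstruction of the four shape checks
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)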
Feature Scaling
Feature scaling is an important step of data preprocessing. It transforms the data so that all features lie on the same scale, usually roughly -3 to +3.
In our data set some fields have small values and some have large values. If we apply a machine learning model without feature scaling, our model's predictions suffer, because small-valued features are dominated by large-valued ones. So before applying the model we have to perform feature scaling.
We can perform feature scaling in two ways.
I. Standardization: x' = (x - mean(X)) / std(X)
II. Normalization: x' = (x - min(X)) / (max(X) - min(X))
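To make the two formulas concrete, here is a tiny worked example with made-up numbers (my own illustration):

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# Standardization: mean(x) = 25, std(x) ≈ 11.18
standardized = (x - x.mean()) / x.std()           # ≈ [-1.34, -0.45, 0.45, 1.34]

# Normalization: min(x) = 10, max(x) = 40
normalized = (x - x.min()) / (x.max() - x.min())  # [0.0, 0.333, 0.667, 1.0]

print(standardized)
print(normalized)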
# Using the concept of feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])
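Note the design choice in the snippet above: the scaler is fit on the training set only (fit_transform) and then merely applied to the test set (transform), so no statistics from the test data leak into the training process.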
That’s all for today guys!
This is the final outcome of the whole process. For more of such blogs, stay tuned!
翻譯自: https://medium.com/all-about-machine-learning/understanding-data-preprocessing-taking-the-titanic-dataset-ebb78de162e0
數(shù)據(jù)預(yù)處理 泰坦尼克號
總結(jié)
以上是生活随笔為你收集整理的数据预处理 泰坦尼克号_了解泰坦尼克号数据集的数据预处理的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 梦到死人复活是什么码
- 下一篇: vc6.0 绘制散点图_vc有关散点图的