Data Preprocessing on the Titanic: Understanding Data Preprocessing with the Titanic Dataset

Published: 2023/11/29

What is Data Pre-Processing?

As covered in my last blog post, data preprocessing is a data mining technique that transforms raw data into an understandable format. Real-world data is often incomplete, inconsistent, lacking in certain behaviors or trends, and likely to contain many errors. Data preprocessing is an established way of resolving such issues, and it prepares raw data for further processing.

In this post we will implement data preprocessing on a real data set. I decided to use the Titanic data set, which I downloaded from Kaggle. Here is the link to get it: https://www.kaggle.com/c/titanic-gettingStarted/data

Note: Kaggle provides 2 data sets, the train and the test data set, so we will use both of them in this process.

What is the expected outcome?

The Titanic shipwreck was a massive disaster, so we will implement data preprocessing on this data set to learn the number of survivors and their details.

I will show you how to apply data preprocessing techniques to the Titanic data set, with a few of my own ideas mixed in.

So let’s get started…


Importing all the important libraries

After loading the data sets onto our system, we first import the libraries needed for the work. In my case I imported NumPy, pandas, and Matplotlib.

# importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Importing dataset using Pandas

To work on the data, you can load the CSV either in Excel or in pandas. I will load the CSV data in pandas, then use df.shape and df.head() to view the data in the Jupyter notebook.

# importing dataset using pandas
df = pd.read_csv(r'C:\Users\KIIT\Desktop\Internity Internship\Day 4 task\train.csv')
df.shape
df.head()

# Taking a look at the data format below
df.info()

Let's take a look at the data output that we get from the above code snippets:

If you look carefully at the df.info() summary above, there are 891 rows in total. Age has only 714 non-null values (177 missing), Embarked has 2 missing, and Cabin is missing for most rows. Columns of object dtype are non-numeric, so we have to find a way to encode them as numerical values.
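The same check can be made explicit by counting missing values per column rather than reading them off the info() summary. The following is a minimal sketch on a made-up toy frame (the `toy` name and its values are assumptions for illustration, not rows from train.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Number of missing (NaN) entries in each column
missing_per_column = toy.isna().sum()

# Columns that are not numeric and will need encoding later
non_numeric = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

print(missing_per_column)
print(non_numeric)
```

On the real data frame, the same two expressions applied to df would list the gaps in Age, Cabin, and Embarked directly.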

Viewing the columns in the dataset

We print df.columns to list all the columns in this dataset, as a reference for the kind of data we are working with.

# Taking a look at all the columns in the data set
print(df.columns)

Defining values for independent and dependent data

Here we declare X and y for our independent and dependent data.

# independent data: every column except the target
X = df.drop(columns=['Survived']).values
# dependent data: the target column, Survived
y = df['Survived'].values

Dropping columns which are not useful

Let's drop some of the columns which may not contribute much to our machine learning model, such as Name, Ticket, and Cabin.

We will drop these 3 columns and then take a look at the newly generated data.

# Dropping columns which are not useful; we drop 3 of them here
cols = ['Name', 'Ticket', 'Cabin']
df = df.drop(cols, axis=1)

# Taking a look at the newly formed data format below
df.info()

Dropping rows having missing values

Next, if we want, we can drop all rows in the data that have missing values (NaN), as the code shows:

# Dropping the rows that have missing values
df = df.dropna()
df.info()

Problem with dropping rows having missing values

After dropping rows with missing values, the dataset shrinks from 891 rows to 712, which means we are wasting data. Machine learning models need training data to perform well, so we should preserve the data and use as much of it as we can. We will come back to this later.
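To see how much data dropna() costs, it helps to compare row counts before and after. This is a small sketch on an invented toy frame (the `toy` name and values are assumptions); on the Titanic data the same ratio would be 712 / 891, roughly 80% of the rows kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing ages; values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10],
})

# dropna() removes every row that has at least one NaN
dropped = toy.dropna()
kept_fraction = len(dropped) / len(toy)

print(f"rows before: {len(toy)}, after dropna: {len(dropped)}")
print(f"fraction kept: {kept_fraction:.2f}")
```

Note that the two discarded rows still carried valid Fare values, which is exactly the information lost by dropping whole rows.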

Creating Dummy Variables

Now we convert the Pclass, Sex, and Embarked columns into dummy (one-hot) columns in pandas and drop the originals after conversion.

# Creating dummy variables
dummies = []
cols = ['Pclass', 'Sex', 'Embarked']
for col in cols:
    dummies.append(pd.get_dummies(df[col]))
titanic_dummies = pd.concat(dummies, axis=1)

Looking at the result, the 3 original columns have been transformed into 8 dummy columns: 1, 2, 3 represent the passenger class, female and male the sex, and C, Q, S the port of embarkation.
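As an illustration of what pd.get_dummies produces, here is a minimal sketch on an invented toy frame. It differs from the code above in one respect, which is my own assumption rather than the author's method: it passes prefix=col so that the new column names record which original column they came from.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Sex": ["male", "female", "female", "male"],
})

# One block of indicator columns per categorical column, joined column-wise
encoded = pd.concat(
    [pd.get_dummies(toy[col], prefix=col) for col in toy.columns],
    axis=1,
)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one active indicator per original column, so the categorical information is preserved in numeric form.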

Finally, we concatenate the dummy columns to the original data frame column-wise.

# Combining with the original dataset
df = pd.concat((df, titanic_dummies), axis=1)

Now that we have converted the Pclass, Sex, and Embarked values into columns, we drop the now-redundant original columns from the data frame and take a look at the new data set.

df = df.drop(['Pclass', 'Sex', 'Embarked'], axis=1)
df.info()

Taking Care of Missing Data

All is good except Age, which has lots of missing values. We can either fill the missing ages with the median or use interpolate(). pandas has an interpolate() function that replaces missing NaN values with interpolated ones (by default, linear interpolation between the neighboring rows).

# Taking care of the missing data with the interpolate function
df['Age'] = df['Age'].interpolate()
df.info()

Now let's observe the data columns. Notice that Age no longer has missing values; the gaps are now filled with interpolated values.
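The difference between the two options mentioned above, median filling and interpolate(), is easiest to see on a short series. A minimal sketch with invented ages (the `ages` series is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical ages with gaps; values are made up.
ages = pd.Series([20.0, np.nan, 30.0, np.nan, np.nan, 60.0])

# Linear interpolation: each gap is filled along the line between its neighbors
interpolated = ages.interpolate()

# Median imputation: every gap receives the same value, the median of known ages
median_filled = ages.fillna(ages.median())

print(interpolated.tolist())
print(median_filled.tolist())
```

Interpolation depends on row order, so it only reflects a meaningful trend when neighboring rows are related; median filling ignores order entirely. Which is preferable for passenger ages is a judgment call.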

Converting the data frame to NumPy

Now that we have converted all the data to numeric values, it's time to prepare the data for machine learning models. This is where scikit-learn and NumPy come into play:

X = input set (the data frame now has 14 columns; 13 remain as features once Survived is removed)
y = output (lowercase y), in this case 'Survived'

Now we convert our data frame from pandas to NumPy and assign the input and output.

# Convert the data frame to NumPy arrays; Survived is the output
X = df.values
y = df['Survived'].values
# Remove column 1 (Survived) from the inputs
X = np.delete(X, 1, axis=1)

Dividing data set into training set and test set

Now that X and y are ready, let's split the dataset into a 70% training set and a 30% test set using scikit-learn's model_selection module, as in the code, followed by 4 print calls:

# Dividing data set into training set and test set (most important step)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
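The four print calls are not shown in the post. One plausible use, and this is my assumption, is to print the shape of each returned array to confirm the 70/30 split. A sketch on toy arrays (`X_toy` and `y_toy` are hypothetical, not the Titanic arrays):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Same arguments as above: 30% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

# One print per returned array to confirm the split sizes
print(X_tr.shape)
print(X_te.shape)
print(y_tr.shape)
print(y_te.shape)
```

Fixing random_state makes the shuffle reproducible, so the same rows land in each set on every run.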

Feature Scaling

Feature scaling is an important step of data preprocessing. It puts all features on the same scale; after standardization, most values usually lie between -3 and +3.

In our data set some fields have small values and some have large values. If we apply a machine learning model without feature scaling, its predictions can suffer, because features with small values are dominated by features with large values. So before applying a model we have to perform feature scaling.

We can perform feature scaling in two ways.

I. Standardization: x = (x - mean(X)) / standard deviation(X)

II. Normalization: x = (x - min(X)) / (max(X) - min(X))
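Both formulas can be checked by hand with NumPy. This is a minimal sketch on invented values (the `fares` array is an assumption for illustration); it uses the population standard deviation, which is also what scikit-learn's StandardScaler uses.

```python
import numpy as np

# Hypothetical feature values; made up for illustration.
fares = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# I. Standardization: subtract the mean, divide by the standard deviation
standardized = (fares - fares.mean()) / fares.std()

# II. Normalization (min-max): rescale into the range [0, 1]
normalized = (fares - fares.min()) / (fares.max() - fares.min())

print(standardized)  # mean 0, standard deviation 1
print(normalized)    # smallest value maps to 0, largest to 1
```

Standardized values are not bounded, while min-max normalized values always fall between 0 and 1.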

# Using the concept of feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])

That's all for today, guys!

This is the final outcome of the whole process. For more blogs like this, stay tuned!

Translated from: https://medium.com/all-about-machine-learning/understanding-data-preprocessing-taking-the-titanic-dataset-ebb78de162e0
