Titanic Data-Set Predictive Analysis: Exploratory Data Analysis, a Case Study on the Titanic Data-Set (Part 1)
Imagine your group of friends has decided to spend the vacation travelling to an amazing destination, and you have been given the responsibility of finding one. Interesting? However interesting it may seem, choosing a single location that everyone agrees on is still a hectic task. Various factors need to be considered: the cost of travelling to the location should fit everyone's pocket, there should be proper accommodation options, and you also have to weigh the remoteness of the location, the activities available, the best time to visit and so on.
To learn about all these factors, you have to search the internet and gather information from various sources. After gathering all this information, you then have to compare the locations and find the right trade-off between these factors. This very activity of gathering, understanding and comparing data (possibly through visual representations) to help make better decisions is known as Exploratory Data Analysis.
In this article we'll be exploring the data of the legendary 'Titanic'. We will use the 'Titanic: Machine Learning from Disaster' data-set on Kaggle to dive deep into it. The main objective of that data-set is to predict the survival of the passengers based on the given attributes; here, however, we'll explore the data-set to find the hidden story that was buried along with the sinking of the Titanic. We'll unravel some amazing mysteries behind the sinking that you may hardly have heard of. So, put on your detective hat and grab your magnifying glass, 'coz we'll be exploring one of the most interesting data-sets in history!
The basic requirements for this project are a basic understanding of the Python language and of plotting libraries such as Matplotlib and Seaborn. If you don't know any of these, that's completely alright, 'coz by the end of this article you will have a basic understanding of them. I recommend using 'Jupyter Notebook' or the simpler and better option, 'Google Colab'. You can learn how to set up Google Colab by following the simple steps in this link: https://www.geeksforgeeks.org/how-to-use-google-colab/. After opening and setting up Google Colab, it's time to bring out your inner Data Scientist :)
I have also provided the link to the Google Colab Notebook containing all of this code.
To import the data from Kaggle into Google Colab, follow these steps:
1. Choose the kaggle.json file that you have downloaded.
2. Make a directory named kaggle and copy the kaggle.json file into it.
3. Change the permissions of the file and download the data-set.
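The original code cells for these steps are not reproduced here, so the following is only a minimal sketch of what they might look like. It assumes the Kaggle command-line tool reads the token from a hidden `.kaggle` folder in the home directory; the helper name `install_kaggle_token` is my own, not part of any library.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, home=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner.

    `install_kaggle_token` is a hypothetical helper written for this sketch.
    """
    home = Path(home) if home is not None else Path.home()
    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)   # step 2: make the directory
    target = kaggle_dir / "kaggle.json"
    shutil.copy(token_path, target)                 # step 2: copy the token into it
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)       # step 3: same effect as chmod 600
    return target


# In a Colab cell, step 1 is usually an upload widget:
#   from google.colab import files
#   files.upload()                        # choose kaggle.json
# followed by:
#   install_kaggle_token("kaggle.json")
#   !kaggle competitions download -c titanic
```

The Colab-only lines are left as comments so that the helper itself can also be run outside Colab.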
Make sure your Drive is mounted and the appropriate folders are created. In my case, the folders Projects > datasets were created; we'll move the downloaded data-set there to avoid importing it from Kaggle repeatedly.
Congratulations, you've successfully imported the data-set from Kaggle and stored it in your Google Drive. The next time we need the data-set, we can simply copy the path of the file and load it.
Now, let's start by importing the required libraries.
Before jumping into EDA, let's first load the data-set and take a quick glance at it. We may require libraries such as pandas, Matplotlib and Seaborn.
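The original import cell is not reproduced here. A likely minimal set, assuming only the libraries named in this article, would be something like:

```python
import pandas as pd               # tabular data handling (DataFrame)
import matplotlib.pyplot as plt   # base plotting library
import seaborn as sns             # statistical plots built on top of Matplotlib
```

The aliases `pd`, `plt` and `sns` are just the common conventions; any other names would work equally well.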
The data can be loaded as a pandas DataFrame, and 'data.shape' gives the dimensions of the data: 891 is the number of records and 12 is the number of attributes. The path to 'train.csv' may vary depending on your file's location. In Colab, you can copy the path to the file from the 'Files' section, right below the 'Table of Contents' section (as of 2020).
Output: (891, 12)
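Here is a small sketch of the loading step. To keep it runnable without the real file, it reads a three-row sample laid out like 'train.csv' from a string, so the shape it prints is smaller than the full (891, 12). The Drive path in the comment is a hypothetical example, not the actual location of your file.

```python
import io

import pandas as pd

# Three sample rows in the same layout as train.csv (a stand-in for the real file).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
"""

# With the real file you would pass its path instead, for example (hypothetical path):
#   data = pd.read_csv("/content/drive/MyDrive/Projects/datasets/train.csv")
data = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(data.shape)  # (number of rows, number of columns)
```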
Viewing an entire data-set at once can be confusing, so let's view a sample of the data. 'data.head()' gives the first 5 and 'data.tail()' gives the last 5 records/rows of the dataframe, in row order.
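As a quick illustration, here is how the two calls behave on a toy eight-row frame (made up for this sketch, with only three of the twelve columns):

```python
import pandas as pd

# Toy frame standing in for the Titanic data; only three columns for brevity.
data = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6, 7, 8],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3],
})

first_rows = data.head()   # first 5 rows by default
last_rows = data.tail()    # last 5 rows by default
top_three = data.head(3)   # an explicit row count can be passed

print(first_rows)
print(last_rows)
```

Both methods return new DataFrames, so they can be stored or chained like any other frame.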
Now, let's print the columns of the dataframe.
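A minimal sketch of this step, using an empty frame that only carries the twelve column names of the Titanic training data (so there is nothing to load):

```python
import pandas as pd

TITANIC_COLUMNS = [
    "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
    "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked",
]

# Empty stand-in frame; with the real data this would be the loaded DataFrame.
data = pd.DataFrame(columns=TITANIC_COLUMNS)

print(data.columns)                 # an Index object holding the column labels
column_names = list(data.columns)   # plain Python list, handy for looping
print(len(column_names))
```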
'data.info()' gives information about each attribute: its count of non-null (non-missing) values and its datatype. In this data-set, the attributes 'Age', 'Cabin' and 'Embarked' have some missing values. (These missing values will be processed in later modules.)
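To see what this looks like in practice, here is a sketch on a toy six-row frame with a few gaps. Counting the gaps with 'isna().sum()' is my own addition, not something the original notebook necessarily does. In this tiny sample 'Embarked' happens to be complete, unlike in the full data-set.

```python
import pandas as pd

# Toy frame with missing values (None) in Age and Cabin.
data = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None],
    "Cabin": [None, "C85", None, "C123", None, None],
    "Embarked": ["S", "C", "S", "S", "S", "Q"],
})

data.info()                          # prints non-null counts and dtypes per column

missing_counts = data.isna().sum()   # number of missing values per column
print(missing_counts)
```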
If the data-set contains numerical data, 'data.describe()' can be used to get the count, mean, standard deviation and five-number summary, i.e. minimum, 25% (Q1), 50% (median), 75% (Q3) and maximum, of each numerical attribute.
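A small sketch on a toy five-row frame shows the layout of the result. By default only the numerical columns are summarised, so the 'Sex' column is left out:

```python
import pandas as pd

# Toy frame: two numerical columns and one text column.
data = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

summary = data.describe()   # rows: count, mean, std, min, 25%, 50%, 75%, max
print(summary)

median_age = summary.loc["50%", "Age"]   # the 50% row is the median
print(median_age)
```

Individual statistics can be picked out with '.loc', as done for the median age in the last lines.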
Understanding the data
Okay, so we've seen samples of the data. But what does each of the attributes denote? The description of the attributes is provided on Kaggle itself, but I'll try to explain it here to give a better gist of it.
There are a total of 891 instances, each consisting of 12 attributes. Here's brief information about what the data consists of:
1. PassengerId: A unique id given to each passenger in the data-set.
2. Survived: Denotes whether the passenger survived or not.
Here,
- 0 = Not Survived
- 1 = Survived
3. Pclass: Represents the ticket class, which is also considered a proxy for socio-economic status (SES).
Here,
- 1 = Upper Class
- 2 = Middle Class
- 3 = Lower Class
4. Name: Name of the passenger.
5. Sex: Denotes the sex of the passenger, i.e. 'male' or 'female'.
6. Age: Denotes the age of the passenger.
Note: If the passenger is a baby, its age is represented as a fraction, e.g. 0.33. If the age is estimated, it is in the form xx.5, e.g. 18.5.
7. SibSp: Represents the number of siblings / spouses aboard the Titanic.
The data-set defines family relations in this way:
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)
8. Parch: Represents the number of parents / children aboard the Titanic.
The data-set defines family relations in this way:
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
- Some children travelled only with a nanny, therefore Parch = 0 for them.
9. Ticket: Represents the ticket number of the passenger.
10. Fare: Represents the passenger fare.
11. Cabin: Represents the cabin number.
12. Embarked: Represents the port of embarkation.
Here,
- C = Cherbourg
- Q = Queenstown
- S = Southampton
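The coded attributes above can be turned into readable labels while exploring. This is only an illustrative sketch on a toy six-row frame; the dictionary names and the 'readable' frame are my own and are not part of the data-set or the original notebook.

```python
import pandas as pd

# Code-to-label mappings taken from the attribute descriptions above.
SURVIVED_LABELS = {0: "Not Survived", 1: "Survived"}
PCLASS_LABELS = {1: "Upper Class", 2: "Middle Class", 3: "Lower Class"}
EMBARKED_LABELS = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# Toy frame with only the three coded columns.
data = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Embarked": ["S", "C", "S", "S", "S", "Q"],
})

# Series.map replaces each code with its label; the original frame is untouched.
readable = pd.DataFrame({
    "Survived": data["Survived"].map(SURVIVED_LABELS),
    "Pclass": data["Pclass"].map(PCLASS_LABELS),
    "Embarked": data["Embarked"].map(EMBARKED_LABELS),
})
print(readable)
```

Keeping the labels in a separate frame leaves the numeric codes available for any later modelling.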
Okay, now that we have understood the data, let's move on to understand the relations between the attributes, find out which factors played a major role in the survival of a passenger, and also predict: if you had been on the Titanic, would you have survived? Click on the link to the next story to find out!
Link to the Notebook: Click Here
Link to Part 2 of the Blog: Click Here
Translated from: https://medium.com/@bapreetam/exploratory-data-analysis-a-case-study-on-titanic-data-set-part-1-d1376b2a6cef