Part 1: Titanic - Basics of Data Analysis
My goal was to get a better understanding of how to work with tabular data, so I challenged myself and started with the Titanic project. I think this was an excellent way to learn the basics of data analysis with Python.
You can find the competition here: https://www.kaggle.com/c/titanic
I really recommend trying it yourself if you want to learn how to analyze data and build machine learning models.
I started by importing the packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Pandas is a great package for tabular data analysis. NumPy provides a high-performance multidimensional array object and tools for working with these arrays. Matplotlib helps you generate plots, histograms, power spectra, bar charts, etc., with just a few lines of code. Seaborn is built on top of Matplotlib and can be used to create attractive and informative statistical graphics.
After loading these packages I loaded the data:
df = pd.read_csv("train.csv")
Then I had a quick look at the data:
df.head()
# This prints the first 5 rows of the table
# If you want to print 10 rows of the table instead of 5, then use
df.head(10)
Screenshot of the first rows
df.tail()
# This prints the last 5 rows of the table
I recommend starting with a look at the data so that you can be sure everything is as it should be. This is how you can avoid stupid mistakes in further analysis.
df.shape
# This prints the number of rows and columns
It is a good habit to print out the shape of the data at the beginning so you can check the number of rows and columns and later confirm that you haven't lost any data during the analysis.
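As a rough illustration of that habit, the sketch below compares the shape before and after a step that can drop rows. It uses a tiny made-up DataFrame as a stand-in for train.csv, and the dropna step is only an assumed example of such an operation, not something from the original analysis.

```python
import pandas as pd

# Hypothetical miniature stand-in for train.csv; the values are made up.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass": [3, 1, 3, 2, 1],
    "Age": [22.0, 38.0, None, 35.0, 54.0],
})

rows_before, cols_before = df.shape

# An example of a step that changes the row count:
# dropping passengers whose Age is missing.
df_clean = df.dropna(subset=["Age"])

rows_after, cols_after = df_clean.shape
print(f"rows: {rows_before} -> {rows_after}, columns: {cols_before} -> {cols_after}")
```

Comparing the two shapes makes it obvious when a cleaning step removed more rows than expected.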
Analyze the data
Then I continued to look at the data by counting the values. This gave me a lot of information about the content of the data.
df['Pclass'].value_counts()
# Prints the number of passengers for each class value
The number of persons in each class. 3rd class was the most popular.
I prefer using percentages to showcase values. It is easier to understand the values in percentages.
df['Pclass'].value_counts(normalize=True)
# Same as above, but with normalize=True each value is returned as a share of the total (0.55 means 55%)
55% of people were in 3rd class.
I counted the values for each column separately. In the future, I want to challenge myself to write a function that prints the value counts for several columns at once, but that was outside the scope of this project.
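One possible shape for such a helper is sketched below. The function name value_shares and the small sample DataFrame are hypothetical and only serve the illustration; the real train.csv has many more rows and columns.

```python
import pandas as pd

def value_shares(df, columns):
    """Return a dict mapping each column name to its value shares (fractions of 1)."""
    return {col: df[col].value_counts(normalize=True) for col in columns}

# Hypothetical miniature stand-in for train.csv; the values are made up.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 2, 3],
    "Sex": ["male", "female", "female", "male", "male"],
})

shares = value_shares(df, ["Pclass", "Sex"])
for col, share in shares.items():
    print(f"--- {col} ---")
    # Multiply by 100 to show percentages instead of fractions
    print((share * 100).round(1))
```

Returning the results in a dict, rather than only printing them, keeps the numbers available for later steps.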
I also wanted to understand the values of different columns, so I used the describe() method for that.
df['Fare'].describe()
# describe() shows basic statistics such as count, mean, minimum and maximum values
"Fare" column values
Here you can see, for example, that the minimum ticket price was $0.00 and the maximum ticket price was $512.33.
I made several cross-tabulations to understand which variables were the main determinants of survival.
pd.crosstab(df['Survived'], df['Sex'])
# Cross-tabulates survival against sex (passenger counts)
Here I also recommend using percentages instead of raw counts:
pd.crosstab(df['Survived'], df['Sex'], normalize=True)
# With normalize=True, each cell is a share of all passengers (multiply by 100 for percentages)
Same as above, just in percentages.
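Cross-tabulating different columns gives you information about possible associations between variables, for example, sex and survival. Because normalize=True divides every cell by the total number of passengers, the table shows that women who survived make up 26% of all passengers, while men who did not survive form the largest group at 52% of all passengers.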
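With normalize=True, every cell of the crosstab is a share of all passengers. To read the survival rate within each sex instead, one option is to normalize per column with normalize='columns'. The sketch below contrasts the two on a small made-up DataFrame, so its numbers do not match the real train.csv.

```python
import pandas as pd

# Hypothetical miniature stand-in for train.csv; the values are made up.
df = pd.DataFrame({
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
    "Sex": ["female", "female", "female", "female",
            "male", "male", "male", "male"],
})

# Each cell is a share of ALL passengers (the whole table sums to 1)
overall = pd.crosstab(df["Survived"], df["Sex"], normalize=True)

# Each cell is a share of passengers of that sex (each column sums to 1)
by_sex = pd.crosstab(df["Survived"], df["Sex"], normalize="columns")

print(overall)
print(by_sex)
```

In this toy data, surviving women are 3 of 8 passengers (0.375 in the first table) but 3 of 4 women (0.75 in the second), which is the difference between "share of all passengers" and "survival rate among women".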
Visualize the data
It is nice to have numerical values in tables, but visualized data is easier to understand, at least for me. This is why I plotted histograms and bar charts, which also taught me how to visualize data. Here are a few examples:
df.hist(column='Age')
In this histogram, you can see that passengers were mostly 20–40 years old.
I used the seaborn library for the bar charts.
sns.countplot(x='Sex', hue='Survived', data=df);
More females survived than males.
Also, I used a heatmap to see the correlation between different columns.
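# Correlate the numeric columns only; recent pandas versions raise an error on text columns such as Name or Sex
corrmat = df.select_dtypes(include='number').corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, annot=True, square=True, annot_kws={'size': 15});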
The heatmap shows a clear negative correlation between fare and class: when one increases, the other decreases. This is logical because ticket prices in 1st class (Pclass = 1) are higher than in 3rd class (Pclass = 3).
If we focus on the correlations between survival and the other values, we see a positive correlation between survival and fare: the probability of survival was higher for passengers who paid a higher ticket price.
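To focus on survival alone, one option is to take just the Survived column of the correlation matrix and sort it. The sketch below does this on a small made-up DataFrame, so the printed coefficients are illustrative only and differ from those of the real train.csv.

```python
import pandas as pd

# Hypothetical miniature stand-in for train.csv; the values are made up.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 1, 3, 2, 3],
    "Fare": [7.25, 71.28, 53.10, 8.05, 13.00, 7.90],
    "Sex": ["male", "female", "female", "male", "female", "male"],
})

# Keep numeric columns only, then read off the correlations with Survived
corr_with_survived = (
    df.select_dtypes(include="number")
    .corr()["Survived"]
    .drop("Survived")              # the correlation of a column with itself is always 1
    .sort_values(ascending=False)
)
print(corr_with_survived)
```

Sorting puts the most positively correlated columns at the top and the most negatively correlated ones at the bottom, which is easier to scan than the full heatmap.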
You can find the project on GitHub. Please feel free to try it yourself and comment if there is something that needs clarifying!
Thank you to the highly trained monkey (Risto Hinno) for motivating and inspiring me!
Translated from: https://medium.com/swlh/part-1-titanic-basic-of-data-analysis-ab3025d29f6e