Titanic: Data Analysis (Single-Factor and Multi-Factor Analysis)

Published: 2023/12/31

I. The question
How strongly is survival related to other factors (sex, age, passenger class)?

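II. Preparing the data
Data source: the classic Titanic dataset, which many people use for teaching or practice. It is available from Kaggle as a machine learning dataset (https://www.kaggle.com/c/titanic/data). Kaggle provides three files; here we use the train file.

Tool: Jupyter Notebook, which makes it easier to show the reasoning and the steps of the analysis.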

Import the Python data analysis libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Render plots inline in the notebook page
%matplotlib inline

Load the data

df = pd.read_csv(r'C:\Users\jessie\train.csv', engine='python')

Inspect the data
Number of rows and columns

df.shape
# Output: (891, 12)

View the data summary

df.info()
# Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

If you only want the data type of each column, use dtypes

df.dtypes
# Output:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

View the column names

df.columns
# Output:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

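At this point we have a basic picture of the Titanic data:
1. The data has 891 rows and 12 columns;
2. The columns [Age], [Cabin] and [Embarked] have missing values;
3. Some columns need a different data type. For example, [PassengerId] is an identifier, so it should be text rather than a number.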
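The missing-value counts and the type change from the list above can also be checked directly in code. This is a minimal sketch on a tiny made-up frame (the values in `sample` are illustrative, not taken from train.csv); in the notebook you would run the same calls on `df`.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for train.csv (illustrative values only).
sample = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
})

# Number of missing values in each column.
missing_counts = sample.isnull().sum()
print(missing_counts)

# Treat the ID as a text label rather than a number.
sample['PassengerId'] = sample['PassengerId'].astype(str)
print(sample['PassengerId'].tolist())
```

On the real data, `df.isnull().sum()` should agree with the non-null counts that `df.info()` printed.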

To get to know the data better, we can look at the first N or last N rows

df.head()  # shows the first 5 rows

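PS: if your own dataset has many columns, you can use pd.set_option('display.max_columns', 1000) to show all of them.
In the first 5 rows shown, the [Cabin] column has three NaN (missing) values. Missing values can be handled by deleting the rows or by filling them with the median or the mean. Deleting reduces the amount of data: that is acceptable for a large dataset, but with a small one it can distort the whole analysis. Filling with the mean or the median is the more common approach.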

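III. Data cleaning
As noted above, three columns have missing values. We fill them rather than delete them. The [Age] column is filled with its mean.

The [Embarked] column has only two missing values. It is a text column, so it has no median; here we fill it with the most frequent value (the mode).

The [Cabin] column has too many missing values, so we leave it alone for now.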
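The filling code is not shown above, so here is a minimal sketch of how it might look. It assumes mean filling for Age and, because Embarked holds text, the most frequent value for Embarked. The frame `sample` and the name `most_common_port` are made up for illustration; in the notebook the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; the real notebook would use df instead.
sample = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0, np.nan],
    'Embarked': ['S', 'S', None, 'C'],
})

# Fill missing ages with the mean of the known ages.
sample['Age'] = sample['Age'].fillna(sample['Age'].mean())

# Fill missing ports with the most frequent port (assumption: mode filling).
most_common_port = sample['Embarked'].mode()[0]
sample['Embarked'] = sample['Embarked'].fillna(most_common_port)

print(sample)
```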

That completes the data cleaning. Next comes the analysis and visualization.

IV. Data analysis

1. Single-factor analysis

1.1 First, count the survivors on the Titanic and compute the survival rate

total_survived = df.Survived.value_counts()  # count each value in the [Survived] column
total_survived.index = total_survived.index.astype('str')  # turn the index 0, 1 into the strings '0', '1'

# Draw the charts
plt.figure(figsize=(10, 5), dpi=80)  # figsize sets the canvas size, dpi the resolution

plt.subplot(121)  # first of two side-by-side subplots
# Look the counts up by label: value_counts() sorts by frequency,
# so position 0 holds '0' (not survived), not '1'
plt.bar('0', total_survived['0'], label='not survived', color='#FF9361', align='center')
plt.bar('1', total_survived['1'], label='survived', color='#39CC6A', align='center')
plt.xlabel('survived or not', fontsize=10)
plt.ylabel('count', fontsize=10)
plt.ylim(0, 600)
plt.legend()
plt.title('Survival Count')

plt.subplot(122)  # second subplot
plt.pie(total_survived, labels=total_survived.index, colors=['#FF9361', '#39CC6A'],
        autopct='%3.0f%%', startangle=230, pctdistance=0.6, labeldistance=1.1)
plt.title('Survival Rate')
plt.axis('equal')  # draw the pie as a circle

plt.show()

The charts show that of these 891 passengers, 38% survived and 62% did not, so the death rate was very high.
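The same percentages can be read off without a chart. The sketch below uses a short made-up `survived` series purely for illustration; on the real data the equivalent calls would be made on `df['Survived']`.

```python
import pandas as pd

# Made-up 0/1 outcomes for illustration (1 = survived, 0 = did not survive).
survived = pd.Series([0, 1, 1, 0, 0, 1, 0, 0], name='Survived')

counts = survived.value_counts()               # absolute counts per outcome
rates = survived.value_counts(normalize=True)  # shares per outcome
survival_rate = survived.mean()                # mean of a 0/1 column is the survival rate

print(counts)
print(rates)
print('survival rate: %.1f%%' % (survival_rate * 100))
```

Because Survived is coded as 0 and 1, its mean is the share of survivors, which is often the shortest way to get the rate.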

1.2 Next, the relationship between Pclass and survival

df['Pclass'] = df['Pclass'].astype('str')  # convert the numeric class to a string
df_survived = df['Pclass'][df['Survived'] == 1]  # Pclass values of the survivors (the original index is kept)
df_not_survived = df['Pclass'][df['Survived'] == 0]

plt.figure(figsize=(5, 5), dpi=80)
# Note the comma between the two series
plt.hist([df_survived, df_not_survived], stacked=True,
         color=['#39CC6A', '#FF9361'], label=['Survived', 'not Survived'])
plt.xticks(['1', '2', '3'], ['Upper', 'Middle', 'Lower'])
plt.legend()
plt.xlabel('Pclass', fontsize=10)
plt.ylabel('count', fontsize=10)
plt.title('Pclass_Survived')
plt.ylim(0, 600)

plt.show()

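Conclusion: third class carried the most passengers yet had the lowest survival rate; the higher the class, the higher the survival rate.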
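A stacked histogram shows counts, so the survival rate per class has to be judged by eye. A small table makes it explicit. This is a sketch on made-up rows (the frame `sample` and the column names in `summary` are illustrative); on the real data the same groupby would be run on `df`.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    'Pclass': ['1', '1', '2', '2', '3', '3', '3', '3'],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

# Passengers, survivors and survival rate for each class.
summary = sample.groupby('Pclass')['Survived'].agg(
    passengers='count',
    survivors='sum',
    survival_rate='mean',
)
print(summary)
```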

1.3 Next, the effect of sex on survival

df_sex1 = df['Sex'][df['Survived'] == 1]
df_sex0 = df['Sex'][df['Survived'] == 0]

plt.figure(figsize=(5, 5), dpi=80)
plt.hist([df_sex1, df_sex0], stacked=True,
         color=['#39CC6A', '#FF9361'], label=['Survived', 'not Survived'], rwidth=10)
plt.xticks([-1, 0, 1, 2], [-1, 'F', 'M', 2])
plt.legend()
plt.xlabel('Sex', fontsize=10)
plt.ylabel('count', fontsize=10)
plt.title('Sex_Survived')

plt.show()

The data show that there were more men than women on board, but the survival rate of women was far higher than that of men.

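1.4 The effect of age on survival

First, split age into bands

def age_level(age):
    if age <= 9:
        return '1'
    elif age <= 24:
        return '2'
    elif age <= 59:
        return '3'
    else:
        return '4'

# Not shown in the original post, but the plotting code below reads
# df['age_level'], so the band has to be stored as a column first
df['age_level'] = df['Age'].apply(age_level)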

The bands are: 9 or younger is a child, 24 or younger is a youth, 59 or younger is middle-aged, and older than 59 is elderly.
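As an alternative to a hand-written function, pandas can do the same banding with `pd.cut`. The sketch below is not the approach used in this post; it assumes the same cut points (9, 24, 59) and uses a made-up `ages` series for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ages, chosen to sit on and around the band edges.
ages = pd.Series([4.0, 9.0, 18.0, 24.0, 35.0, 59.0, 71.0])

# Bins are right-inclusive by default: (-inf, 9], (9, 24], (24, 59], (59, inf)
age_group = pd.cut(
    ages,
    bins=[-np.inf, 9, 24, 59, np.inf],
    labels=['child', 'youth', 'middle', 'elderly'],
)
print(age_group.tolist())
```

One practical difference: `pd.cut` returns an ordered categorical, so the bands keep their natural order in groupby results and plots.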

plt.figure(figsize=(5, 5), dpi=80)
df_age1 = df['age_level'][df['Survived'] == 1]
df_age0 = df['age_level'][df['Survived'] == 0]
plt.hist([df_age1, df_age0], stacked=True,
         color=['#39CC6A', '#FF9361'], label=['Survived', 'not Survived'])
plt.xticks(['1', '2', '3', '4'], ['child', 'youth', 'middle', 'elderly'])
plt.legend()
plt.xlabel('Age_level', fontsize=10)
plt.ylabel('count', fontsize=10)
plt.title('Age_Survived')
plt.ylim(0, 700)

plt.show()

The middle-aged group (over 24 and up to 59) is the largest and also has the most survivors; children (0-9) have the highest survival rate.

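2. Multi-factor analysis

2.1 The combined effect of Sex and Pclass on survival

First import a few more libraries

from __future__ import division  # only has an effect on Python 2
from scipy import stats
import seaborn as sns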

Select the survivors

survives_passenger_df = df[df['Survived'] == 1]

# A few commonly used helpers

# Group the passengers by xx and count each group
def xx_group_all(df, xx):
    # number of passengers in each group after grouping by xx
    return df.groupby(xx)['PassengerId'].count()

# Survival rate of each group
def group_passenger_survived_rate(xx):
    # passengers per group
    group_all = xx_group_all(df, xx)
    # survivors per group
    group_survived_value = xx_group_all(survives_passenger_df, xx)
    # share of survivors in each group
    return group_survived_value / group_all

# Draw a pie chart
def print_pie(group_data, title):
    group_data.plot.pie(title=title, figsize=(6, 6), autopct='%.2f%%',
                        startangle=90, legend=True)

# Draw a bar chart with the percentage written above each bar
def print_bar(data, title):
    bar = data.plot.bar(title=title)
    for p in bar.patches:
        bar.annotate('%.2f%%' % (p.get_height() * 100),
                     (p.get_x() * 1.005, p.get_height() * 1.005))

print_bar(group_passenger_survived_rate(['Sex', 'Pclass']), 'Sex_Pclass_Survived')

The chart shows that Sex affects the survival rate more than Pclass does; within each sex, the survival rate by class is ordered 1 > 2 > 3.
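A two-factor survival rate can also be laid out as a table with one factor on the rows and the other on the columns. This is a sketch using `pivot_table` on made-up rows (the frame `sample` is illustrative); it is an alternative to the helper functions above, not a description of them.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'male', 'male', 'male', 'male', 'female'],
    'Pclass': ['1', '1', '3', '1', '3', '3', '3', '3'],
    'Survived': [1, 1, 0, 0, 0, 1, 0, 1],
})

# Mean of the 0/1 Survived column per (Sex, Pclass) cell = survival rate.
rate_table = sample.pivot_table(values='Survived', index='Sex',
                                columns='Pclass', aggfunc='mean')
print(rate_table)
```

Taking the mean of Survived gives 0 for a group with no survivors, whereas dividing the survivor counts by the group counts yields NaN for such a group, because it is absent from the survivor table.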

2.2 The combined effect of sex and age on the survival rate

# Count the passengers in each group (the second argument holds the grouping keys)
def Pclass_survived_all(data, Pclass):
    return data.groupby(Pclass)['Sex'].count()

dd0 = df[['age_level', 'Sex', 'Pclass']]
dd11 = df[['age_level', 'Sex', 'Pclass']][df['Survived'] == 1]
c = Pclass_survived_all(dd11, ['age_level', 'Sex', 'Pclass'])
dd0['Sex'].count()

# Survival rate of each group
def Pclass_survived_probability(data):
    # survivors per group
    groupby_survived = Pclass_survived_all(dd11, data)
    # passengers per group
    groupby_survived_all = Pclass_survived_all(dd0, data)
    return groupby_survived / groupby_survived_all

print_bar(Pclass_survived_probability(['Sex', 'age_level']), 'Sex_Age_Survived')

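Sex has the larger effect on the survival rate: women > men.
After that, children have a higher survival rate than youths, the middle-aged and the elderly. Youths and the middle-aged have similar survival rates, and the elderly have the lowest.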

2.3 The combined effect of age and passenger class on the survival rate

print_bar(Pclass_survived_probability(['age_level', 'Pclass']), 'Age_Pclass_Survived')

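Passenger class affects the survival rate more than age does.
The older the passenger, the lower the survival rate; the lower the class, the lower the survival rate.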

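V. Conclusion
The analysis suggests that the factor with the greatest influence on the survival rate is passenger class, followed by sex; age group also affects the survival rate.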

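VI. Limitations of the analysis

No statistical test was run to see whether these results could have arisen by chance, so we do not know whether they reflect real differences or just noise.
The age field has missing values. Because age is continuous, they were filled with the mean age of all passengers, which shrinks the differences between ages and also affects the results.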
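One way to address the first limitation is a chi-square test of independence, for example between sex and survival. The sketch below uses `scipy.stats.chi2_contingency` on made-up counts (the frame `sample` is illustrative and its numbers are not the Titanic figures); on the real data the contingency table would be built from `df`.

```python
import pandas as pd
from scipy import stats

# Made-up data for illustration: 40 women (30 survived), 60 men (12 survived).
sample = pd.DataFrame({
    'Sex': ['female'] * 40 + ['male'] * 60,
    'Survived': [1] * 30 + [0] * 10 + [1] * 12 + [0] * 48,
})

# Contingency table of counts: rows are Sex, columns are Survived.
table = pd.crosstab(sample['Sex'], sample['Survived'])

# Test whether survival is independent of sex.
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(table)
print('chi2 = %.2f, dof = %d, p = %.4g' % (chi2, dof, p_value))
```

A small p-value means a difference this large would be unlikely if sex and survival were unrelated. It still says nothing about cause.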

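VII. Correlation, not causation
The data do not come from a controlled experiment, so we cannot claim causal relationships between the variables, only that they are correlated.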

VIII. References
https://www.jianshu.com/p/17f99100525a
https://zhuanlan.zhihu.com/p/30920420
