Titanic Data: Data Analysis - Predicting Titanic Passenger Survival

Published: 2023/12/2


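Project Background

Goal

Predict whether a given passenger survived the sinking of the Titanic.

Overview

On April 15, 1912, the Titanic sank after striking an iceberg during its maiden voyage. Of the 2,224 people on board (passengers and crew), 1,502 died. One reason for the heavy loss of life was that there were not enough lifeboats for the passengers and crew. Although surviving the sinking involved some luck, certain groups were more likely to survive than others, such as women, children, and the upper class. The task here is to use machine learning tools to predict which people survived the disaster.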

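Understanding the Data

Data Overview

The Titanic survival prediction task comes with two datasets, train.csv and test.csv, which serve as the training set and the test set respectively.

First, load the data: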

import pandas as pd
import numpy as np
import re

# Load the data
train_data = pd.read_csv('train.csv')
# Preview the data
train_data.head(2)

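As shown, the training set has 12 columns. The Survived field indicates whether the passenger survived; the remaining columns describe the passenger:

* PassengerId: passenger ID
* Pclass: passenger class
* Name: name
* Sex: sex
* Age: age
* SibSp: number of siblings and spouses aboard
* Parch: number of parents and children aboard
* Ticket: ticket information
* Fare: fare
* Cabin: cabin information
* Embarked: port of embarkation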

View the overall information for the dataset:

train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

train_data.describe()

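From the output above, the training set contains 891 passengers, but some attributes are incomplete, notably Age (714 values) and Cabin (204 values). About 38% of passengers survived. The mean age is about 29; the minimum Age is 0.42, which presumably belongs to an infant, and the maximum is 80. The mean Fare is about 32 while the median is about 14, so the mean is roughly 2.2 times the median, which indicates a strongly right-skewed distribution. The maximum Fare of 512 may well be an outlier.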
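The mean-versus-median comparison can be checked numerically. The following is a minimal sketch on a small made-up fare sample (not the values in train.csv), showing the two signals one might look at: the mean/median ratio and the sample skewness.

```python
import pandas as pd

# Hypothetical fares: mostly cheap tickets plus one very expensive one.
fares = pd.Series([7.25, 7.9, 8.05, 13.0, 14.45, 26.0, 31.0, 512.33])

mean_fare = fares.mean()
median_fare = fares.median()
ratio = mean_fare / median_fare

# A mean well above the median and a positive skewness both point to a
# right-skewed distribution.
print(f"mean={mean_fare:.2f}, median={median_fare:.2f}, ratio={ratio:.2f}")
print(f"skewness={fares.skew():.2f}")
```

On the real data, the same two calls could be applied to train_data['Fare'].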

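Preliminary Analysis

Analyze the relationship between each attribute and survival, choosing a suitable visualization for each.

Data Types

Numerical: passenger ID (PassengerId), age (Age), fare (Fare), number of siblings and spouses (SibSp), number of parents and children (Parch)

Categorical:

Directly categorical: sex (Sex), passenger class (Pclass), port of embarkation (Embarked)

Features to be extracted: passenger name (Name), cabin number (Cabin), ticket number (Ticket)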

import matplotlib.pyplot as plt
import seaborn as sns

Visualization

Start with the three categorical features that are easiest to work with: Sex, Pclass, and Embarked.

fig, axes = plt.subplots(1, 3, figsize=(20, 6))
# Pass the column name as x=; recent seaborn versions no longer accept it positionally
sns.countplot(x='Sex', hue='Survived', data=train_data, ax=axes[0])
sns.countplot(x='Pclass', hue='Survived', data=train_data, ax=axes[1])
sns.countplot(x='Embarked', hue='Survived', data=train_data, ax=axes[2])

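Looking at how each feature is distributed against the target variable, a few preliminary conclusions follow:

* Sex: there are more men than women in total, but women have a far higher survival rate than men.
* Pclass: first class has the highest survival rate, and third class is clearly lower than the others. Third class held mostly ordinary passengers, while the higher classes were more likely to hold people of higher social standing at the time.
* Embarked: port S has the most passengers but the lowest survival rate.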

Survival by sex within each passenger class:

train_data[['Sex', 'Pclass', 'Survived']].groupby(['Pclass', 'Sex']).mean()

train_data[['Sex', 'Pclass', 'Survived']].groupby(['Pclass', 'Sex']).mean().plot.bar()

Number of relatives aboard versus survival: SibSp & Parch

fig, axes = plt.subplots(1, 2, figsize=(16, 6))
train_data[['SibSp', 'Survived']].groupby('SibSp').mean().plot.bar(ax=axes[0])
train_data[['Parch', 'Survived']].groupby('Parch').mean().plot.bar(ax=axes[1])

Judging by survival rate against the number of relatives aboard, passengers travelling alone had a relatively low survival rate.

Age Feature Analysis: Age

Distribution of age:

fig, axes = plt.subplots(1, 2, figsize=(16, 6))
train_data['Age'].hist(bins=70, ax=axes[0])
axes[0].set_title('Age')
# Draw the box plot on the second subplot
train_data.boxplot(column='Age', ax=axes[1])

facet = sns.FacetGrid(train_data, aspect=4, row='Sex')
# fill=True replaces the deprecated shade=True
facet.map(sns.kdeplot, 'Age', fill=True)
facet.set(xlim=(0, train_data['Age'].max()))
facet.add_legend()

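Survival distribution by age:

facet = sns.FacetGrid(train_data, hue='Survived', aspect=4)
# fill=True replaces the deprecated shade=True
facet.map(sns.kdeplot, 'Age', fill=True)
facet.set(xlim=(0, train_data['Age'].max()))
facet.add_legend()

facet = sns.FacetGrid(train_data, hue='Survived', aspect=4, row='Sex')
facet.map(sns.kdeplot, 'Age', fill=True)
facet.set(xlim=(0, train_data['Age'].max()))
facet.add_legend()

Overall, children from 0 to their early teens have the highest survival rate, while passengers aged roughly 20-30 have a lower one. Among men, children from 0 to their early teens survive at a clearly higher rate; among women, the 30-40 age group has the higher survival rate.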

Fare Feature Analysis: Fare

train_data['Fare'].describe()

count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

train_data['Fare'].hist(bins=10)

train_data['Fare'][train_data['Survived']==0].mean()
22.117886885245877

train_data['Fare'][train_data['Survived']==1].mean()
48.39540760233917

The output shows that cheap tickets are numerous and expensive tickets few, and that the average fare of survivors is more than twice that of those who died.

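Passenger Name, Cabin Number, Ticket Number

Passenger Name Feature: Name

# Define a function that extracts the title from a name
def getTitle(name):
    str1 = name.split(',')[1]   # part after the surname
    str2 = str1.split('.')[0]   # part before the first period
    str3 = str2.strip()         # remove surrounding spaces
    return str3

Title = pd.DataFrame()
Title['Title'] = train_data['Name'].map(getTitle)
Title.head()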

Cabin Feature: Cabin

train_data['Cabin'].describe()

count             204
unique            147
top       C23 C25 C27
freq                4
Name: Cabin, dtype: object

Cabin has too many missing values, with only 204 valid entries. During feature engineering it can either be dropped or simply reduced to whether a cabin record exists.

train_data['Cabin'] = train_data['Cabin'].fillna('U0')
train_data['Has_cabin'] = train_data['Cabin'].apply(lambda x: 0 if x == 'U0' else 1)
train_data[['Has_cabin', 'Survived']].groupby('Has_cabin').mean().plot.bar()

The analysis shows that passengers with a recorded cabin have a higher survival rate.

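Feature Engineering

Data Preparation

Feature engineering covers several aspects:

1. Variable transformation

The purpose of variable transformation is to convert the data into a form the model can use; different models accept different data types. Scikit-learn requires all data to be numeric, so the original data types must be converted to numeric.

All data falls into two classes:

* Quantitative variables, such as Age
* Qualitative variables, such as Embarked

Transforming qualitative data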

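* One-hot encoding (dummy variables), pd.get_dummies(): suitable for features with a small set of possible values, such as gender = {'male', 'female'}.
* Factorizing, pd.factorize(): maps identical strings to the same integer. This mapping produces only a single feature, unlike get_dummies, which produces several.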
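To make the difference between the two encodings concrete, here is a minimal sketch on a made-up four-row column (the data is hypothetical, not taken from train.csv). It shows that get_dummies yields one column per category while factorize yields a single integer column.

```python
import pandas as pd

# Hypothetical toy column, not taken from train.csv.
gender = pd.Series(["male", "female", "female", "male"], name="gender")

# One-hot encoding: one 0/1 column per category.
dummies = pd.get_dummies(gender, prefix="gender", dtype=int)

# Factorizing: a single integer column; codes follow order of first appearance.
codes, uniques = pd.factorize(gender)

print(dummies)
print(codes, list(uniques))
```

Because factorize numbers the categories in order of first appearance, the meaning of each code depends on the row order of the data.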

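Transforming quantitative data

Scaling

Drawbacks of unscaled data: 1. the data is harder to visualize; 2. large differences in range can give features with large values more weight, which affects the results of models that are sensitive to feature magnitude.

Common scaling tools include: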

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import minmax_scale
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import Normalizer
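As a rough illustration of how some of these scalers differ, the sketch below applies three of them to a made-up single feature that contains one large outlier. The input values are hypothetical; only the scikit-learn classes come from the list above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Hypothetical single feature with one large outlier.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax = MinMaxScaler().fit_transform(x)      # rescales to [0, 1]
standard = StandardScaler().fit_transform(x)  # zero mean, unit variance
robust = RobustScaler().fit_transform(x)      # centers on median, scales by IQR

print(minmax.ravel())
print(standard.ravel())
print(robust.ravel())
```

With an outlier present, MinMaxScaler squeezes the ordinary values close to 0, whereas RobustScaler keeps them spread out, which may matter for a skewed feature such as Fare.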

Binning: discretizes continuous data by distributing the values into a number of "bins", just as the bins of a histogram divide the data into segments.
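Binning is not applied later in this article, so the following is only a sketch of what it could look like in pandas. The ages, bin edges, and labels are all made up for illustration.

```python
import pandas as pd

ages = pd.Series([0.42, 5, 17, 24, 29, 38, 52, 80])  # hypothetical ages

# Bins with explicit edges and labels (the edges are an assumption).
age_band = pd.cut(ages, bins=[0, 12, 18, 40, 60, 100],
                  labels=["child", "teen", "adult", "middle", "senior"])

# Quantile bins: each bin holds roughly the same number of rows.
age_quartile = pd.qcut(ages, q=4, labels=False)

print(age_band.tolist())
print(age_quartile.tolist())
```

pd.cut suits cases where the boundaries have a meaning of their own, while pd.qcut suits cases where evenly populated bins are wanted.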

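2. Missing value handling

3. Feature engineering

Derived variables: derive new features from existing ones.

During feature engineering, the test data must be processed together with the training data so that the two share the same data types and distributions.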

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
# Placeholder label so that both frames have the same columns
test_data['Survived'] = 0
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
# sort=True orders the columns alphabetically, matching the output below.
df = pd.concat([train_data, test_data], sort=True)
df.head(2)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       1309 non-null int64
Ticket         1309 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 132.9+ KB

# Compute the fraction of missing values per column
df.isnull().sum()/len(df)

Age            0.200917
Cabin          0.774637
Embarked       0.001528
Fare           0.000764
Name           0.000000
Parch          0.000000
PassengerId    0.000000
Pclass         0.000000
Sex            0.000000
SibSp          0.000000
Survived       0.000000
Ticket         0.000000
dtype: float64

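After merging there are 1,309 rows. Age, Cabin, Embarked, and Fare have missing values, which are handled case by case.

Fare

Fare is missing only one value, which can be filled with the mean.

df['Fare'] = df['Fare'].fillna(df['Fare'].mean())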

Embarked: Port of Embarkation

Embarked is missing only 2 values, which can be filled with the mode.

df['Embarked'].value_counts()

S    914
C    270
Q    123
Name: Embarked, dtype: int64

df['Embarked'] = df['Embarked'].fillna('S')

There are three distinct ports; convert them to numeric data with dummy variables.

# For the later feature analysis, factorize Embarked first
# (codes are assigned in order of first appearance)
df['Embarked'] = pd.factorize(df['Embarked'])[0]
# Use get_dummies to obtain the one-hot encoding
embarked_dummies = pd.get_dummies(df['Embarked'], prefix='Embarked')
df = pd.concat([df, embarked_dummies], axis=1)

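Sex

Sex has no missing values; it only needs to be converted to numeric data.

# Codes are assigned in order of first appearance
df['Sex'] = pd.factorize(df['Sex'])[0]
sex_dummies = pd.get_dummies(df['Sex'], prefix='Sex')
df = pd.concat([df, sex_dummies], axis=1)

Pclass: Passenger Class

# After factorizing, the codes 0/1/2 follow order of first appearance,
# not the original class numbers 1/2/3
df['Pclass'] = pd.factorize(df['Pclass'])[0]
pclass_dummies = pd.get_dummies(df['Pclass'], prefix='Pclass')
df = pd.concat([df, pclass_dummies], axis=1)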

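Cabin: Cabin Number

Cabin has too many missing values (a missing rate of 77%), which makes it hard to analyze, and using it directly as an input feature would also affect the model. It can be dropped. However, whether a cabin number exists at all can be turned into a derived feature, Has_cabin.

# Fill missing entries with U0
df['Cabin'] = df['Cabin'].fillna('U0')
df['Has_cabin'] = df['Cabin'].apply(lambda x: 0 if x == 'U0' else 1)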

Name

The names include a title that carries information about the passenger's status, so it needs to be extracted from the name.

# Extract the title from Name
df['Title'] = df['Name'].map(lambda x: x.split(',')[1].split('.')[0].strip())
# Build a mapping dictionary from raw titles to grouped titles
Title_dictionary = {
    'Capt': 'Officer',
    'Col': 'Officer',
    'Major': 'Officer',
    'Jonkheer': 'Royalty',
    'Don': 'Royalty',
    'Sir': 'Royalty',
    'Dr': 'Officer',
    'Rev': 'Officer',
    'the Countess': 'Royalty',
    'Dona': 'Royalty',
    'Mme': 'Mrs',
    'Mlle': 'Miss',
    'Ms': 'Mrs',
    'Mr': 'Mr',
    'Mrs': 'Mrs',
    'Miss': 'Miss',
    'Master': 'Master',
    'Lady': 'Royalty'
}
df['Title'] = df['Title'].map(Title_dictionary)
title_dummies = pd.get_dummies(df['Title'], prefix='Title')
df = pd.concat([df, title_dummies], axis=1)

Parch and SibSp

The earlier analysis showed that the number of relatives aboard affects Survived. Here the two columns are combined into a FamilySize feature, while the two original columns are kept for now.

family = pd.DataFrame()
# Family size = parents/children + siblings/spouses + the passenger
family['FamilySize'] = df['Parch'] + df['SibSp'] + 1
family['Family_Single'] = family['FamilySize'].map(lambda x: 1 if x == 1 else 0)
family['Family_Small'] = family['FamilySize'].map(lambda x: 1 if 2 <= x <= 4 else 0)
family['Family_Large'] = family['FamilySize'].map(lambda x: 1 if 5 <= x else 0)
family.head()

df = pd.concat([df, family], axis=1)
df.head()

Age

# Fill missing ages with the mean age
df['Age'] = df['Age'].fillna(df['Age'].mean())

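Feature Selection

Analyze the correlations between features. Note that df also contains the 418 test rows whose Survived value is only the placeholder 0, so the coefficients below are distorted by those rows (the strongly negative value for PassengerId, for example, is an artifact of this).

# numeric_only=True (pandas 1.5+) skips the remaining text columns
corr_df = df.corr(numeric_only=True)
# View each feature's correlation coefficient with Survived
corr_df['Survived'].sort_values(ascending=False)

Survived         1.000000
Sex              0.404020
Sex_1            0.404020
Title_Miss       0.263140
Has_cabin        0.245239
Title_Mrs        0.235600
Pclass_1         0.208166
Family_Small     0.202162
Pclass           0.175184
Fare             0.173630
Embarked_1       0.096513
Pclass_2         0.062279
Title_Master     0.058265
Parch            0.054908
Embarked         0.048409
Title_Royalty    0.036875
FamilySize       0.020555
Embarked_2      -0.012730
Title_Officer   -0.013356
SibSp           -0.014375
Age             -0.060203
Embarked_0      -0.077095
Family_Large    -0.081979
Family_Single   -0.154285
Pclass_0        -0.231169
PassengerId     -0.331493
Sex_0           -0.404020
Title_Mr        -0.411211
Name: Survived, dtype: float64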
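Because the test rows were given Survived = 0 as a placeholder before merging, one option is to compute the correlations on the labelled rows only. The sketch below illustrates the effect on a tiny made-up frame; the `is_train` column is hypothetical and not part of the original code (in the article, the first 891 rows of df are the labelled ones).

```python
import pandas as pd

# Made-up stand-in for the combined frame: the last three rows play the role
# of test rows whose Survived value is only a placeholder 0.
df_demo = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6, 7],
    "Sex": [0, 1, 1, 0, 1, 1, 1],
    "Survived": [0, 1, 1, 0, 0, 0, 0],
    "is_train": [True, True, True, True, False, False, False],
})

# Correlation over all rows mixes real labels with placeholders.
corr_all = df_demo.drop(columns="is_train").corr()["Survived"]

# Correlation over labelled rows only.
labelled = df_demo[df_demo["is_train"]].drop(columns="is_train")
corr_train = labelled.corr()["Survived"]

print(corr_all)
print(corr_train)
```

In this toy example the placeholder rows weaken the Sex correlation and introduce a negative PassengerId correlation that the labelled rows alone do not show.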

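Standardization

Standardization mainly removes differences caused by features having different units and value ranges. Such differences not only give features unequal weight but also make visualization harder.

Use the sklearn.preprocessing.StandardScaler class. Its advantage is that it stores the parameters fitted on the dataset (mean and variance), so the same transformation can be applied again later.

Here Age and Fare are standardized:

from sklearn import preprocessing

# Fit the scaler on Age and Fare, then replace both columns with scaled values
scale_age_fare = preprocessing.StandardScaler().fit(df[['Age', 'Fare']])
df[['Age', 'Fare']] = scale_age_fare.transform(df[['Age', 'Fare']])
df.head(2)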
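The stored statistics can be inspected directly on a fitted scaler. A minimal sketch on made-up [Age, Fare] rows (the numbers are hypothetical), showing the `mean_` and `var_` attributes and their reuse on a row that was not part of the fit:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical [Age, Fare] rows used to fit the scaler.
train_values = np.array([[20.0, 10.0], [30.0, 20.0], [40.0, 30.0]])
scaler = StandardScaler().fit(train_values)

# The fitted statistics are stored on the object.
print(scaler.mean_)  # per-column mean
print(scaler.var_)   # per-column (population) variance

# The same statistics are reused for rows that were not part of the fit.
new_values = np.array([[30.0, 40.0]])
scaled_new = scaler.transform(new_values)
print(scaled_new)
```

Keeping the fitted object around means later data can be scaled with exactly the same mean and variance as the data used for fitting.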

Dropping Unused Features

During feature engineering we extracted many features from the raw columns to feed into the model, but we still need to remove the features that are unused or non-numeric.

First, back up the data so it can be analyzed again later:

# copy() is required; plain assignment would only create another name for df
df_backup = df.copy()
df.drop(['PassengerId', 'Cabin', 'Embarked', 'Sex', 'Name', 'Title', 'Pclass', 'Parch', 'SibSp', 'Ticket', 'FamilySize'], axis=1, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 21 columns):
Age              1309 non-null float64
Fare             1309 non-null float64
Survived         1309 non-null int64
Embarked_0       1309 non-null uint8
Embarked_1       1309 non-null uint8
Embarked_2       1309 non-null uint8
Sex_0            1309 non-null uint8
Sex_1            1309 non-null uint8
Pclass_0         1309 non-null uint8
Pclass_1         1309 non-null uint8
Pclass_2         1309 non-null uint8
Has_cabin        1309 non-null int64
Title_Master     1309 non-null uint8
Title_Miss       1309 non-null uint8
Title_Mr         1309 non-null uint8
Title_Mrs        1309 non-null uint8
Title_Officer    1309 non-null uint8
Title_Royalty    1309 non-null uint8
Family_Single    1309 non-null int64
Family_Small     1309 non-null int64
Family_Large     1309 non-null int64
dtypes: float64(2), int64(5), uint8(14)
memory usage: 99.7 KB

df.head()
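When backing up a DataFrame before an in-place drop, it is worth knowing that plain assignment only creates a second name for the same object, whereas .copy() creates an independent frame. A minimal sketch on made-up data (the frame and variable names are hypothetical):

```python
import pandas as pd

df_demo = pd.DataFrame({"Age": [22, 38], "Ticket": ["A/5 21171", "PC 17599"]})

alias = df_demo            # same object, not a backup
backup = df_demo.copy()    # independent copy

df_demo.drop(["Ticket"], axis=1, inplace=True)

print(list(alias.columns))   # the alias lost the column too
print(list(backup.columns))  # the copy still has it
```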

Building the Model

Splitting into Training and Test Sets

# The first 891 rows come from train.csv, the rest from test.csv
train_data = df[:891]
test_data = df[891:]

train_data_X = train_data.drop(['Survived'], axis=1)
train_data_Y = train_data['Survived']
test_data_X = test_data.drop(['Survived'], axis=1)

train_data_X.shape
(891, 20)

from sklearn.model_selection import train_test_split

# Training and test sets used for building the model
train_X, test_X, train_y, test_y = train_test_split(train_data_X, train_data_Y, train_size=.8)

train_X.shape
(712, 20)

test_X.shape
(179, 20)
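The split above is random, so each run produces different rows in each part. The sketch below, on made-up data, shows two optional arguments that are additions to the original call: `random_state` for a repeatable split and `stratify` for keeping the class ratio equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 3 features, 40% positive labels.
X = np.arange(300).reshape(100, 3)
y = np.array([1] * 40 + [0] * 60)

# random_state makes the split repeatable; stratify keeps the class ratio
# the same in both parts. Both settings are additions to the original call.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```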

Choosing a Machine Learning Algorithm

Logistic regression

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

Training the Model

model.fit(train_X, train_y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
    intercept_scaling=1, max_iter=100, multi_class='warn',
    n_jobs=None, penalty='l2', random_state=None, solver='warn',
    tol=0.0001, verbose=0, warm_start=False)

Evaluating the Model

# Accuracy on the held-out 20% of the training data
model.score(test_X, test_y)
0.8212290502793296
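A score from a single 80/20 split depends on which rows happen to land in the held-out part. k-fold cross-validation is a common complement; it is not used in the original article, and the sketch below runs on synthetic data generated with make_classification rather than on the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 891 x 20 training matrix.
X, y = make_classification(n_samples=891, n_features=20, random_state=0)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each fold is held out once for scoring.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print(scores)
print(scores.mean(), scores.std())
```

The spread of the five scores gives a rough sense of how much a single-split accuracy could vary.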

Applying the Model

pred_Y = model.predict(test_data_X)
pred_Y

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
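The predictions are usually paired back with the passenger IDs before being saved. Since PassengerId was dropped from df earlier, the IDs would have to come from elsewhere, for example from test.csv. The sketch below assumes a two-column PassengerId/Survived table is wanted; the IDs, predictions, and file name are all hypothetical stand-ins.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical stand-ins: in the article these would be the PassengerId column
# of test.csv and the array returned by model.predict(test_data_X).
passenger_ids = pd.Series([892, 893, 894], name="PassengerId")
pred_demo = np.array([0, 1, 0])

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": pred_demo.astype(int),
})

# Written to an in-memory buffer here; a file path such as "submission.csv"
# (a hypothetical name) could be passed instead.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```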
