Titanic (Titanic Survival Prediction) --- (1)

Published: 2025/3/15

I am a beginner, so please point out any problems you find. Let's keep at it and improve together!
About the data and the code:

# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

Reading the data

train_df = pd.read_csv('data/泰坦尼克号生存率/train.csv')
test_df = pd.read_csv('data/泰坦尼克号生存率/test.csv')
combine = [train_df, test_df]

# Column names and the first five samples
print(train_df.columns.values)
train_df.head()

# Check how much data is missing in each dataset
train_df.info()
print('_'*50)
test_df.info()

out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 66.2+ KB
__________________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)

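Conclusions:
Missing data:

Training data: Cabin is missing most of its values (204 of 891 present), Age is partly missing (714 of 891 present), and Embarked is missing only two values.
Test data: Cabin is missing the most (91 of 418 present), followed by Age (332 of 418 present); Fare is missing a single value.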
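The counts above can also be read off directly rather than inferred from the `info()` output. The following is a minimal sketch on a made-up toy frame (`toy_df` and its values are illustrative assumptions, not the Titanic data); the same call should work on `train_df` and `test_df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# Count the missing entries in each column, largest first.
missing_counts = toy_df.isnull().sum().sort_values(ascending=False)
print(missing_counts)
```

Sorting the result puts the column with the most gaps first, which matches the order in which the conclusions above list them.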

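Data types:
Training data: 7 numeric columns (2 float64, 5 int64) + 5 object columns
Test data: 6 numeric columns (2 float64, 4 int64) + 5 object columns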
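One possible way to get the numeric / non-numeric split without counting by hand is `select_dtypes`. This is only a sketch on a hypothetical toy frame (`toy_df` is an assumption for illustration); on the real frames, `train_df` would take its place.

```python
import pandas as pd

# Toy frame with two numeric columns and one text column (made-up values).
toy_df = pd.DataFrame({
    "Pclass": [3, 1],
    "Fare": [7.25, 71.28],
    "Sex": ["male", "female"],
})

# Split the column names into numeric and non-numeric groups.
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
other_cols = toy_df.select_dtypes(exclude="number").columns.tolist()
print(len(numeric_cols), "numeric +", len(other_cols), "non-numeric")
```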

Handling the missing data

How the missing data is handled
Start with Embarked, which has the fewest missing values:

# Only two values are missing, so most methods would work; to keep it simple, fill in the most frequent value
freq_port = train_df.Embarked.dropna().mode()[0]  # the most frequent port
freq_port
for dataset in combine:
    # Where Embarked is empty, insert the most frequent value
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
train_df.info()
# Group by Embarked and take the mean of Survived, i.e. the survival rate for each port
train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

The output shows that Embarked is now completely filled in and that port C has the highest survival rate.

out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 66.2+ KB

  Embarked  Survived
0        C  0.553571
1        Q  0.389610
2        S  0.339009
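To see the most-frequent-value fill in isolation, here is a small sketch on toy frames. The names `toy_train` and `toy_test` and their values are assumptions made for illustration only.

```python
import pandas as pd

# Toy stand-ins for train_df / test_df (made-up values).
toy_train = pd.DataFrame({"Embarked": ["S", "C", None, "S"]})
toy_test = pd.DataFrame({"Embarked": ["Q", None]})

# mode() ignores missing values by default; take the first (most frequent) entry.
most_common = toy_train["Embarked"].mode()[0]

# Fill the gaps in both frames with the value learned from the training frame.
for frame in (toy_train, toy_test):
    frame["Embarked"] = frame["Embarked"].fillna(most_common)

print(toy_train["Embarked"].tolist())
print(toy_test["Embarked"].tolist())
```

Taking the fill value from the training frame only keeps the test frame from influencing the preprocessing.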

Age is filled in with the mean (mean imputation)

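# Compute the mean from the training set; the leftover loop variable `dataset` would point at test_df here
age_mean = train_df['Age'].mean()
age_mean
for dataset in combine:
    # Fill missing ages with the mean age
    dataset['Age'] = dataset['Age'].fillna(age_mean)
train_df.info()
train_df[['Age', 'Survived']].groupby(['Age'], as_index=False).mean().sort_values(by='Survived', ascending=False)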
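The mean imputation step can be checked on a tiny example. This is a sketch under the assumption that the fill value should come from the training frame alone; `toy_train` and `toy_test` are hypothetical frames with made-up ages.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train_df / test_df (made-up ages).
toy_train = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 10.0]})

# mean() skips NaN, so this is the mean of the observed training ages.
train_age_mean = toy_train["Age"].mean()

# Use the same training mean to fill both frames.
for frame in (toy_train, toy_test):
    frame["Age"] = frame["Age"].fillna(train_age_mean)

print(toy_train["Age"].tolist())
print(toy_test["Age"].tolist())
```

Note that filling many rows with a single value concentrates the Age distribution around the mean, so the later group-by on Age will show one unusually large group at that value.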

Cabin can simply be dropped

  • Too much of the data is missing
  • The feature has little correlation with the survival rate

# Name and Ticket are dropped as well; PassengerId is dropped from train_df only and kept in test_df
train_df = train_df.drop(['Name', 'PassengerId'], axis=1)
test_df = test_df.drop(['Name'], axis=1)
combine = [train_df, test_df]
train_df.shape, test_df.shape

test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
combine = [train_df, test_df]
train_df.shape, test_df.shape
#train_df.head()
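The "too much missing" argument can be made concrete by looking at the fraction of missing values per column; from the `info()` output earlier, Cabin is present in only 204 of 891 training rows, so roughly 77% is missing. The sketch below uses a made-up toy frame, and the 0.5 cutoff is a hypothetical threshold chosen for illustration, not a value from the original post.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (made-up values).
toy_df = pd.DataFrame({
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull().mean() gives the fraction of missing values in each column.
missing_fraction = toy_df.isnull().mean()

# Hypothetical cutoff: drop columns that are more than half empty.
threshold = 0.5
to_drop = missing_fraction[missing_fraction > threshold].index.tolist()
reduced_df = toy_df.drop(columns=to_drop)
print(to_drop, reduced_df.shape)
```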

Normalizing the data

For
