
Predicting Titanic Survival with Python

Published: 2024/7/23

Introduction

Titanic is an introductory Kaggle competition: given each passenger's class, sex, age, cabin type and other attributes, contestants predict whether the passenger survived the disaster. Details are at https://www.kaggle.com/; the analysis code in this post is also taken from a kernel of that competition on Kaggle.

Data Description

The data is provided in the following format:

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked

1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S

2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C

The fields mean the following:

PassengerId: passenger ID
Survived: survival, 0 = died, 1 = survived
Pclass: passenger class, 1 = highest, 3 = lowest
Name: name
Sex: sex
Age: age
SibSp: number of siblings and spouses aboard
Parch: number of parents and children aboard
Ticket: ticket number
Fare: ticket fare
Cabin: cabin number
Embarked: port of embarkation
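To get a feel for the format, the two sample rows above can be loaded with pandas directly (a sketch; the real competition files `train.csv` and `test.csv` come from Kaggle):

```python
import io
import pandas as pd

# The two sample rows from the data description above
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)           # 2 rows, 12 columns
print(df.loc[1, 'Fare'])  # Fare is the ticket price in pounds
print(df.isnull().sum())  # Cabin is often missing, as in the full data
```

Note that pandas handles the quoted names containing commas, and that the empty Cabin field in the first row becomes NaN.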

Problem Analysis

This is a fairly typical feature-based classification problem. Following the usual data-processing pipeline, the solution breaks down into the following steps:

Data preprocessing

Read the data; the code below uses the python pandas package to manage the data structures

Vectorize features; the code below converts the sex and port-of-embarkation features to numeric representations

Handle missing data; the code below fills missing ages with the median age and missing ports of embarkation with the most frequent value

Drop redundant columns; name, ID, cabin number and ticket number are assumed to carry no useful signal for this classification task, so those features are dropped
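The preprocessing steps above can be sketched on a toy frame (the values here are hypothetical; the real script below applies the same steps to `train.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame with the same kinds of gaps as the real data (hypothetical values)
df = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, np.nan, 38.0, 26.0],
    'Embarked': ['S', None, 'S', 'C'],
    'Name': ['A', 'B', 'C', 'D'],
})

# 1. Vectorize categorical features as integers
df['Gender'] = df['Sex'].map({'female': 0, 'male': 1})

# 2. Fill missing ages with the median, missing ports with the most frequent value
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# 3. Drop columns assumed useless for classification
df = df.drop(['Sex', 'Name'], axis=1)
```

After these three steps every remaining column is numeric or easily mapped to numbers, which is what sklearn's estimators expect.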

Training

The code below uses sklearn's random forest for classification. A random forest grows each decision tree from a random subset of the features and samples, then aggregates the trees' votes into the final prediction. The code treats the first column as the label and the remaining n columns as features, and trains 100 randomly grown decision trees on the data.

Predict and write out the results

Implementation

import csv

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Data cleanup
# TRAIN DATA
train_df = pd.read_csv('train.csv', header=0)  # Load the train file into a dataframe

# Convert all strings to integer classifiers and fill in the missing values.

# female = 0, male = 1
train_df['Gender'] = train_df['Sex'].map({'female': 0, 'male': 1}).astype(int)

# Embarked from 'C', 'Q', 'S'
# Note this is not ideal: in translating categories to numbers,
# Port "2" is not 2 times greater than Port "1", etc.

# All missing Embarked -> just make them embark from the most common place
if train_df['Embarked'].isnull().any():
    train_df.loc[train_df['Embarked'].isnull(), 'Embarked'] = \
        train_df['Embarked'].dropna().mode().iloc[0]

Ports = list(enumerate(np.unique(train_df['Embarked'])))  # all values of Embarked
Ports_dict = {name: i for i, name in Ports}               # dictionary of port : index
train_df['Embarked'] = train_df['Embarked'].map(lambda x: Ports_dict[x]).astype(int)

# All the ages with no data -> use the median of all ages
median_age = train_df['Age'].dropna().median()
if train_df['Age'].isnull().any():
    train_df.loc[train_df['Age'].isnull(), 'Age'] = median_age

# Remove Name, PassengerId, Cabin, Ticket, and Sex (copied and filled into Gender)
train_df = train_df.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'PassengerId'], axis=1)

# TEST DATA
test_df = pd.read_csv('test.csv', header=0)  # Load the test file into a dataframe

# Apply the same transformations so the columns match the training data
# female = 0, male = 1
test_df['Gender'] = test_df['Sex'].map({'female': 0, 'male': 1}).astype(int)

# All missing Embarked -> just make them embark from the most common place,
# then convert the strings to ints with the same dictionary as the training data
if test_df['Embarked'].isnull().any():
    test_df.loc[test_df['Embarked'].isnull(), 'Embarked'] = \
        test_df['Embarked'].dropna().mode().iloc[0]
test_df['Embarked'] = test_df['Embarked'].map(lambda x: Ports_dict[x]).astype(int)

# All the ages with no data -> use the median of all ages
median_age = test_df['Age'].dropna().median()
if test_df['Age'].isnull().any():
    test_df.loc[test_df['Age'].isnull(), 'Age'] = median_age

# All missing fares -> assume the median fare of the passenger's class
if test_df['Fare'].isnull().any():
    median_fare = np.zeros(3)
    for f in range(3):
        median_fare[f] = test_df[test_df['Pclass'] == f + 1]['Fare'].dropna().median()
    for f in range(3):
        test_df.loc[(test_df['Fare'].isnull()) & (test_df['Pclass'] == f + 1),
                    'Fare'] = median_fare[f]

# Collect the test data's PassengerIds before dropping the column
ids = test_df['PassengerId'].values
test_df = test_df.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'PassengerId'], axis=1)

# The data is now ready. Fit on the train set, then predict on the test set.
# Convert back to numpy arrays: column 0 is the label, the rest are features
train_data = train_df.values
test_data = test_df.values

print('Training...')
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(train_data[:, 1:], train_data[:, 0])

print('Predicting...')
output = forest.predict(test_data).astype(int)

with open('myfirstforest.csv', 'w', newline='') as predictions_file:
    writer = csv.writer(predictions_file)
    writer.writerow(['PassengerId', 'Survived'])
    writer.writerows(zip(ids, output))

print('Done.')

Further Thoughts

This is a simple but fairly complete end-to-end solution. It still leaves some open questions, for example:

The predictions' accuracy and recall are never evaluated

Could the model's parameters be tuned further for better results?

Would applying further ensemble-learning techniques improve the results?
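As a sketch of how the first two points could be addressed, sklearn's cross_val_score estimates held-out accuracy and GridSearchCV searches over parameters. The snippet below runs on synthetic stand-in data, since the real train.csv is not included here:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the preprocessed Titanic features:
# the label is mostly determined by the first feature, plus a little noise
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] + 0.1 * rng.rand(200) > 0.5).astype(int)

# Estimate accuracy with 5-fold cross-validation instead of trusting the training fit
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                         X, y, cv=5)
print(scores.mean())

# Search over the number of trees and the tree depth
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {'n_estimators': [50, 100], 'max_depth': [3, None]},
                    cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

The same pattern applied to the real training data would give an honest accuracy estimate before submitting, and the parameter grid can be widened to cover max_features, min_samples_leaf and so on.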

