

Kaggle Starter Competition: Titanic Survival Prediction (XGBoost Approach)


Source: https://www.missshi.cn/api/view/blog/5a06a441e519f50d0400035e

This article walks through, in detail, how to use the xgboost approach to predict which passengers survived the sinking of the Titanic.
The implementation is in Python.

Evaluation Criteria

Our goal is to predict which passengers aboard the Titanic survived.

The evaluation metric is prediction accuracy.

The required submission format is as follows:
a CSV file containing 418 data rows plus one header row.
Each row has two columns: the first is the passenger ID, and the second indicates whether the passenger survived (1 for survived, 0 otherwise).

For example:

PassengerId,Survived
892,0
893,1
894,0
Etc.
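For illustration only (this sketch is not part of the original walkthrough), producing a file in this format with pandas is straightforward; passenger_ids and predictions below are hypothetical placeholders standing in for the real test IDs and model outputs:

import pandas as pd

# Hypothetical placeholders: real values come from the test set and a trained model
passenger_ids = [892, 893, 894]
predictions = [0, 1, 0]

submission = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': predictions})
submission.to_csv('submission.csv', index=False)  # no index column, per Kaggle's format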

Dataset

https://pan.baidu.com/s/1pxgXW4s075j7zLWQpeoc4w

The dataset's input features include the following fields:

  • Survived: whether the passenger survived (1 = survived, 0 = died)
  • Pclass: ticket class, one of 1, 2, or 3
  • Name: passenger name
  • Age: passenger age
  • SibSp: number of siblings/spouses aboard
  • Parch: number of parents/children aboard
  • Ticket: ticket number
  • Fare: ticket fare
  • Cabin: cabin number
  • Embarked: port of embarkation
Importing Third-Party Libraries

First, let's look at which third-party libraries we need to import to tackle this problem with xgboost:

# Load in our libraries
import pandas as pd
import numpy as np
import re
import sklearn
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
# Jupyter magic: render matplotlib plots inline
%matplotlib inline

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

import warnings
warnings.filterwarnings('ignore')

# Going to use these 5 base models for the stacking
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import KFold

Here, numpy and pandas are the most widely used third-party libraries for numerical computation and data analysis.
re is the regular-expression library.
sklearn is a dedicated machine-learning library.
matplotlib, seaborn, and plotly are Python plotting libraries.
xgboost is the Python package implementing the XGBoost algorithm.

Feature Analysis and Extraction

With traditional machine-learning algorithms, we first need to analyze the internal structure of the data and identify its informative structural features.

# Load in the train and test datasets
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

# Store our passenger ID for easy access
PassengerId = test['PassengerId']

train.head(3)

After reading the CSV files with pandas, the first three rows of the training set look like this:

full_data = [train, test]

# Feature engineering: derive additional features from the existing columns
# 1. Name length
train['Name_length'] = train['Name'].apply(len)
test['Name_length'] = test['Name'].apply(len)

# 2. Whether the passenger has a Cabin record (a missing value is parsed as float NaN)
train['Has_Cabin'] = train['Cabin'].apply(lambda x: 0 if type(x) == float else 1)
test['Has_Cabin'] = test['Cabin'].apply(lambda x: 0 if type(x) == float else 1)

# 3. Total family size aboard
for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

# 4. Whether the passenger traveled alone
for dataset in full_data:
    dataset['IsAlone'] = 0
    # loc[row condition, column to set]
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

# 5. Fill missing values in the Embarked (port of embarkation) column
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')

# 6. Fill all missing Fare values with the median Fare of the training set
for dataset in full_data:
    dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median())
train['CategoricalFare'] = pd.qcut(train['Fare'], 4)  # four quantile bins

# 7. Create a new feature CategoricalAge, filling missing ages with random
#    integers drawn from [mean - std, mean + std]
for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
    dataset.loc[np.isnan(dataset['Age']), 'Age'] = age_null_random_list
    dataset['Age'] = dataset['Age'].astype(int)
train['CategoricalAge'] = pd.cut(train['Age'], 5)

# 8. Helper to extract the Title (Mr, Miss, ...) from a passenger's name
def get_title(name):
    title_search = re.search(r'([A-Za-z]+)\.', name)
    # If a title exists, extract and return it
    if title_search:
        return title_search.group(1)
    return ""

# 9. Create a new feature Title
for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)

# 10. Map rare titles to a common "Rare" category and normalize variants
for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

# 11. Map string-valued columns to integers
for dataset in full_data:
    # Map Sex to 0/1
    dataset['Sex'] = dataset['Sex'].map({'female': 0, 'male': 1}).astype(int)
    # Map Title to 0-5
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
    # Map Embarked to 0-2
    dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
    # Bucket Fare into four classes 0-3, using the quantile boundaries from step 6
    dataset.loc[dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
    # Bucket Age into five classes 0-4
    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[dataset['Age'] > 64, 'Age'] = 4
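A brief note on the two binning calls above: pd.qcut splits by sample quantiles, so each bin holds roughly the same number of rows, while pd.cut splits the value range into equal-width intervals. A minimal sketch for illustration (the fares series is a made-up example):

import pandas as pd

fares = pd.Series([5, 8, 10, 15, 30, 40, 80, 200])
print(pd.qcut(fares, 4))  # quantile bins: ~2 values per bin, uneven widths
print(pd.cut(fares, 4))   # equal-width bins over [5, 200], uneven counts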

Next, we need to remove the features we cannot use directly:

# 12. Drop features that cannot be used directly
drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp']
train = train.drop(drop_elements, axis=1)
train = train.drop(['CategoricalAge', 'CategoricalFare'], axis=1)
test = test.drop(drop_elements, axis=1)

At this point, we have engineered, processed, and filtered the features.

Next, a few simple visualizations of the current data will help us analyze it further.

print(train.head(3))


Now, let's look at the correlations between these features:

colormap = plt.cm.viridis
plt.figure(figsize=(14, 12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(train.astype(float).corr(), linewidths=0.1, vmax=1.0,
            square=True, cmap=colormap, linecolor='white', annot=True)

The correlation heatmap is shown below:

The Pearson correlation coefficient measures how close two variables are to a straight-line relationship; it quantifies the strength of the linear relationship between interval variables.

The closer the coefficient is to 1 (or -1), the stronger the positive (or negative) linear correlation between two features; the closer it is to 0, the weaker their linear relationship.
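Concretely, the coefficient is the covariance of the two variables divided by the product of their standard deviations: r = cov(x, y) / (std(x) * std(y)). A minimal numpy check of this (illustrative values only):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])

# Pearson r = cov(x, y) / (std(x) * std(y)), using the sample (ddof=1) estimators
r_manual = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
r_builtin = np.corrcoef(x, y)[0, 1]  # same value via numpy's built-in helper
print(r_manual, r_builtin)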

Computing Other Parameters

Next, we compute a few parameters that will be used during the subsequent training:

ntrain = train.shape[0]
ntest = test.shape[0]
SEED = 0    # for reproducibility
NFOLDS = 5  # set folds for out-of-fold prediction
# KFold from sklearn.model_selection takes n_splits; call kf.split(X) later to iterate folds
kf = KFold(n_splits=NFOLDS, shuffle=True, random_state=SEED)
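As a side note, each element yielded by kf.split is a (train_indices, test_indices) pair of index arrays; a tiny illustrative sketch:

import numpy as np
from sklearn.model_selection import KFold

demo = np.arange(10)
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(demo)):
    print(fold, train_idx, test_idx)  # each sample lands in exactly one test fold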

Classifier Wrappers

Next, we wrap the sklearn classifiers so we can call them uniformly later:

class SklearnHelper(object):
    def __init__(self, clf, seed=0, params=None):
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)

    def fit(self, x, y):
        return self.clf.fit(x, y)

    def feature_importances(self, x, y):
        # Return the importances (rather than only printing them)
        # so the feature-importance DataFrame below can use the values
        return self.clf.fit(x, y).feature_importances_

Similarly, we define a helper that produces out-of-fold (OOF) predictions for any of the wrapped models: each model predicts every training row using only folds it was not trained on, which prevents label leakage when those predictions are later fed to the second-level XGBoost model:

def get_oof(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))

    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]

        clf.train(x_tr, y_tr)

        # Predict the held-out fold and the full test set
        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)

    # Average the test-set predictions over all folds
    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)

Next, we build five first-level models for classification:

Model Construction

First, we set the parameters for each model:

## Model parameters
# 1. Random Forest
rf_params = {
    'n_jobs': -1,
    'n_estimators': 500,
    'warm_start': True,
    # 'max_features': 0.2,
    'max_depth': 6,
    'min_samples_leaf': 2,
    'max_features': 'sqrt',
    'verbose': 0
}
# 2. Extra Trees
et_params = {
    'n_jobs': -1,
    'n_estimators': 500,
    # 'max_features': 0.5,
    'max_depth': 8,
    'min_samples_leaf': 2,
    'verbose': 0
}
# 3. AdaBoost
ada_params = {
    'n_estimators': 500,
    'learning_rate': 0.75
}
# 4. Gradient Boosting
gb_params = {
    'n_estimators': 500,
    'max_features': 0.2,
    'max_depth': 5,
    'min_samples_leaf': 2,
    'verbose': 0
}
# 5. SVM
svc_params = {
    'kernel': 'linear',
    'C': 0.025
}

Now, we create the model objects from these parameters:

# Build the five base models
rf = SklearnHelper(clf=RandomForestClassifier, seed=SEED, params=rf_params)
et = SklearnHelper(clf=ExtraTreesClassifier, seed=SEED, params=et_params)
ada = SklearnHelper(clf=AdaBoostClassifier, seed=SEED, params=ada_params)
gb = SklearnHelper(clf=GradientBoostingClassifier, seed=SEED, params=gb_params)
svc = SklearnHelper(clf=SVC, seed=SEED, params=svc_params)

Next, we convert the data into the numpy array format the models expect:

# Prepare numpy arrays for training
y_train = train['Survived'].ravel()
train = train.drop(['Survived'], axis=1)
x_train = train.values  # Creates an array of the train data
x_test = test.values    # Creates an array of the test data

Next, we run each of the five base models through get_oof to produce out-of-fold predictions; these first-level predictions will later serve as the input features for the second-level XGBoost model:

et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test)     # Extra Trees
rf_oof_train, rf_oof_test = get_oof(rf, x_train, y_train, x_test)     # Random Forest
ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test)  # AdaBoost
gb_oof_train, gb_oof_test = get_oof(gb, x_train, y_train, x_test)     # Gradient Boost
svc_oof_train, svc_oof_test = get_oof(svc, x_train, y_train, x_test)  # Support Vector Classifier

print("Training is complete")

Next, we extract the feature importances from each model:

rf_features = rf.feature_importances(x_train, y_train)
et_features = et.feature_importances(x_train, y_train)
ada_features = ada.feature_importances(x_train, y_train)
gb_features = gb.feature_importances(x_train, y_train)


The collected importance values are organized as follows:

cols = train.columns.values
# Create a dataframe with features
feature_dataframe = pd.DataFrame({
    'features': cols,
    'Random Forest feature importances': rf_features,
    'Extra Trees feature importances': et_features,
    'AdaBoost feature importances': ada_features,
    'Gradient Boost feature importances': gb_features
})

These are much easier to read as plots:

# The four scatter plots are identical except for the column plotted,
# so we loop over the models instead of repeating the code
importance_columns = [
    ('Random Forest feature importances', 'Random Forest Feature Importance'),
    ('Extra Trees feature importances', 'Extra Trees Feature Importance'),
    ('AdaBoost feature importances', 'AdaBoost Feature Importance'),
    ('Gradient Boost feature importances', 'Gradient Boosting Feature Importance'),
]

for column, title in importance_columns:
    trace = go.Scatter(
        y=feature_dataframe[column].values,
        x=feature_dataframe['features'].values,
        mode='markers',
        marker=dict(
            sizemode='diameter',
            sizeref=1,
            size=25,
            color=feature_dataframe[column].values,
            colorscale='Portland',
            showscale=True
        ),
        text=feature_dataframe['features'].values
    )
    layout = go.Layout(
        autosize=True,
        title=title,
        hovermode='closest',
        yaxis=dict(
            title='Feature Importance',
            ticklen=5,
            gridwidth=2
        ),
        showlegend=False
    )
    fig = go.Figure(data=[trace], layout=layout)
    py.iplot(fig, filename='scatter2010')





Next, we add a new column to the data holding the mean of these importance values:

# axis=1 computes the mean row-wise; numeric_only skips the string 'features' column
feature_dataframe['mean'] = feature_dataframe.mean(axis=1, numeric_only=True)
# Check the current data format:
feature_dataframe.head(3)


Let's view each feature's mean importance as a bar chart:

y = feature_dataframe['mean'].values
x = feature_dataframe['features'].values
data = [go.Bar(
    x=x,
    y=y,
    width=0.5,
    marker=dict(
        color=feature_dataframe['mean'].values,
        colorscale='Portland',
        showscale=True,
        reversescale=False
    ),
    opacity=0.6
)]

layout = go.Layout(
    autosize=True,
    title='Barplots of Mean Feature Importance',
    hovermode='closest',
    yaxis=dict(
        title='Feature Importance',
        ticklen=5,
        gridwidth=2
    ),
    showlegend=False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='bar-direct-labels')


Let's take a look at the first-level training predictions:

base_predictions_train = pd.DataFrame({
    'RandomForest': rf_oof_train.ravel(),
    'ExtraTrees': et_oof_train.ravel(),
    'AdaBoost': ada_oof_train.ravel(),
    'GradientBoost': gb_oof_train.ravel()
})
base_predictions_train.head()


Finally, let's look at the correlations between these four models' predictions. The less correlated the base models' outputs are with one another, the more the second-level model stands to gain from combining them:

data = [go.Heatmap(
    z=base_predictions_train.astype(float).corr().values,
    x=base_predictions_train.columns.values,
    y=base_predictions_train.columns.values,
    colorscale='Viridis',
    showscale=True,
    reversescale=True
)]
py.iplot(data, filename='labelled-heatmap')


Finally, we train the second-level XGBoost model on the out-of-fold predictions, predict on the test set, and generate the CSV file to upload:

x_train = np.concatenate((et_oof_train, rf_oof_train, ada_oof_train, gb_oof_train, svc_oof_train), axis=1)
x_test = np.concatenate((et_oof_test, rf_oof_test, ada_oof_test, gb_oof_test, svc_oof_test), axis=1)

gbm = xgb.XGBClassifier(
    # learning_rate=0.02,
    n_estimators=2000,
    max_depth=4,
    min_child_weight=2,
    # gamma=1,
    gamma=0.9,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=-1,
    scale_pos_weight=1
).fit(x_train, y_train)
predictions = gbm.predict(x_test)

# Generate Submission File
StackingSubmission = pd.DataFrame({
    'PassengerId': PassengerId,
    'Survived': predictions
})
StackingSubmission.to_csv("StackingSubmission.csv", index=False)
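Optionally, before uploading you can estimate the stacker's accuracy locally; a quick sketch (an illustrative addition, not part of the original walkthrough) that reuses the second-level x_train/y_train arrays built above:

from sklearn.model_selection import cross_val_score

# 5-fold accuracy estimate of the second-level XGBoost model
scores = cross_val_score(gbm, x_train, y_train, cv=5, scoring='accuracy')
print("CV accuracy: %.4f (+/- %.4f)" % (scores.mean(), scores.std()))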

Once it finishes running, we can see that the StackingSubmission.csv file has been generated successfully.

This file is exactly in the result format the Kaggle competition requires!

How about it? Do you have a bit of a feel for Kaggle competitions now? Go join one!
