Machine Learning (Part 2): How to Reach the Top 2% on Kaggle


This is an original article; if you repost it, please place the following notice at the top.
Reposted from 技术世界; original link: http://www.jasongj.com/ml/classification/

Abstract

This article describes in detail how to reach the top 2% in Kaggle's Titanic survival-prediction competition, a classification problem, through data preview, exploratory data analysis, missing-value imputation, removal of correlated features, and derivation of new features.

Competition Overview

Titanic survival prediction is one of the competitions with the most participants on Kaggle. It asks contestants to analyze, from the training set, what kinds of people were more likely to survive, and to predict whether every passenger in the test set survived.

This is a binary classification problem.

How to Reach the Top 2%

Loading the Data

Before loading the data, load all the R libraries that will be used later:

library(readr)            # File read / write
library(ggplot2)          # Data visualization
library(ggthemes)         # Data visualization
library(scales)           # Data visualization
library(plyr)
library(stringr)          # String manipulation
library(InformationValue) # IV / WOE calculation
library(MLmetrics)        # Machine learning metrics, e.g. Recall, Precision, Accuracy, AUC
library(rpart)            # Decision tree utils
library(randomForest)     # Random Forest
library(dplyr)            # Data manipulation
library(e1071)            # SVM
library(Amelia)           # Missing value utils
library(party)            # Conditional inference trees
library(gbm)              # AdaBoost
library(class)            # KNN

Load the training data and test data into data frames named train and test:

train <- read_csv("train.csv")
test <- read_csv("test.csv")

The same transformations must later be applied to both the training data and the test data. To avoid duplicated work and inconsistencies, and to sidestep the problem of categorical variables taking previously unseen levels in the test set, it is best to combine the two data sets and transform them together. (bind_rows fills the Survived column with NA for the test rows.)

data <- bind_rows(train, test)
train.row <- 1:nrow(train)
test.row <- (1 + nrow(train)):(nrow(train) + nrow(test))

Data Preview

First, take a look at the data:

str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1309 obs. of 12 variables:
##  $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int 0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr "male" "female" "female" "female" ...
##  $ Age        : num 22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int 1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int 0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num 7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr NA "C85" NA "C123" ...
##  $ Embarked   : chr "S" "C" "S" "S" ...

As shown above, the data set contains 12 variables and 1309 observations, of which 891 are training data and 418 are test data:

  • PassengerId: integer; the passenger's ID. A monotonically increasing index with no predictive value.
  • Survived: integer; whether the passenger survived (0 = perished, 1 = survived). More convenient to handle after conversion to a factor.
  • Pclass: integer; the passenger's socio-economic class (1 = Upper, 2 = Middle, 3 = Lower).
  • Name: character; besides the surname and given name, it contains Western-culture titles such as Mr., Mrs., and Dr.
  • Sex: character; the passenger's sex. Suitable for conversion to a factor.
  • Age: numeric; the passenger's age. Has missing values.
  • SibSp: integer; the number of siblings and spouses aboard (Sib = sibling, Sp = spouse).
  • Parch: integer; the number of parents and children aboard (Par = parent, Ch = child).
  • Ticket: character; the passenger's ticket number.
  • Fare: numeric; the ticket fare.
  • Cabin: character; the passenger's cabin. Has missing values.
  • Embarked: character; the port of embarkation. Suitable for conversion to a factor.

Exploratory Data Analysis

The higher the passenger's class, the higher the survival rate

Before analyzing the first variable, Pclass, convert the target variable Survived to a factor:

data$Survived <- factor(data$Survived)

The survivors and victims within each Pclass can then be counted and plotted as follows:

ggplot(data = data[1:nrow(train),], mapping = aes(x = Pclass, y = ..count.., fill=Survived)) +
  geom_bar(stat = "count", position='dodge') +
  xlab('Pclass') +
  ylab('Count') +
  ggtitle('How Pclass impacts survival') +
  scale_fill_manual(values=c("#FF0000", "#00FF00")) +
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), vjust=-0.5) +
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

The figure shows that most passengers with Pclass = 1 survived, close to half of those with Pclass = 2 survived, and fewer than 25% of those with Pclass = 3 survived.

To quantify the predictive value of Pclass, compute its WOE and IV as follows. The results show that the IV of Pclass is 0.5, rated "Highly Predictive", so Pclass can tentatively be used as one of the model's feature variables.
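The WOETable and IV calls below come from the InformationValue package. For reference, they follow the standard definitions: for each category i, WOE_i = ln(PCT_G_i / PCT_B_i), and IV = the sum over i of (PCT_G_i - PCT_B_i) * WOE_i, where PCT_G_i and PCT_B_i are the shares of survivors and victims falling into category i. A minimal sketch that reproduces these numbers by hand (woe.iv is a hypothetical helper, not part of any package):

woe.iv <- function(x, y) {
  # x: categorical feature; y: factor with levels "0" (perished) and "1" (survived)
  tab <- table(x, y)
  pct.g <- tab[, "1"] / sum(tab[, "1"])   # share of survivors per category
  pct.b <- tab[, "0"] / sum(tab[, "0"])   # share of victims per category
  woe <- log(pct.g / pct.b)
  list(WOE = woe, IV = sum((pct.g - pct.b) * woe))
}
woe.iv(factor(data$Pclass[1:nrow(train)]), data$Survived[1:nrow(train)])
# e.g. the IV component for Pclass = 1 is (0.3977 - 0.1457) * 1.0039 = 0.2529,
# matching the first row of the WOETable output below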

WOETable(X=factor(data$Pclass[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
##   CAT GOODS BADS TOTAL     PCT_G     PCT_B        WOE         IV
## 1   1   136   80   216 0.3976608 0.1457195  1.0039160 0.25292792
## 2   2    87   97   184 0.2543860 0.1766849  0.3644848 0.02832087
## 3   3   119  372   491 0.3479532 0.6775956 -0.6664827 0.21970095
IV(X=factor(data$Pclass[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
## [1] 0.5009497
## attr(,"howgood")
## [1] "Highly Predictive"

Survival rates differ across Titles

Passenger names are almost all unique, so the Name variable cannot be used directly. Names do, however, contain culturally meaningful titles such as Mr., Mrs., and Dr., which can be extracted.

This article extracts the passenger's Title from the name as follows:

data$Title <- sapply(data$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})
data$Title <- sub(' ', '', data$Title)
data$Title[data$Title %in% c('Mme', 'Mlle')] <- 'Mlle'
data$Title[data$Title %in% c('Capt', 'Don', 'Major', 'Sir')] <- 'Sir'
data$Title[data$Title %in% c('Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady'
data$Title <- factor(data$Title)
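To see what the strsplit call produces, take the first passenger's name as an example:

strsplit("Braund, Mr. Owen Harris", split='[,.]')[[1]]
## [1] "Braund"        " Mr"           " Owen Harris"

Element [2] is " Mr", and the subsequent sub(' ', '', ...) call strips the leading space, leaving "Mr".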

With the Title extracted, count the survivors and victims for each Title:

ggplot(data = data[1:nrow(train),], mapping = aes(x = Title, y = ..count.., fill=Survived)) +
  geom_bar(stat = "count", position='stack') +
  xlab('Title') +
  ylab('Count') +
  ggtitle('How Title impacts survival') +
  scale_fill_discrete(name="Survived", breaks=c(0, 1), labels=c("Perish", "Survived")) +
  geom_text(stat = "count", aes(label = ..count..), position=position_stack(vjust = 0.5)) +
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

The figure shows that very few passengers with Title Mr survived, while large shares of those with Title Mrs or Miss survived. Again, WOE and IV quantify whether Title is useful for the final prediction: its IV is 1.487853, rated "Highly Predictive", so Title can tentatively be used as a feature variable.

WOETable(X=data$Title[1:nrow(train)], Y=data$Survived[1:nrow(train)])
##       CAT GOODS BADS TOTAL       PCT_G       PCT_B         WOE           IV
## 1     Col     1    1     2 0.002873563 0.001808318  0.46315552 4.933741e-04
## 2      Dr     3    4     7 0.008620690 0.007233273  0.17547345 2.434548e-04
## 3    Lady     2    1     3 0.005747126 0.001808318  1.15630270 4.554455e-03
## 4  Master    23   17    40 0.066091954 0.030741410  0.76543639 2.705859e-02
## 5    Miss   127   55   182 0.364942529 0.099457505  1.30000942 3.451330e-01
## 6    Mlle     3    3     3 0.008620690 0.005424955  0.46315552 1.480122e-03
## 7      Mr    81  436   517 0.232758621 0.788426763 -1.22003757 6.779360e-01
## 8     Mrs    99   26   125 0.284482759 0.047016275  1.80017883 4.274821e-01
## 9      Ms     1    1     1 0.002873563 0.001808318  0.46315552 4.933741e-04
## 10    Rev     6    6     6 0.017241379 0.010849910  0.46315552 2.960244e-03
## 11    Sir     2    3     5 0.005747126 0.005424955  0.05769041 1.858622e-05
IV(X=data$Title[1:nrow(train)], Y=data$Survived[1:nrow(train)])
## [1] 1.487853
## attr(,"howgood")
## [1] "Highly Predictive"

Women survived at a far higher rate than men

For the Sex variable, the historical background of the Titanic's sinking, where evacuation followed the "women and children first" rule, suggests that Sex should help predict survival.

The data confirm this: most women survived (233/(233+81) = 74.20%), while only a small share of men did (109/(109+468) = 18.89%).
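These percentages can also be read straight from a row-normalized contingency table (a quick check on the training rows only):

prop.table(table(data$Sex[1:nrow(train)], data$Survived[1:nrow(train)]), margin = 1)
# Each row sums to 1; the "1" column gives the survival rate per sex
# (roughly 0.74 for female and 0.19 for male)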

data$Sex <- as.factor(data$Sex)
ggplot(data = data[1:nrow(train),], mapping = aes(x = Sex, y = ..count.., fill=Survived)) +
  geom_bar(stat = 'count', position='dodge') +
  xlab('Sex') +
  ylab('Count') +
  ggtitle('How Sex impacts survival') +
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), vjust=-0.5) +
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

Computing WOE and IV shows that the IV of Sex is 1.34, rated "Highly Predictive", so Sex can tentatively be used as a feature variable.

WOETable(X=data$Sex[1:nrow(train)], Y=data$Survived[1:nrow(train)])
##      CAT GOODS BADS TOTAL     PCT_G    PCT_B        WOE        IV
## 1 female   233   81   314 0.6812865 0.147541  1.5298770 0.8165651
## 2   male   109  468   577 0.3187135 0.852459 -0.9838327 0.5251163
IV(X=data$Sex[1:nrow(train)], Y=data$Survived[1:nrow(train)])
## [1] 1.341681
## attr(,"howgood")
## [1] "Highly Predictive"

Minors survived at a higher rate than adults

Given the same "women and children first" rule, minors should have been more likely to survive. As the figure below shows, among passengers with Age < 18 the survivors indeed outnumber the victims, while among young and middle-aged adults the victims far outnumber the survivors.

ggplot(data = data[(!is.na(data$Age)) & row(data[, 'Age']) <= 891, ], aes(x = Age, color=Survived)) +
  geom_line(stat = 'bin', binwidth=5) +
  labs(title = "How Age impacts survival", x = "Age", y = "Count", fill = "Survived")

Passengers with a moderate number of siblings/spouses were more likely to survive

For the SibSp variable, count the survivors and victims separately:

ggplot(data = data[1:nrow(train),], mapping = aes(x = SibSp, y = ..count.., fill=Survived)) +
  geom_bar(stat = 'count', position='dodge') +
  labs(title = "How SibSp impacts survival", x = "SibSp", y = "Count", fill = "Survived") +
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), vjust=-0.5) +
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

The figure shows that passengers with SibSp = 0 survived at a rate below 1/3, passengers with SibSp of 1 or 2 survived at over 50%, and passengers with SibSp of 3 or more rarely survived. WOE and IV quantify SibSp's contribution to the prediction: its IV is 0.1448994, rated "Highly Predictive".

WOETable(X=as.factor(data$SibSp[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
##   CAT GOODS BADS TOTAL       PCT_G       PCT_B        WOE          IV
## 1   0   210  398   608 0.593220339 0.724954463 -0.2005429 0.026418349
## 2   1   112   97   209 0.316384181 0.176684882  0.5825894 0.081387334
## 3   2    13   15    28 0.036723164 0.027322404  0.2957007 0.002779811
## 4   3     4   12    16 0.011299435 0.021857923 -0.6598108 0.006966604
## 5   4     3   15    18 0.008474576 0.027322404 -1.1706364 0.022063953
## 6   5     5    5     5 0.014124294 0.009107468  0.4388015 0.002201391
## 7   8     7    7     7 0.019774011 0.012750455  0.4388015 0.003081947
IV(X=as.factor(data$SibSp[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
## [1] 0.1448994
## attr(,"howgood")
## [1] "Highly Predictive"

Passengers with 1 to 3 parents/children were more likely to survive

For the Parch variable, count the survivors and victims separately:

ggplot(data = data[1:nrow(train),], mapping = aes(x = Parch, y = ..count.., fill=Survived)) +
  geom_bar(stat = 'count', position='dodge') +
  labs(title = "How Parch impacts survival", x = "Parch", y = "Count", fill = "Survived") +
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), vjust=-0.5) +
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

The figure shows that passengers with Parch = 0 survived at a rate below 1/3, passengers with Parch of 1 to 3 survived at over 50%, and passengers with Parch of 4 or more rarely survived. WOE and IV quantify Parch's contribution to the prediction: its IV is 0.1166611, rated "Highly Predictive".

WOETable(X=as.factor(data$Parch[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
##   CAT GOODS BADS TOTAL       PCT_G       PCT_B        WOE          IV
## 1   0   233  445   678 0.671469741 0.810564663 -0.1882622 0.026186312
## 2   1    65   53   118 0.187319885 0.096539162  0.6628690 0.060175728
## 3   2    40   40    80 0.115273775 0.072859745  0.4587737 0.019458440
## 4   3     3    2     5 0.008645533 0.003642987  0.8642388 0.004323394
## 5   4     4    4     4 0.011527378 0.007285974  0.4587737 0.001945844
## 6   5     1    4     5 0.002881844 0.007285974 -0.9275207 0.004084922
## 7   6     1    1     1 0.002881844 0.001821494  0.4587737 0.000486461
IV(X=as.factor(data$Parch[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
## [1] 0.1166611
## attr(,"howgood")
## [1] "Highly Predictive"

Passengers with a FamilySize of 2 to 4 were more likely to survive

Both SibSp and Parch show the same pattern: passengers with no relatives aboard survived at a low rate, passengers with a few relatives survived at over 50%, and passengers with too many relatives again survived at a low rate. This suggests combining them by adding SibSp, Parch, and 1 (for the passenger) into a new variable, FamilySize:

data$FamilySize <- data$SibSp + data$Parch + 1
ggplot(data = data[1:nrow(train),], mapping = aes(x = FamilySize, y = ..count.., fill=Survived)) +
  geom_bar(stat = 'count', position='dodge') +
  xlab('FamilySize') +
  ylab('Count') +
  ggtitle('How FamilySize impacts survival') +
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), vjust=-0.5) +
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

Computing FamilySize's WOE and IV gives an IV of 0.3497672, rated "Highly Predictive". The IV of the derived FamilySize is higher than the IVs of SibSp and Parch, so FamilySize can be used as a feature variable (see the side-by-side comparison after the output below).

WOETable(X=as.factor(data$FamilySize[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
##   CAT GOODS BADS TOTAL       PCT_G      PCT_B        WOE           IV
## 1   1   163  374   537 0.459154930 0.68123862 -0.3945249 0.0876175539
## 2   2    89   72   161 0.250704225 0.13114754  0.6479509 0.0774668616
## 3   3    59   43   102 0.166197183 0.07832423  0.7523180 0.0661084057
## 4   4    21    8    29 0.059154930 0.01457195  1.4010615 0.0624634998
## 5   5     3   12    15 0.008450704 0.02185792 -0.9503137 0.0127410643
## 6   6     3   19    22 0.008450704 0.03460838 -1.4098460 0.0368782940
## 7   7     4    8    12 0.011267606 0.01457195 -0.2571665 0.0008497665
## 8   8     6    6     6 0.016901408 0.01092896  0.4359807 0.0026038712
## 9  11     7    7     7 0.019718310 0.01275046  0.4359807 0.0030378497
IV(X=as.factor(data$FamilySize[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
## [1] 0.3497672
## attr(,"howgood")
## [1] "Highly Predictive"
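To see the comparison in one place, the three IVs can also be computed side by side (a sketch using the same IV() call as above; as.numeric simply strips the "howgood" attribute):

tr <- 1:nrow(train)
sapply(list(SibSp = data$SibSp[tr], Parch = data$Parch[tr], FamilySize = data$FamilySize[tr]),
       function(x) as.numeric(IV(X = as.factor(x), Y = data$Survived[tr])))
##      SibSp      Parch FamilySize
##  0.1448994  0.1166611  0.3497672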

Passengers sharing a ticket number survived at a higher rate

The Ticket variable has very low duplication, so it cannot be used directly. First, count the number of passengers per ticket:

ticket.count <- aggregate(data$Ticket, by = list(data$Ticket), function(x) sum(!is.na(x)))

The hypothesis here is that passengers sharing a ticket number belong to one family and were likely to survive or perish together. Split the passengers into two groups, those holding a unique ticket number and those sharing a ticket number with others, and count each group's survivors and victims:

data$TicketCount <- apply(data, 1, function(x) ticket.count[which(ticket.count[, 1] == x['Ticket']), 2])
data$TicketCount <- factor(sapply(data$TicketCount, function(x) ifelse(x > 1, 'Share', 'Unique')))
ggplot(data = data[1:nrow(train),], mapping = aes(x = TicketCount, y = ..count.., fill=Survived)) +
  geom_bar(stat = 'count', position='dodge') +
  xlab('TicketCount') +
  ylab('Count') +
  ggtitle('How TicketCount impacts survival') +
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), vjust=-0.5) +
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")
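The row-wise apply above scans ticket.count once per passenger. An equivalent, more idiomatic alternative is dplyr's add_count (a sketch assuming dplyr >= 0.8; the column name TicketCount2 is mine, chosen to avoid clobbering the column just created):

data <- data %>% add_count(Ticket, name = "TicketCount2")   # passengers per ticket number
data$TicketCount2 <- factor(ifelse(data$TicketCount2 > 1, 'Share', 'Unique'))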

The figure shows that only 130/(130+351) = 27% of passengers with a unique ticket number survived, versus 212/(212+198) = 51.7% of those sharing a ticket number. TicketCount's WOE and IV follow; its IV is 0.2751882, rated "Highly Predictive".

WOETable(X=data$TicketCount[1:nrow(train)], Y=data$Survived[1:nrow(train)])
##      CAT GOODS BADS TOTAL    PCT_G     PCT_B        WOE        IV
## 1  Share   212  198   410 0.619883 0.3606557  0.5416069 0.1403993
## 2 Unique   130  351   481 0.380117 0.6393443 -0.5199641 0.1347889
IV(X=data$TicketCount[1:nrow(train)], Y=data$Survived[1:nrow(train)])
## [1] 0.2751882
## attr(,"howgood")
## [1] "Highly Predictive"

The higher the fare, the higher the survival rate

For the Fare variable, the figure below shows that survival rate increases with Fare:

ggplot(data = data[(!is.na(data$Fare)) & row(data[, 'Fare']) <= 891, ], aes(x = Fare, color=Survived)) +
  geom_line(stat = 'bin', binwidth=10) +
  labs(title = "How Fare impacts survival", x = "Fare", y = "Count", fill = "Survived")

Survival rates differ across cabin sections

Cabin values begin with a letter followed by digits. The hypothesis is that the letter identifies a section of the ship and the digits a position within that section, much as a train ticket carries both a carriage number and a seat number. So extract the first letter of Cabin and compute the survival rate for each letter:

ggplot(data[1:nrow(train), ], mapping = aes(x = as.factor(sapply(data$Cabin[1:nrow(train)], function(x) str_sub(x, start = 1, end = 1))), y = ..count.., fill = Survived)) +
  geom_bar(stat = 'count', position='dodge') +
  xlab('Cabin') +
  ylab('Count') +
  ggtitle('How Cabin impacts survival') +
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), vjust=-0.5) +
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

The figure shows that passengers whose cabin letter is B, C, D, E, or F survived at over 50%, while passengers in the remaining sections survived at rates well below 50%. The WOE and IV of Cabin are computed below: its IV is 0.1866526, rated "Highly Predictive".

data$Cabin <- sapply(data$Cabin, function(x) str_sub(x, start = 1, end = 1))
WOETable(X=as.factor(data$Cabin[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
##   CAT GOODS BADS TOTAL      PCT_G      PCT_B        WOE          IV
## 1   A     7    8    15 0.05109489 0.11764706 -0.8340046 0.055504815
## 2   B    35   12    47 0.25547445 0.17647059  0.3699682 0.029228917
## 3   C    35   24    59 0.25547445 0.35294118 -0.3231790 0.031499197
## 4   D    25    8    33 0.18248175 0.11764706  0.4389611 0.028459906
## 5   E    24    8    32 0.17518248 0.11764706  0.3981391 0.022907100
## 6   F     8    5    13 0.05839416 0.07352941 -0.2304696 0.003488215
## 7   G     2    2     4 0.01459854 0.02941176 -0.7004732 0.010376267
## 8   T     1    1     1 0.00729927 0.01470588 -0.7004732 0.005188134
IV(X=as.factor(data$Cabin[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
## [1] 0.1866526
## attr(,"howgood")
## [1] "Highly Predictive"

Passengers who embarked at S survived at a lower rate

The Embarked variable is the port of embarkation. Compute the survival rate per port to judge whether Embarked can help predict survival:

ggplot(data[1:nrow(train), ], mapping = aes(x = Embarked, y = ..count.., fill = Survived)) +
  geom_bar(stat = 'count', position='dodge') +
  xlab('Embarked') +
  ylab('Count') +
  ggtitle('How Embarked impacts survival') +
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), vjust=-0.5) +
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

The figure shows that passengers with Embarked = S survived at only 217/(217+427) = 33.7%, while passengers with Embarked = C or with a missing Embarked survived at over 50%. Embarked therefore looks usable for predicting survival; its WOE and IV follow.

WOETable(X=as.factor(data$Embarked[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
##   CAT GOODS BADS TOTAL      PCT_G     PCT_B        WOE           IV
## 1   C    93   75   168 0.27352941 0.1366120  0.6942642 9.505684e-02
## 2   Q    30   47    77 0.08823529 0.0856102  0.0302026 7.928467e-05
## 3   S   217  427   644 0.63823529 0.7777778 -0.1977338 2.759227e-02
IV(X=as.factor(data$Embarked[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
## [1] 0.1227284
## attr(,"howgood")
## [1] "Highly Predictive"

The computation gives an IV of 0.1227284, rated "Highly Predictive".

Imputing Missing Values

Listing all missing data

attach(data)
missing <- list(Pclass=nrow(data[is.na(Pclass), ]))
missing$Name <- nrow(data[is.na(Name), ])
missing$Sex <- nrow(data[is.na(Sex), ])
missing$Age <- nrow(data[is.na(Age), ])
missing$SibSp <- nrow(data[is.na(SibSp), ])
missing$Parch <- nrow(data[is.na(Parch), ])
missing$Ticket <- nrow(data[is.na(Ticket), ])
missing$Fare <- nrow(data[is.na(Fare), ])
missing$Cabin <- nrow(data[is.na(Cabin), ])
missing$Embarked <- nrow(data[is.na(Embarked), ])
for (name in names(missing)) {
  if (missing[[name]][1] > 0) {
    print(paste('', name, ' miss ', missing[[name]][1], ' values', sep = ''))
  }
}
detach(data)
## [1] "Age miss 263 values" ## [1] "Fare miss 1 values" ## [1] "Cabin miss 1014 values" ## [1] "Embarked miss 2 values"

Predicting passenger ages

263 passengers lack an age, too many to fill with the median or the mean. Large gaps like this are usually filled either by predicting the missing values from the other variables or by assigning a default value; here the missing ages are predicted from the other variables:

age.model <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize, data=data[!is.na(data$Age), ], method='anova')
data$Age[is.na(data$Age)] <- predict(age.model, data[is.na(data$Age), ])
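A quick sanity check (a sketch) confirms that no missing ages remain and that the imputed distribution stays plausible:

sum(is.na(data$Age))   # should now be 0
summary(data$Age)      # the median should remain near the observed median of about 28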

Inferring the missing Embarked values from the Fare median

As shown below, both passengers with a missing Embarked have Pclass = 1 and Fare = 80.

data[is.na(data$Embarked), c('PassengerId', 'Pclass', 'Fare', 'Embarked')]
## # A tibble: 2 × 4
##   PassengerId Pclass  Fare Embarked
##         <int>  <int> <dbl>    <chr>
## 1          62      1    80     <NA>
## 2         830      1    80     <NA>

The boxplot below shows that the median Fare of passengers with Embarked = C and Pclass = 1 is exactly 80.

ggplot(data[!is.na(data$Embarked),], aes(x=Embarked, y=Fare, fill=factor(Pclass))) +
  geom_boxplot() +
  geom_hline(aes(yintercept=80), color='red', linetype='dashed', lwd=2) +
  scale_y_continuous(labels=dollar_format()) +
  theme_few()

The missing Embarked values can therefore be set to 'C':

data$Embarked[is.na(data$Embarked)] <- 'C'
data$Embarked <- as.factor(data$Embarked)

Filling the one missing Fare value with the median

Since only one record lacks a Fare, the mean or the median can fill it directly; here the median of Fare is used:

data$Fare[is.na(data$Fare)] <- median(data$Fare, na.rm=TRUE)

Setting missing Cabin values to a default

Too many records lack Cabin for median or mean filling, and Cabin is hard to predict from the other variables. Since treating NA as its own category in the earlier section already produced a fairly high IV, the missing Cabin values are simply set to a default level:

data$Cabin <- as.factor(sapply(data$Cabin, function(x) ifelse(is.na(x), 'X', str_sub(x, start = 1, end = 1))))

Training the Model

set.seed(415)
model <- cforest(Survived ~ Pclass + Title + Sex + Age + SibSp + Parch + FamilySize + TicketCount + Fare + Cabin + Embarked, data = data[train.row, ], controls=cforest_unbiased(ntree=2000, mtry=3))
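party also provides permutation-based variable importance for cforest models, which can be used to sanity-check the chosen features (a sketch; with ntree=2000 this call can take a while):

sort(varimp(model), decreasing = TRUE)   # higher = more important to the forest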

Cross-Validation

Normally the training data should be split into a training part and a validation part, or k-fold cross-validation should be used. This article instead trains on all of the training data and then randomly samples 30% of it for validation (see the k-fold sketch after the output below).

cv.summarize <- function(data.true, data.predict) {
  print(paste('Recall:', Recall(data.true, data.predict)))
  print(paste('Precision:', Precision(data.true, data.predict)))
  print(paste('Accuracy:', Accuracy(data.predict, data.true)))
  print(paste('AUC:', AUC(data.predict, data.true)))
}
set.seed(415)
cv.test.sample <- sample(1:nrow(train), as.integer(0.3 * nrow(train)), replace = TRUE)
cv.test <- data[cv.test.sample,]
cv.prediction <- predict(model, cv.test, OOB=TRUE, type = "response")
cv.summarize(cv.test$Survived, cv.prediction)
## [1] "Recall: 0.947976878612717" ## [1] "Precision: 0.841025641025641" ## [1] "Accuracy: 0.850187265917603" ## [1] "AUC: 0.809094822285082"

Prediction

predict.result <- predict(model, data[(1+nrow(train)):(nrow(data)), ], OOB=TRUE, type = "response")
output <- data.frame(PassengerId = test$PassengerId, Survived = predict.result)
write.csv(output, file = "cit1.csv", row.names = FALSE)

This model scores 0.80383 on Kaggle, ranking 992nd, in the top 992/6292 = 15.8%.

Tuning

Removing correlated features

Since FamilySize combines the information in SibSp and Parch, try removing SibSp and Parch from the feature variables:

set.seed(415)
model <- cforest(Survived ~ Pclass + Title + Sex + Age + FamilySize + TicketCount + Fare + Cabin + Embarked, data = data[train.row, ], controls=cforest_unbiased(ntree=2000, mtry=3))
predict.result <- predict(model, data[test.row, ], OOB=TRUE, type = "response")
submit <- data.frame(PassengerId = test$PassengerId, Survived = predict.result)
write.csv(submit, file = "cit2.csv", row.names = FALSE)

This model's Kaggle score remains 0.80383.

Removing Cabin, whose IV is low

Since Cabin's IV is comparatively low, consider removing it from the model:

set.seed(415)
model <- cforest(Survived ~ Pclass + Title + Sex + Age + FamilySize + TicketCount + Fare + Embarked, data = data[train.row, ], controls=cforest_unbiased(ntree=2000, mtry=3))
predict.result <- predict(model, data[test.row, ], OOB=TRUE, type = "response")
submit <- data.frame(PassengerId = test$PassengerId, Survived = predict.result)
write.csv(submit, file = "cit3.csv", row.names = FALSE)

This model's Kaggle score still remains 0.80383.

Adding derived features

The Title variable was already derived from Name above. A passenger's surname may also have some predictive value, for the following reasons:

  • In some Western countries, given names repeat often while surnames repeat rarely, so a surname has some identifying power
  • In some countries, a surname signals social status
  • Passengers sharing a surname are likely one family (again because surnames repeat rarely), and a family tends to survive or perish together

A surname that appears only once cannot occur in both the training set and the test set, so it has no identifying or predictive value; all surnames that appear only once are therefore renamed 'Small':

data$Surname <- sapply(data$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][1]})
data$FamilyID <- paste(as.character(data$FamilySize), data$Surname, sep="")
data$FamilyID[data$FamilySize <= 2] <- 'Small'
# Delete erroneous family IDs
famIDs <- data.frame(table(data$FamilyID))
famIDs <- famIDs[famIDs$Freq <= 2,]
data$FamilyID[data$FamilyID %in% famIDs$Var1] <- 'Small'
# Convert to a factor
data$FamilyID <- factor(data$FamilyID)
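A quick look at the resulting factor (a sketch) shows how many distinct family IDs survive the collapse, with 'Small' dominating:

length(levels(data$FamilyID))                        # number of distinct family IDs
head(sort(table(data$FamilyID), decreasing = TRUE))  # 'Small' should be by far the largest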
set.seed(415)
model <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + Fare + Embarked + Title + FamilySize + FamilyID + TicketCount, data = data[train.row, ], controls=cforest_unbiased(ntree=2000, mtry=3))
predict.result <- predict(model, data[test.row, ], OOB=TRUE, type = "response")
submit <- data.frame(PassengerId = test$PassengerId, Survived = predict.result)
write.csv(submit, file = "cit4.csv", row.names = FALSE)

This model scores 0.82297 on Kaggle, ranking 207th, in the top 207/6292 = 3.3%.

Other Experiments

Experiments show that filling the missing Embarked values with the most frequent port, S, rather than C, improves the score. The theoretical basis for this choice is weak, and the score is only a Public leaderboard score rather than a final one, so it does not prove this method superior to others. It is not recommended here, only offered as a possible idea for reference:

data$Embarked[c(62,830)] = "S"
data$Embarked <- factor(data$Embarked)

set.seed(415)
model <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + Fare + Embarked + Title + FamilySize + FamilyID + TicketCount, data = data[train.row, ], controls=cforest_unbiased(ntree=2000, mtry=3))
predict.result <- predict(model, data[test.row, ], OOB=TRUE, type = "response")
submit <- data.frame(PassengerId = test$PassengerId, Survived = predict.result)
write.csv(submit, file = "cit5.csv", row.names = FALSE)

This model scores 0.82775 on Kaggle, ranking 114th, in the top 114/6292 = 1.8%.

Summary

This article has walked through, in detail, how to reach the top 2% in Kaggle's Titanic survival-prediction classification competition through data preview, exploratory data analysis, missing-value imputation, removal of correlated features, and derivation of new features.

Coming Next

The next article will focus on the general approach and methods for solving engineering problems with machine learning.
