Drawing calibration curves with tidymodels

Published: 2023/12/14

This article was first published on the WeChat public account 醫學和生信筆記 (Medicine and Bioinformatics Notes).

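醫學和生信筆記 focuses on using R in clinical medicine, along with data analysis and visualization in R. Topics include medical statistics, meta-analysis, network pharmacology, clinical prediction models, machine learning, and bioinformatics, all done in R.

Many people have started using tidymodels, but many have not yet realized that tidymodels currently has no one-step function for drawing calibration curves. mlr3, a framework of the same kind, does not support it either. Both projects say the feature is in development, and after more than a year it is still not finished.

You can comment on the relevant issues in each project's GitHub repository to get the developers' attention.

Overall, in the field of clinical prediction models, separate specialized R packages are still more convenient for now, especially for time-dependent survival data, where tidymodels and mlr3 cannot yet meet everyone's needs.

Still, many readers want to draw calibration curves with these two packages, and it can be done quite easily. As covered many times before, a calibration curve is just a scatter plot: the x-axis is the predicted probability and the y-axis is the observed probability (the axes can also be swapped). If this is unclear, read this earlier post first: "Evaluating clinical prediction models in one article" (一文搞定臨床預測模型評價).

Today's post covers drawing calibration curves with tidymodels, which was also touched on in an earlier post: "Evaluating and comparing multiple models with tidymodels" (使用tidymodels完成多個模型評價和比較).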
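To make the "calibration curve is a scatter plot" idea concrete before touching tidymodels, here is a minimal base R sketch. It uses simulated data rather than the article's dataset; the names `p_hat`, `y`, and `bin`, and the choice of 10 equal-width bins, are assumptions made only for illustration.

```r
# Simulated example (not the article's data): p_hat is a predicted probability
# and y is a 0/1 outcome drawn with exactly that probability, so the
# "model" is perfectly calibrated by construction.
set.seed(1)
n     <- 5000
p_hat <- runif(n)
y     <- rbinom(n, size = 1, prob = p_hat)

# Cut the predictions into 10 equal-width bins and average within each bin.
bin       <- cut(p_hat, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
mean_pred <- tapply(p_hat, bin, mean)  # x-axis: mean predicted probability
mean_obs  <- tapply(y, bin, mean)      # y-axis: observed event proportion

# The calibration curve is these points plotted against the diagonal.
plot(mean_pred, mean_obs, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Mean predicted probability", ylab = "Observed proportion")
abline(0, 1, lty = 2)  # perfect calibration
```

Because the outcome is simulated from the predicted probabilities, the points should fall close to the dashed diagonal; a real model will deviate from it where it is over- or under-confident.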

Loading the data and R packages

Install any packages you do not have yet.

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(tidymodels))
tidymodels_prefer()

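A demonstration needs a reasonably good dataset to make the point. The dataset used today has a binary outcome variable.

It has 91,976 rows and 26 columns. play_type is the outcome variable, stored as a factor; the remaining columns are predictors.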

all_plays <- read_rds("../000files/all_plays.rds")
glimpse(all_plays)
## Rows: 91,976
## Columns: 26
## $ game_id                    <dbl> 2017090700, 2017090700, 2017090700, 2017090…
## $ posteam                    <chr> "NE", "NE", "NE", "NE", "NE", "NE", "NE", "…
## $ play_type                  <fct> pass, pass, run, run, pass, run, pass, pass…
## $ yards_gained               <dbl> 0, 8, 8, 3, 19, 5, 16, 0, 2, 7, 0, 3, 10, 0…
## $ ydstogo                    <dbl> 10, 10, 2, 10, 7, 10, 5, 2, 2, 10, 10, 10, …
## $ down                       <ord> 1, 2, 3, 1, 2, 1, 2, 1, 2, 1, 1, 2, 3, 1, 2…
## $ game_seconds_remaining     <dbl> 3595, 3589, 3554, 3532, 3506, 3482, 3455, 3…
## $ yardline_100               <dbl> 73, 73, 65, 57, 54, 35, 30, 2, 2, 75, 32, 3…
## $ qtr                        <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ posteam_score              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 7, 7, 7, 7…
## $ defteam                    <chr> "KC", "KC", "KC", "KC", "KC", "KC", "KC", "…
## $ defteam_score              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0…
## $ score_differential         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, -7, 7, 7, 7, 7, …
## $ shotgun                    <fct> 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0…
## $ no_huddle                  <fct> 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ posteam_timeouts_remaining <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ defteam_timeouts_remaining <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ wp                         <dbl> 0.5060180, 0.4840546, 0.5100098, 0.5529816,…
## $ goal_to_go                 <fct> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0…
## $ half_seconds_remaining     <dbl> 1795, 1789, 1754, 1732, 1706, 1682, 1655, 1…
## $ total_runs                 <dbl> 0, 0, 0, 1, 2, 2, 3, 3, 3, 0, 4, 4, 4, 5, 5…
## $ total_pass                 <dbl> 0, 1, 2, 2, 2, 3, 3, 4, 5, 0, 5, 6, 7, 7, 8…
## $ previous_play              <fct> First play of Drive, pass, pass, run, run, …
## $ in_red_zone                <fct> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1…
## $ in_fg_range                <fct> 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1…
## $ two_min_drill              <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Splitting the data

75% of the data goes to the training set and the rest to the test set.

set.seed(20220520)

# Split the data, stratified by play_type
split_pbp <- initial_split(all_plays, prop = 0.75, strata = play_type)

train_data <- training(split_pbp) # training set
test_data <- testing(split_pbp) # test set
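A quick way to see what `strata` does is to compare class proportions before and after the split. The sketch below uses a small simulated data frame (`toy`, with made-up columns `x` and `y`), not the article's `all_plays` data, so treat it only as an illustration of the check.

```r
library(rsample)

# Hypothetical toy data: a numeric predictor and a two-class outcome.
set.seed(1)
toy <- data.frame(
  x = rnorm(1000),
  y = factor(sample(c("pass", "run"), 1000, replace = TRUE, prob = c(0.6, 0.4)))
)

# Same call pattern as above: 75% for training, stratified by the outcome.
toy_split <- initial_split(toy, prop = 0.75, strata = y)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# With stratification the class proportions should be nearly identical
# in the full data, the training set, and the test set.
prop.table(table(toy$y))
prop.table(table(toy_train$y))
prop.table(table(toy_test$y))
```

The same three `prop.table()` calls can be run on `all_plays`, `train_data`, and `test_data` to confirm that the pass/run balance was preserved.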

Data preprocessing

The recipes package is used for preprocessing. If you have studied caret carefully, this package should feel familiar.

pbp_rec <- recipe(play_type ~ ., data = train_data) %>%
  step_rm(half_seconds_remaining, yards_gained, game_id) %>% # remove these 3 columns
  step_string2factor(posteam, defteam) %>% # convert to factors
  # update_role(yards_gained, game_id, new_role = "ID") %>%
  # remove highly correlated variables
  step_corr(all_numeric(), threshold = 0.7) %>%
  step_center(all_numeric()) %>% # center
  step_zv(all_predictors()) # remove zero-variance variables
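A recipe only describes the steps; nothing is computed until it is prepped. To inspect what such a recipe actually does, one option is to call `prep()` and `bake()` by hand. The following is a sketch on simulated data (the data frame `toy` and its columns `a`, `b`, and `team` are invented for this example), using a shortened version of the steps above.

```r
library(recipes)

# Hypothetical toy data: b is almost a copy of a, so step_corr() should
# drop one of the two; team is a character column.
set.seed(1)
toy <- data.frame(
  y    = factor(sample(c("pass", "run"), 200, replace = TRUE)),
  a    = rnorm(200, mean = 5),
  team = sample(c("NE", "KC"), 200, replace = TRUE),
  stringsAsFactors = FALSE
)
toy$b <- toy$a * 2 + rnorm(200, sd = 0.01)

toy_rec <- recipe(y ~ ., data = toy) %>%
  step_string2factor(team) %>%
  step_corr(all_numeric(), threshold = 0.7) %>%
  step_center(all_numeric()) %>%
  step_zv(all_predictors())

# prep() estimates what each step needs from the training data;
# bake(new_data = NULL) returns the processed training data.
toy_prepped <- prep(toy_rec, training = toy)
toy_baked   <- bake(toy_prepped, new_data = NULL)

names(toy_baked)  # only one of a / b survives the correlation filter
```

Inside a workflow this happens automatically during `fit()`, so the manual `prep()`/`bake()` calls are only needed for inspection.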

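Building the model

A random forest, a commonly used model, serves as the demonstration. Hyperparameter tuning is skipped here, because tuned results are not necessarily better than the defaults.

Specify the random forest and build the workflow: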

rf_spec <- rand_forest(mode = "classification") %>%
  set_engine("ranger", importance = "permutation")
rf_wflow <- workflow() %>%
  add_recipe(pbp_rec) %>%
  add_model(rf_spec)

Fit the model on the training set:

fit_rf <- rf_wflow %>%
  fit(train_data)

Model evaluation

Apply the model to the test set:

pred_rf <- test_data %>% select(play_type) %>%
  bind_cols(predict(fit_rf, test_data, type = "prob")) %>%
  bind_cols(predict(fit_rf, test_data, type = "class"))

This pred_rf is the basis for everything that follows, so it is very important.

head(pred_rf)
## # A tibble: 6 × 4
##   play_type .pred_pass .pred_run .pred_class
##   <fct>          <dbl>     <dbl> <fct>
## 1 pass           0.312     0.688 run
## 2 pass           0.829     0.171 pass
## 3 pass           0.806     0.194 pass
## 4 pass           0.678     0.322 pass
## 5 run            0.184     0.816 run
## 6 run            0.544     0.456 pass

Check model performance:

Most metrics you have heard of, and some you may not have, are available:

metricsets <- metric_set(accuracy, mcc, f_meas, j_index)

pred_rf %>% metricsets(truth = play_type, estimate = .pred_class)
## # A tibble: 4 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.731
## 2 mcc      binary         0.441
## 3 f_meas   binary         0.774
## 4 j_index  binary         0.439

Confusion matrix:

pred_rf %>% conf_mat(truth = play_type, estimate = .pred_class)
##           Truth
## Prediction  pass   run
##       pass 10622  3226
##       run   2962  6185
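The four metrics reported earlier can be recomputed by hand from this confusion matrix, which is a useful sanity check. The sketch below copies the four counts from the output above and applies the standard formulas; the variable names are chosen for this example only.

```r
# Counts copied from the confusion matrix above. "pass" is the first factor
# level, which yardstick treats as the event of interest by default.
tp <- 10622  # predicted pass, truly pass
fp <- 3226   # predicted pass, truly run
fn <- 2962   # predicted run,  truly pass
tn <- 6185   # predicted run,  truly run

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)

# Youden's J, F1 score, and Matthews correlation coefficient.
j_index <- sensitivity + specificity - 1
f_meas  <- 2 * precision * sensitivity / (precision + sensitivity)
mcc     <- (tp * tn - fp * fn) /
  sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

round(c(accuracy = accuracy, mcc = mcc, f_meas = f_meas, j_index = j_index), 3)
```

Rounded to three decimals, these agree with the accuracy, mcc, f_meas, and j_index values returned by the metric set.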

Confusion matrix as a plot:

pred_rf %>%
  conf_mat(play_type, .pred_class) %>%
  autoplot()
[Figure: confusion matrix plot]

Everyone's favorite, the AUC:

pred_rf %>% roc_auc(truth = play_type, .pred_pass)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.799
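The call above passes `.pred_pass` because yardstick by default treats the first factor level ("pass" here) as the event. As a reminder of what the number means, the AUC equals the probability that a randomly chosen event gets a higher score than a randomly chosen non-event. The base R sketch below checks this on simulated scores; `truth`, `score`, and the normal distributions are assumptions for illustration, not the article's predictions.

```r
# Hypothetical scores: events ("pass") tend to score higher than non-events.
set.seed(1)
truth <- factor(sample(c("pass", "run"), 500, replace = TRUE),
                levels = c("pass", "run"))
score <- ifelse(truth == "pass", rnorm(500, mean = 1), rnorm(500, mean = 0))

pos <- score[truth == "pass"]  # scores of the event class
neg <- score[truth == "run"]   # scores of the other class

# AUC as the share of (event, non-event) pairs ranked correctly.
auc_pairs <- mean(outer(pos, neg, ">"))

# The same quantity from the rank-sum (Mann-Whitney) formula.
auc_rank <- (sum(rank(score)[truth == "pass"]) -
               length(pos) * (length(pos) + 1) / 2) /
  (length(pos) * length(neg))

c(auc_pairs = auc_pairs, auc_rank = auc_rank)
```

Both formulas give the same value; if the probability column of the second level were passed instead, the result would be one minus this AUC.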

Now the visualizations, starting with the familiar ROC curve:

pred_rf %>% roc_curve(truth = play_type, .pred_pass) %>%
  autoplot()
[Figure: ROC curve]

PR curve:

pred_rf %>% pr_curve(truth = play_type, .pred_pass) %>%
  autoplot()
[Figure: precision-recall curve]

gain_curve:

pred_rf %>% gain_curve(truth = play_type, .pred_pass) %>%
  autoplot()
[Figure: gain curve]

lift_curve:

pred_rf %>% lift_curve(truth = play_type, .pred_pass) %>%
  autoplot()
[Figure: lift curve]

The only thing missing is a calibration curve!

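Calibration curve

Below, a calibration curve is drawn by hand.

There are two approaches. They differ little; the main difference is how the predictions are grouped. The second grouping method is the one you usually see.

If you still do not see why a calibration curve is a scatter plot, read up on the basics first: "Evaluating clinical prediction models in one article" (一文搞定臨床預測模型的評價). It is worth the time.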

# Method 1: group rows by the predicted probability rounded to 2 decimals
calibration_df <- pred_rf %>%
  mutate(pass = if_else(play_type == "pass", 1, 0),
         pred_rnd = round(.pred_pass, 2)
         ) %>%
  group_by(pred_rnd) %>%
  summarize(mean_pred = mean(.pred_pass), # mean predicted probability
            mean_obs = mean(pass), # observed proportion of pass
            n = n()
            )

ggplot(calibration_df, aes(mean_pred, mean_obs)) +
  geom_point(aes(size = n), alpha = 0.5) +
  geom_abline(linetype = "dashed") +
  theme_minimal()
[Figure: calibration plot, method 1 (point size shows the number of rows per group)]

Method 2:

# Sort by predicted probability, then cut into groups of equal size.
# The 22,995 test rows give 249 groups of 92 plus a final group of 87.
cali_df <- pred_rf %>%
  arrange(.pred_pass) %>%
  mutate(pass = if_else(play_type == "pass", 1, 0),
         group = c(rep(1:249, each = 92), rep(250, 87))
         ) %>%
  group_by(group) %>%
  summarise(mean_pred = mean(.pred_pass),
            mean_obs = mean(pass)
            )

cali_plot <- ggplot(cali_df, aes(mean_pred, mean_obs)) +
  geom_point(alpha = 0.5) +
  geom_abline(linetype = "dashed") +
  theme_minimal()

cali_plot
[Figure: calibration plot, method 2]
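The group sizes in method 2 are hard-coded for this particular test set. One possible alternative, shown here as a sketch rather than as the author's method, is `dplyr::ntile()`, which assigns rows to a chosen number of near-equal-sized groups for any number of rows. The data below are simulated (`toy_pred` is a made-up stand-in for `pred_rf`), and 20 groups is an arbitrary choice.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for pred_rf: outcomes are simulated from the
# predicted probabilities, so calibration is good by construction.
set.seed(1)
toy_pred <- tibble(
  .pred_pass = runif(2000),
  play_type  = factor(ifelse(rbinom(2000, 1, .pred_pass) == 1, "pass", "run"),
                      levels = c("pass", "run"))
)

cali_toy <- toy_pred %>%
  mutate(pass  = if_else(play_type == "pass", 1, 0),
         group = ntile(.pred_pass, 20)) %>%  # 20 groups ordered by prediction
  group_by(group) %>%
  summarise(mean_pred = mean(.pred_pass),
            mean_obs  = mean(pass),
            n         = n())

ggplot(cali_toy, aes(mean_pred, mean_obs)) +
  geom_point(alpha = 0.5) +
  geom_abline(linetype = "dashed") +
  coord_equal(xlim = c(0, 1), ylim = c(0, 1)) +
  theme_minimal()
```

Applied to `pred_rf`, `ntile(.pred_pass, 250)` would give groups of 91 or 92 rows, which is close to, but not identical with, the manual grouping above.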

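The two methods differ little, and both show good calibration. In other words, a good model looks good whichever method you use; if your data are poor, the result will most likely look poor whichever method you use.

Finally, random forests can compute variable importance, and the result can of course be visualized.

Here is a quick demonstration of how to plot variable importance for the random forest: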

library(vip)

fit_rf %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)
[Figure: variable importance plot, top 10 features]
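If you want the importance values as numbers rather than a plot, they are stored on the underlying ranger object. The sketch below fits ranger directly on the built-in `iris` data (an assumption for illustration; it is not the article's model) with the same `importance = "permutation"` setting used in the model specification.

```r
library(ranger)

# Hypothetical example on iris, not the article's play-by-play data.
set.seed(1)
rf_iris <- ranger(Species ~ ., data = iris, importance = "permutation")

# A named numeric vector with one permutation importance per predictor.
imp <- sort(rf_iris$variable.importance, decreasing = TRUE)
imp

# For the fitted workflow in this article, the ranger object should be
# reachable with extract_fit_engine(fit_rf), so the analogous call would be:
# extract_fit_engine(fit_rf)$variable.importance
```

Permutation importance depends on the random seed, so the exact values change slightly from run to run.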

So, have you learned how to draw a calibration curve?

If you have questions, leave a comment!

Join the group to get the example data for free!
