當(dāng)前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

R语言回归中的Hosmer-Lemeshow以及calibration curve校正曲线

發(fā)布時間：2024/3/24 编程问答 54 豆豆

生活随笔收集整理的這篇文章主要介紹了 R语言回归中的Hosmer-Lemeshow以及calibration curve校正曲线小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

Hosmer-Lemeshow test

詳細(xì)的Hosmer-Lemeshow test使用方法的介紹請參考：看這里

calibration curve是評價模型的一個重要指標(biāo)，不信的話請看文章1，文章2，總之很重要的！！！

最近需要畫校正曲線，但網(wǎng)上的相關(guān)資料實在是太少了，花了點時間研究了一下，跟大家分享一下如何使用Hosmer-Lemeshow test來繪制calibration curve，（p.s. 都是個人理解，可能有錯，歡迎討論）:

首先，按照方法介紹中的樣例，我們生成一下數(shù)據(jù)：

set.seed(43657) n <- 100 x <- rnorm(n) xb <- x pr <- exp(xb)/(1+exp(xb)) y <- 1*(runif(n) < pr)

這里的x是變量，y相當(dāng)于標(biāo)簽（0，1），這里做的是二分類的。然后我們放到glm函數(shù)中fit一下：

mod <- glm(y~x, family=binomial)

然后，將結(jié)果y和模型擬合概率傳遞給hoslem.test函數(shù)，選擇g = 10組。在這之前，要記得先安裝并加載package，安裝方法為install.paskages(‘ResourceSelection’)，加載方法為library(ResourceSelection)。

hl <- hoslem.test(mod$y, fitted(mod), g=10) hl

這里的mod$y和 fitted(mod)分別表示標(biāo)簽和預(yù)測概率，可以換成自己的數(shù)據(jù)。hl輸出為：

Hosmer and Lemeshow goodness of fit (GOF) testdata: mod$y, fitted(mod) X-squared = 7.4866, df = 8, p-value = 0.4851

我們還可以從hl對象中獲得一個觀察到的與預(yù)期的表：

cbind(hl$observed,hl$expected)

輸出為

y0 y1 yhat0 yhat1 [0.0868,0.219] 8 2 8.259898 1.740102 (0.219,0.287] 7 3 7.485661 2.514339 (0.287,0.329] 7 3 6.968185 3.031815 (0.329,0.421] 8 2 6.194245 3.805755 (0.421,0.469] 5 5 5.510363 4.489637 (0.469,0.528] 4 6 4.983951 5.016049 (0.528,0.589] 5 5 4.521086 5.478914 (0.589,0.644] 2 8 3.833244 6.166756 (0.644,0.713] 6 4 3.285271 6.714729 (0.713,0.913] 1 9 1.958095 8.041905

其中y0, y1表示觀察的結(jié)果，yhat0，yhat2表示預(yù)期的結(jié)果。為了理解這些數(shù)字都代表些什么，我們手動計算一下，首先，我們計算模型的預(yù)測概率，然后根據(jù)預(yù)測概率的十分位數(shù)對觀察結(jié)果進行分組（這里其實就是將模型預(yù)測的概率進行排序，然后將數(shù)據(jù)劃分成10組）：

pihat <- mod$fitted pihatcat <- cut(pihat, breaks=c(0,quantile(pihat, probs=seq(0.1,0.9,0.1)),1), labels=FALSE)

接下來，我們循環(huán)遍歷第1組到第10組，計算觀察到的0和1的數(shù)量，并計算期望的0和1的數(shù)量。
這里提一下，上面我們選擇了g=10，那么根據(jù)模型擬合出來的概率mod$fitted和十分位數(shù)，我們將所有數(shù)據(jù)分為10組，下面是本例中10組的分組情況，可以看到每個數(shù)據(jù)在哪個組別：

10 10 9 8 2 9 2 1 6 5 2 1 1 10 5 6 4 4 8 3 6 3 9 2 9 3 6 5 4 7 4 9 3 5 8 4 7 4 3 2 2 8 6 1 3 5 7 2 7 2 8 7 6 1 1 4 8 9 10 10 1 5 1 4 6 9 10 6 7 3 8 2 2 9 5 10 7 6 5 4 7 6 9 1 10 3 3 8 10 7 5 7 10 1 8 8 4 9 3 5

為了計算期望的0和1的數(shù)量，我們找到每個組中預(yù)測概率的平均值，然后將其乘以組大小，這里為10：

meanprobs <- array(0, dim=c(10,2)) expevents <- array(0, dim=c(10,2)) obsevents <- array(0, dim=c(10,2)) stdprobs <- array(0, dim=c(10,2)) obsprobs <- array(0, dim=c(10,2))for (i in 1:10) {meanprobs[i,1] <- mean(pihat[pihatcat==i])stdprobs[i,1] <- sd(pihat[pihatcat==i])expevents[i,1] <- sum(pihatcat==i)*meanprobs[i,1]obsevents[i,1] <- sum(y[pihatcat==i])obsprobs[i,1] <- sum(y[pihatcat==i])/sum(pihatcat==i)meanprobs[i,2] <- mean(1-pihat[pihatcat==i])stdprobs[i,2] <- sd(1-pihat[pihatcat==i])expevents[i,2] <- sum(pihatcat==i)*meanprobs[i,2]obsevents[i,2] <- sum(1-y[pihatcat==i])obsprobs[i,2] <- 1-(sum(y[pihatcat==i])/sum(pihatcat==i)) }

這里meanprobs 表示每組的平均概率，expevents 表示期望事件，obsevents 表示觀察事件，我加了兩個參數(shù)，方便繪制calibration curve，其中，stdprobs 表示每組的標(biāo)準(zhǔn)差，obsprobs 表示每組的觀察概率。
然后，我們可以通過上面計算得到的表格的10x2單元格中的（觀察到的預(yù)期）^ 2 /預(yù)期的總和來計算Hosmer-Lemeshow檢驗統(tǒng)計量：

hosmerlemeshow <- sum((obsevents-expevents)^2 / expevents) hosmerlemeshow

可以看到輸出為：

[1] 7.486643

和hoslem.test函數(shù)計算出來的一樣對吧。

calibration curve

現(xiàn)在，我們有了觀察的概率表obsprobs，和預(yù)測的平均概率meanprobs ，以及預(yù)測概率的方差stdprobs，可以繪制calibration curves了。

先繪制一個空圖，

yy <- meanprobs[,c(1)] xx <- obsprobs[,c(1)] plot(xx, yy, type = "n", xlab = "Predicted probability", ylab = "Oberserved probability")

然后，添加數(shù)據(jù)，

points(xx, yy, type = "p", pch = 19, col = "seagreen", lty = 1)

通過改變type的值可以選擇繪制散點圖還是折線圖，或者是別的類型的圖。然后，我們在圖上添加誤差棒

plot_error <- function(xx, yy, sd, len = 1, col = "black") {len <- len * 0.05arrows(x0 = xx, y0 = yy, x1 = xx, y1 = yy - sd, col = col, angle = 90, length = len)arrows(x0 = xx, y0 = yy, x1 = xx, y1 = yy + sd, col = col, angle = 90, length = len) } plot_error(xx, yy, sd = stdprobs[,c(1)], col = "seagreen")

最后，加上對角線和legend，就可以繪制出一張校正曲線圖啦。

abline(0,1,lty=3,lwd=1,col=c(rgb(0,0,0,maxColorValue=255)))labs <- c("RANDOM TEST") legend("top", legend = labs, cex = 0.8, lty = 1, lwd = 2, pch = 19, col = c("seagreen"), inset = 0.01, horiz = TRUE, box.col = "white")

P.S 這里使用的是簡單的glm模型來做模型得到的概率，如果有現(xiàn)成的概率和標(biāo)簽，也是可以直接繪制calibration curve的~

總結(jié)

以上是生活随笔為你收集整理的R语言回归中的Hosmer-Lemeshow以及calibration curve校正曲线的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯，歡迎將生活随笔推薦給好友。

上一篇： DotA_Allstars_v6.61b
下一篇：揭开瑞星弑毒软件的丑恶嘴脸——中国杀毒第