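Introduction to and Use of the NHANES Database (Part 2)
The previous post, "Introduction to and Use of the NHANES Database (Part 1)" (Christina, CSDN blog), covered NHANES survey weights and how to download the data. This post shows how to import the data into R, merge the files, and run the subsequent analyses.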
Example 1:
This example reproduces an analysis from a published NHANES report (Brody DJ, Pratt LA, Hughes JP. Prevalence of depression among adults aged 20 and over: United States, 2013-2016. NCHS Data Brief, no 303. Hyattsville, MD: National Center for Health Statistics. 2018.)
1. Load the packages
library(dplyr)
library(survey)
2. Download the data
The files can be downloaded manually from the NHANES website or directly from within R, as below.
# Download & Read SAS Transport Files
# Demographic (DEMO)
download.file("https://wwwn.cdc.gov/nchs/nhanes/2013-2014/DEMO_H.XPT", tf <- tempfile(), mode="wb")
DEMO_H <- foreign::read.xport(tf)[,c("SEQN","RIAGENDR","RIDAGEYR","SDMVSTRA","SDMVPSU","WTMEC2YR")]
download.file("https://wwwn.cdc.gov/nchs/nhanes/2015-2016/DEMO_I.XPT", tf <- tempfile(), mode="wb")
DEMO_I <- foreign::read.xport(tf)[,c("SEQN","RIAGENDR","RIDAGEYR","SDMVSTRA","SDMVPSU","WTMEC2YR")]
# Mental Health - Depression Screener (DPQ)
download.file("http://wwwn.cdc.gov/nchs/nhanes/2013-2014/DPQ_H.XPT", tf <- tempfile(), mode="wb")
DPQ_H <- foreign::read.xport(tf)
download.file("http://wwwn.cdc.gov/nchs/nhanes/2015-2016/DPQ_I.XPT", tf <- tempfile(), mode="wb")
DPQ_I <- foreign::read.xport(tf)
3. Merge the data
# Append Files
DEMO <- bind_rows(DEMO_H, DEMO_I)
DPQ <- bind_rows(DPQ_H, DPQ_I)
# Merge DEMO and DPQ files and create derived variables
One <- left_join(DEMO, DPQ, by="SEQN") %>%
  # Set 7=Refused and 9=Don't Know To Missing for variables DPQ010 thru DPQ090
  mutate_at(vars(DPQ010:DPQ090), ~ifelse(. >=7, NA, .)) %>%
  mutate(. ,
    # create indicator for overall summary
    one = 1,
    # Create depression score as sum of variables DPQ010 -- DPQ090
    Depression.Score = rowSums(select(. , DPQ010:DPQ090)),
    # Create depression indicator as binary 0/100 variable. (is missing if Depression.Score is missing)
    Depression = ifelse(Depression.Score >=10, 100, 0),
    # Create factor variables
    Gender = factor(RIAGENDR, labels=c("Men", "Women")),
    Age.Group = cut(RIDAGEYR, breaks=c(-Inf,19,39,59,Inf), labels=c("Under 20", "20-39","40-59","60 and over")),
    # Generate 4-year MEC weight (Divide weight by 2 because we are appending 2 survey cycles)
    # Note: using the MEC Exam Weights (WTMEC2YR), per the analytic notes on the
    # Mental Health - Depression Screener (DPQ_H) documentation
    WTMEC4YR = WTMEC2YR/2 ,
    # Define indicator for analysis population of interest: adults aged 20 and over with a valid depression score
    inAnalysis = (RIDAGEYR >= 20 & !is.na(Depression.Score))
  ) %>%
  # drop DPQ variables
  select(., -starts_with("DPQ"))
Because two 2-year survey cycles are combined, the weight has to be recomputed: WTMEC4YR = WTMEC2YR/2.
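The weight rescaling can be illustrated with a minimal sketch. The data frame `toy` and all of its numbers are invented for illustration; only the rule "divide the 2-year weight by the number of appended cycles" comes from the code above.

```r
# Toy data: two 2-year cycles, each carrying its own 2-year MEC weight.
# All values are invented for illustration.
toy <- data.frame(
  SEQN     = 1:6,
  cycle    = c("2013-2014", "2013-2014", "2013-2014",
               "2015-2016", "2015-2016", "2015-2016"),
  WTMEC2YR = c(100, 200, 300, 150, 250, 200)
)

# Number of appended cycles (2 here) and the rescaled combined weight
n_cycles <- length(unique(toy$cycle))
toy$WTMEC4YR <- toy$WTMEC2YR / n_cycles

# Each cycle's 2-year weights sum to that cycle's population total
pop_by_cycle <- tapply(toy$WTMEC2YR, toy$cycle, sum)

# The rescaled weights sum to the average of the per-cycle totals,
# so the population is not counted twice
total_4yr <- sum(toy$WTMEC4YR)

print(pop_by_cycle)
print(total_4yr)
```

Without the division, the appended file would represent roughly twice the target population, which distorts any weighted totals.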
4. Define the survey design object
NHANES_all <- svydesign(data=One, id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC4YR, nest=TRUE)
Select the analysis subset:
NHANES <- subset(NHANES_all, inAnalysis)
5. Statistical analysis
Define a helper function that returns the weighted mean, its standard error, and the unweighted sample size:
getSummary <- function(varformula, byformula, design){
  # Get mean, stderr, and unweighted sample size
  c <- svyby(varformula, byformula, design, unwtd.count )
  p <- svyby(varformula, byformula, design, svymean )
  outSum <- left_join(select(c,-se), p)
  outSum
}
Compute the prevalence of depression overall and by stratum:
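#' Overall
getSummary(~Depression, ~one, NHANES)
#' By sex
getSummary(~Depression, ~Gender, NHANES)
#' By age
getSummary(~Depression, ~Age.Group, NHANES)
#' By sex and age
getSummary(~Depression, ~Gender + Age.Group, NHANES)
Note: when working with NHANES, first define the survey design object on the full sample and only then apply subset() to the design. Do not subset the raw data frame before building the design, because the resulting estimates (in particular their standard errors) can be biased.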
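The difference between the two orders of operation can be sketched on toy data. Everything here is invented for illustration (the data frame `toy`, its columns, and the simulated values). One PSU deliberately contains no adults, which is the situation where the two approaches diverge.

```r
library(survey)

# Toy survey: 2 strata x 3 PSUs x 4 respondents (all values invented).
# PSU 3 of stratum 1 contains children only.
set.seed(42)
toy <- data.frame(
  strata = rep(1:2, each = 12),
  psu    = rep(rep(1:3, each = 4), times = 2),
  wt     = runif(24, 50, 150),
  age    = c(25, 40, 61, 15,  33, 52, 70, 18,   8, 10, 12, 16,
             22, 45, 67, 14,  30, 58, 73, 11,  27, 39, 64, 17),
  y      = rnorm(24, mean = 5)
)

# Recommended: build the design on the full sample, then subset the design
des_all    <- svydesign(id = ~psu, strata = ~strata, weights = ~wt,
                        nest = TRUE, data = toy)
des_adults <- subset(des_all, age >= 20)
est_design <- svymean(~y, des_adults)

# Not recommended: filter the data frame first, then build the design.
# The all-children PSU disappears, so the design no longer reflects
# how many PSUs were actually sampled in stratum 1.
des_naive <- svydesign(id = ~psu, strata = ~strata, weights = ~wt,
                       nest = TRUE, data = toy[toy$age >= 20, ])
est_naive <- svymean(~y, des_naive)

print(est_design)
print(est_naive)
```

The point estimates coincide, because both are the same weighted mean over adults. The standard errors can differ, since only the first design still knows about every sampled PSU.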
Example 2: a paper published in Environmental Science & Technology (EST) in 2021
Exposure: chloroform (TCM); bromodichloromethane (BDCM); dibromochloromethane (DBCM); bromoform (TBM)
Outcome: thyroid function (FT4;FT3; TT4;TT3; TPOAb; TgAb)
Exclusion criteria: thyroid disease; prescription medication use; pregnancy; age <20 years
Covariates: demographic data; serum cotinine
Year: 2007-2008
1. Download the data
Download the NHANES files that contain the exposure variables, outcome variables, and covariates used in the paper.
2. Import the data
setwd("C:\\Users\\18896\\Desktop\\NHANSE20211110\\example1")
library(foreign)
DEMO_E <- read.xport("DEMO_E.XPT")
BMX_E <- read.xport("BMX_E.XPT")
MCQ_E <- read.xport("MCQ_E.XPT")
RHQ_E <- read.xport("RHQ_E.XPT")
RXQ_RX_E <- read.xport("RXQ_RX_E.XPT")
# keep the first record per participant
RXQ_RX_E <- subset(RXQ_RX_E, !duplicated(RXQ_RX_E$SEQN))
THYROD_E <- read.xport("THYROD_E.XPT")
VOCMWB_E <- read.xport("VOCMWB_E.XPT")
3. Merge the data
## Combine the files
## Union (full outer join)
data_E <- DEMO_E
data_E <- merge(data_E, MCQ_E, by = "SEQN", all = TRUE)
data_E <- merge(data_E, BMX_E, by = "SEQN", all = TRUE)
data_E <- merge(data_E, RHQ_E, by = "SEQN", all = TRUE)
data_E <- merge(data_E, RXQ_RX_E, by = "SEQN", all = TRUE)
data_E <- merge(data_E, THYROD_E, by = "SEQN", all = TRUE)
data_E <- merge(data_E, VOCMWB_E, by = "SEQN", all = TRUE)
SEQN is the unique respondent identifier. Use all = TRUE when merging. Otherwise merge() keeps only the respondents present in both files and the rest of the sample is lost.
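The effect of the `all` argument, and of files with several rows per respondent, can be checked on two small tables. This is a minimal sketch; the tables `demo`, `lab`, and `rx` and their values are invented for illustration.

```r
# Toy tables keyed by SEQN (all values invented)
demo <- data.frame(SEQN = 1:5, age = c(34, 51, 19, 62, 45))
lab  <- data.frame(SEQN = c(2, 4, 6), TT4 = c(7.8, 8.4, 6.9))

inner <- merge(demo, lab, by = "SEQN")                # default: matches only
full  <- merge(demo, lab, by = "SEQN", all = TRUE)    # union of both tables
left  <- merge(demo, lab, by = "SEQN", all.x = TRUE)  # keep every demo row

n_inner <- nrow(inner)  # 2
n_full  <- nrow(full)   # 6
n_left  <- nrow(left)   # 5

# A file with several rows per respondent duplicates that respondent
# after the merge unless it is reduced to one row per SEQN first
rx      <- data.frame(SEQN = c(1, 1, 2), drug = c("A", "B", "C"))
n_dup   <- nrow(merge(demo, rx, by = "SEQN", all.x = TRUE))      # 6
rx_one  <- rx[!duplicated(rx$SEQN), ]
n_dedup <- nrow(merge(demo, rx_one, by = "SEQN", all.x = TRUE))  # 5

print(c(inner = n_inner, full = n_full, left = n_left,
        dup = n_dup, dedup = n_dedup))
```

Unmatched respondents are kept with NA in the columns from the other file, so the full sample stays available when the survey design is defined.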
4. Rename the variables
Before the analysis, rename the selected variables so they are easier to recognize.
data_new <- plyr::rename(data_E, c(
  RIDAGEYR="age", DMDEDUC2="Education", RIDEXPRG="pregnant.status",
  RIAGENDR="Gender", RIDRETH1="race",
  BMXWT="weight", BMXHT="height", BMXBMI="BMI",
  LBXVBF="Bromoform", LBXVBM="Bromodichloromethane",
  LBXVCF="Chloroform", LBXVCM="Dibromochloromethane",
  LBXT3F="FT3", LBXT4F="FT4", LBXTT3="TT3", LBXTT4="TT4",
  LBXTPO="TPOAb", LBXATG="TgAb",
  RXDUSE="medication", MCQ160M="thyroid.deseases"))
5. Apply the survey weights
library(survey)
# Define the design on the full merged sample
design <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=TRUE, data=data_new)
# Subset the design to the analysis population
design_new <- subset(design, SEQN %in% VOCMWB_E$SEQN & age>=20 & thyroid.deseases!=1 &
  !is.na(FT4) & !is.na(FT3) & !is.na(TT3) & !is.na(TT4) & !is.na(TPOAb) & !is.na(TgAb))
# Same filter on the plain data frame, for the unweighted comparisons
data_2 <- subset(data_new, SEQN %in% VOCMWB_E$SEQN & age>=20 & thyroid.deseases!=1 &
  !is.na(FT4) & !is.na(FT3) & !is.na(TT3) & !is.na(TT4) & !is.na(TPOAb) & !is.na(TgAb))
6. Descriptive analysis
The code below computes the weighted and unweighted mean age and the weighted and unweighted race distribution. The weighted and unweighted results differ considerably, so baseline characteristics should be reported using the weighted results.
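Why the two sets of numbers can diverge is easiest to see on a toy sample in which one group is oversampled. This is a minimal sketch in base R; the data frame `toy`, its group labels, and its weights are invented for illustration.

```r
# Toy sample: group B is deliberately oversampled, so each B respondent
# represents fewer people than each A respondent (all values invented)
toy <- data.frame(
  race = c("A", "A", "B", "B", "B", "B"),
  age  = c(30, 40, 60, 62, 58, 60),
  wt   = c(4000, 4000, 500, 500, 500, 500)
)

# Mean age in the sample vs. mean age in the population it represents
unweighted_mean <- mean(toy$age)
weighted_mean   <- weighted.mean(toy$age, toy$wt)

# Share of each group among respondents vs. in the represented population
unweighted_prop <- prop.table(table(toy$race)) * 100
weighted_prop   <- prop.table(tapply(toy$wt, toy$race, sum)) * 100

print(c(unweighted = unweighted_mean, weighted = weighted_mean))
print(round(unweighted_prop, 1))
print(round(weighted_prop, 1))
```

The unweighted figures describe the respondents, while the weighted figures describe the population the respondents stand for. Base R gives the point estimates only; design-based standard errors still require the survey design object.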
# unweighted mean age
mean(data_2$age, na.rm=T) #49.54916
# weighted mean age and SE
svymean(~age, design_new, na.rm = TRUE) #45.874
#' Proportion of unweighted interview sample
data_2 %>% count(race) %>% mutate(prop= round(n / sum(n)*100, digits=1))
#' Proportion of weighted interview sample
data_2 %>% count(race, wt=WTMEC2YR) %>% mutate(prop= round(n / sum(n)*100, digits=1))
When presenting these results in a paper, the following layout can serve as a reference: [image not available]
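7. svyglm analysis
An ordinary glm ignores the sampling weights, and a glm with a weights argument ignores the strata and PSUs, so both give unreliable inference for NHANES data. The model should instead be fitted with svyglm on the survey design object. The three approaches are compared below.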
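The same comparison can be tried on simulated data without downloading anything. This is a minimal sketch: the data frame `toy`, the variable names, and the simulated effect size are assumptions made for illustration and have no connection to the NHANES results.

```r
library(survey)

# Simulated survey: 4 strata x 2 PSUs x 25 respondents (all values invented)
set.seed(1)
n <- 200
toy <- data.frame(
  strata = rep(1:4, each = 50),
  psu    = rep(rep(1:2, each = 25), times = 4),
  wt     = runif(n, 500, 5000),
  x      = rnorm(n)
)
toy$y <- 2 + 0.5 * toy$x + rnorm(n)

des <- svydesign(id = ~psu, strata = ~strata, weights = ~wt,
                 nest = TRUE, data = toy)

fit_glm  <- glm(y ~ x, family = gaussian(), data = toy)                # no weights
fit_wglm <- glm(y ~ x, family = gaussian(), data = toy, weights = wt)  # weights only
fit_svy  <- svyglm(y ~ x, design = des, family = gaussian())           # full design

# Helper (defined here, not part of any package): standard errors of a fit
get_se <- function(fit) sqrt(diag(vcov(fit)))

# Coefficients: weighted glm and svyglm solve the same weighted least squares
print(cbind(glm = coef(fit_glm), wglm = coef(fit_wglm), svyglm = coef(fit_svy)))

# Standard errors: only svyglm accounts for the strata and PSUs
print(cbind(glm = get_se(fit_glm), wglm = get_se(fit_wglm),
            svyglm = get_se(fit_svy)))
```

In this setting the weighted glm reproduces the svyglm coefficients, so the practical difference between the two lies in the standard errors and the tests built on them.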
# glm
Result2 <- glm(TT4~Bromoform+age+Gender+race+BMI+Education, family = gaussian(), data=data_2)
summary(Result2)
# weighted glm
Result3 <- glm(TT4~Bromoform+age+Gender+race+BMI+Education, family = gaussian(), data=data_2, weights=WTMEC2YR)
summary(Result3)
# survey-weighted glm (the data come from the design object, so no data= argument is needed)
Result1 <- svyglm(TT4~Bromoform+age+Gender+race+BMI+Education, family = gaussian(), design=design_new)
summary(Result1)
References:
Sun Y, Xia PF, Korevaar TI, Mustieles V, Zhang Y, Pan XF, Wang YX, Messerlian C. Relationship between Blood Trihalomethane Concentrations and Serum Thyroid Function Measures in US Adults. Environmental Science & Technology. 2021 Oct 7.
Brody DJ, Pratt LA, Hughes JP. Prevalence of depression among adults aged 20 and over: United States, 2013-2016. NCHS Data Brief, no 303. Hyattsville, MD: National Center for Health Statistics. 2018.
Emecen-Huja P, Li HF, Ebersole JL, Lambert J, Bush H. Epidemiologic evaluation of NHANES for environmental factors and periodontal disease. Scientific Reports. 2019 Jun 3;9(1):1-1.