knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Haoxue Wang, Li Xing, Xuekui Zhang, Igor Burstyn, Paul Gustafson
This package includes key functions in LBC(logistic Box–Cox regression) model and a NHANES Dataset to test the model.
There are three major benefits of using the LBC model. First of all, compared with the logistic regression model, the LBC model has flexibility in estimating the non-linear relationship between the log-odds and exposure via a shape parameter. It can accommodate a wide range of non-linear shapes. Thus, it is superior to the common two-step approach, which adds a data transformation step before fitting a logistic regression model to accommodate any non-linearity in the disease-exposure relationship. Second, the LBC model helps us design future studies. For instance, we can estimate the sample size based on a desired accuracy of the shape parameter estimator, using prior knowledge regarding the strength of association, disease prevalence, exposure skewness, and apparent pattern of non-linearity. Third, as a by-product of the LBC model, the median effect represents the gradient measurement over the entire distribution of the predictor with respect to a generalized scale, which facilitates the interpretation of this non-linear model.
myfit.prof() for estimates the LBC model parameter.
LogLikeFun() for caculating the log-likelihood value for Maximum Likelihood Estimates in LBC model.
ScoreFun() for caculating the gradient value for Maximum Likelihood Estimates in LBC model.
HessFun() for calculating the Hessian matrix.
library(LBC)
firstly we library all the packages we need
library(survey) library(maxLik) #library(ggplot2) library(MASS)
We analyze data from the National Health and Nutrition Examination Survey (NHANES) 2009–2010, which involved 8,893 adults aged 20 and over with complete records in depression and blood mercury measurements. The exposure variable, X, is the total blood mercury in micrograms per litre (𝜇g/L), and the binary outcome for depression, Y, is dichotomized from the score of the Patient Health Questionnaire-9 (PHQ-9) with 0 indicating no depression (a PHQ-9 score ≤ 9) and 1 indicating depression (a PHQ-9 score ≥ 10). We controlled for potential confounding associated with demographic and socioeconomic characteristics and dietary intake, Z, where Z contains the following covariates: age, gender, ethnicity, quintiles of family income-to-poverty ratio, education levels, marital status, smoking status, self-reported drinking status, body mass index, self-reported fish or shellfish intake in the past 30 days.
mydata <- subset(read.csv("Data/n05_08.csv", as.is = T), !is.na(wtdrd14yr)) # returns a value of true and false for each value in a data set. colnames(mydata) <- tolower(colnames(mydata)) # convert the uppercase letters of string to lowercase string head(mydata) # display the first n rows present in the input data dim(mydata) # returns the dimension (e.g. the number of columns and rows) of a matrix # transformed exposure variable mydata$lbxthg.id <- (mydata$lbxthg - 1) mydata$lbxthg.sqrt <- (sqrt(mydata$lbxthg) - 1) / 0.5 mydata$lbxthg.log <- log(mydata$lbxthg) mydata$lbxthg.sq <- ((mydata$lbxthg) ^ 2 - 1) / 2
Because NHANES is conducted using a complex sampling design, the statistical methods based on simple random sampling can introduce bias for population level parameter estimates. Instead we employed sampling-weighted statistics and incorporated sampling weights in the regular models to achieve estimates that should better represent the population (Lumley, 2011)
# fit the sample design mysubdata <- subset(mydata, (ridageyr >= 20) & (!is.na(mydata$depr_bi2)) & (!is.na(mydata$lbxthg))) mysub <- svydesign(id = ~ 1, weights = ~ wtdrd14yr, data = mysubdata) # id:Formula or data frame specifying cluster ids from largest level to smallest level # ~0 or ~1 is a formula for no clusters. # weights: specify sampling weights as an alternative to prob # Alternative definition ##mydesign <- svydesign(id=~1, weights=~wtdrd14yr, data = mydata) ##mysub2 <- subset(mydesign, (ridageyr>=20)&(!is.na(mydata$depr_bi2))&(!is.na(mydata$lbxthg)))
Calculate the weighted mean and the weighted variance
myu <- svymean(~ log(lbxthg), mysub)[1] mysigma <- sqrt(svyvar(~ log(lbxthg), mysub))[1] # the design is mysub myXX = mysub$variables$lbxthg # UpperBound = 2 NoOFIni <- 1000 myseq <- seq(0, UpperBound, length = NoOFIni) myAIC <- rep(NA, NoOFIni) for (ii in 1:NoOFIni) { lambda0 = myseq[ii] if (lambda0 == 0) { myV0 <- log(myXX) } if (lambda0 != 0) { myV0 <- (myXX ^ lambda0 - 1) / lambda0 } mysub$variables$myv0 <- myV0 mylogit <- svyglm( depr_bi2 ~ myv0 + ridageyr + factor(riagendr) + factor(race_cat) + #Demorgraphic Variables factor(pir_cat5) + factor(edu_cat3) + factor(mar_cat) + # SES factor(smoker) + factor(regdrink) + factor(bmi_cat) + # health behavior and health indicator factor(fish30) + factor(shellfish30) + factor(epa_cat3) + factor(dha_cat3) , # dietary consumption design = mysub ) myAIC[ii] <- AIC(mylogit, k = 2)[2] } myID <- order(myAIC)[1] mylambda <- myseq[myID] plot(myseq, myAIC) inits <- rep(NA, length(coef(mylogit)) + 1) inits[1:2] <- coef(mylogit)[1:2] inits[3] <- mylambda inits[4:(length(coef(mylogit)) + 1)] <- coef(mylogit)[-c(1:2)]
Fit a generalised linear model to data from a complex survey design, with inverse-probability weighting and design-based standard errors.
summary( fit1 <- svyglm( depr_bi2 ~ lbxthg.log + ridageyr + factor(riagendr) + factor(race_cat) + #Demorgraphic Variables factor(pir_cat5) + factor(edu_cat3) + factor(mar_cat) + # SES factor(smoker) + factor(regdrink) + factor(bmi_cat) + # health behavior and health indicator factor(fish30) + factor(shellfish30) + factor(epa_cat3) + factor(dha_cat3) , # dietary consumption design = mysub ) ) glm( depr_bi2 ~ lbxthg.log + ridageyr + factor(riagendr) + factor(race_cat) + #Demorgraphic Variables factor(pir_cat5) + factor(edu_cat3) + factor(mar_cat) + # SES factor(smoker) + factor(regdrink) + factor(bmi_cat) + # health behavior and health indicator factor(fish30) + factor(shellfish30) + factor(epa_cat3) + factor(dha_cat3) , weights = wtdrd14yr, data = mysub$variables ) # Fit a generalised linear model to data from a complex survey design, with inverse-probability weighting and design-based standard errors. summary( fit2 <- svyglm( depr_bi2 ~ lbxthg.sqrt + ridageyr + factor(riagendr) + factor(race_cat) + #Demorgraphic Variables factor(pir_cat5) + factor(edu_cat3) + factor(mar_cat) + # SES factor(smoker) + factor(regdrink) + factor(bmi_cat) + # health behavior and health indicator factor(fish30) + factor(shellfish30) + factor(epa_cat3) + factor(dha_cat3) , # dietary consumption design = mysub ) ) summary( fit3 <- svyglm( depr_bi2 ~ lbxthg.id + ridageyr + factor(riagendr) + factor(race_cat) + #Demorgraphic Variables factor(pir_cat5) + factor(edu_cat3) + factor(mar_cat) + # SES factor(smoker) + factor(regdrink) + factor(bmi_cat) + # health behavior and health indicator factor(fish30) + factor(shellfish30) + factor(epa_cat3) + factor(dha_cat3) , # dietary consumption design = mysub ) ) summary( fit4 <- svyglm( depr_bi2 ~ lbxthg.sq + ridageyr + factor(riagendr) + factor(race_cat) + #Demorgraphic Variables factor(pir_cat5) + factor(edu_cat3) + factor(mar_cat) + # SES factor(smoker) + factor(regdrink) + factor(bmi_cat) + # health behavior and health indicator factor(fish30) + factor(shellfish30) + factor(epa_cat3) + factor(dha_cat3) , # dietary consumption design = mysub ) ) AIC(fit1, k = 2) AIC(fit2, k = 2) AIC(fit3, k = 2) AIC(fit4, k = 2) summary(fit1) summary(fit2) summary(fit3) summary(fit4)
handle the data
mylist <- c( "ridageyr", "riagendr", "race_cat", "pir_cat5", "edu_cat3", "mar_cat", "smoker", "regdrink", "bmi_cat", "fish30", "shellfish30", "epa_cat3", "dha_cat3" ) apply(mysub$variables[, mylist[-c(1, 2)]], 2, table) mysub$variables$race_cat1 <- mysub$variables$race_cat mysub$variables$race_cat1[which(mysub$variables$race_cat %in% c(2, 3))] <- 0 mysub$variables$race_cat2 <- mysub$variables$race_cat / 2 mysub$variables$race_cat2[which(mysub$variables$race_cat %in% c(1, 3))] <- 0 mysub$variables$race_cat3 <- mysub$variables$race_cat / 3 mysub$variables$race_cat3[which(mysub$variables$race_cat %in% c(1, 2))] <- 0 mysub$variables$pir_cat5.1 <- mysub$variables$pir_cat5 mysub$variables$pir_cat5.1[which(mysub$variables$pir_cat5 %in% c(2:4))] <- 0 mysub$variables$pir_cat5.2 <- mysub$variables$pir_cat5 / 2 mysub$variables$pir_cat5.2[which(mysub$variables$pir_cat5 %in% c(1, 3, 4))] <- 0 mysub$variables$pir_cat5.3 <- mysub$variables$pir_cat5 / 3 mysub$variables$pir_cat5.3[which(mysub$variables$pir_cat5 %in% c(1, 2, 4))] <- 0 mysub$variables$pir_cat5.4 <- mysub$variables$pir_cat5 / 4 mysub$variables$pir_cat5.4[which(mysub$variables$pir_cat5 %in% c(1:3))] <- 0 mysub$variables$edu_cat3.1 <- mysub$variables$edu_cat3 mysub$variables$edu_cat3.1[which(mysub$variables$edu_cat3 %in% c(2))] <- 0 mysub$variables$edu_cat3.2 <- mysub$variables$edu_cat3 / 2 mysub$variables$edu_cat3.2[which(mysub$variables$edu_cat3 %in% c(1))] <- 0 mysub$variables$mar_cat.1 <- mysub$variables$mar_cat mysub$variables$mar_cat.1[which(mysub$variables$mar_cat %in% c(2, 3))] <- 0 mysub$variables$mar_cat.2 <- mysub$variables$mar_cat / 2 mysub$variables$mar_cat.2[which(mysub$variables$mar_cat %in% c(1, 3))] <- 0 mysub$variables$mar_cat.3 <- mysub$variables$mar_cat / 3 mysub$variables$mar_cat.3[which(mysub$variables$mar_cat %in% c(1, 2))] <- 0 mysub$variables$bmi_cat.1 <- mysub$variables$bmi_cat mysub$variables$bmi_cat.1[which(mysub$variables$bmi_cat %in% c(2, 3))] <- 0 mysub$variables$bmi_cat.2 <- mysub$variables$bmi_cat / 2 mysub$variables$bmi_cat.2[which(mysub$variables$bmi_cat %in% c(1, 3))] <- 0 mysub$variables$bmi_cat.3 <- mysub$variables$bmi_cat / 3 mysub$variables$bmi_cat.3[which(mysub$variables$bmi_cat %in% c(1, 2))] <- 0 mysub$variables$epa_cat3.1 <- mysub$variables$epa_cat3 mysub$variables$epa_cat3.1[which(mysub$variables$epa_cat3 %in% c(2))] <- 0 mysub$variables$epa_cat3.2 <- mysub$variables$epa_cat3 / 2 mysub$variables$epa_cat3.2[which(mysub$variables$epa_cat3 %in% c(1))] <- 0 mysub$variables$dha_cat3.1 <- mysub$variables$dha_cat3 mysub$variables$dha_cat3.1[which(mysub$variables$dha_cat3 %in% c(2))] <- 0 mysub$variables$dha_cat3.2 <- mysub$variables$dha_cat3 / 2 mysub$variables$dha_cat3.2[which(mysub$variables$dha_cat3 %in% c(1))] <- 0 mylist <- c( "ridageyr", "riagendr", "race_cat1", "race_cat2", "race_cat3", "pir_cat5.1", "pir_cat5.2", "pir_cat5.3", "pir_cat5.4", "edu_cat3.1", "edu_cat3.2", "mar_cat.1", "mar_cat.2", "mar_cat.3", "smoker", "regdrink", "bmi_cat.1", "bmi_cat.2", "bmi_cat.3", "fish30", "shellfish30", "epa_cat3.1", "epa_cat3.2", "dha_cat3.1", "dha_cat3.2" ) mysubsub <- subset(mysub, rowSums(apply(mysub$variables[, mylist], 2, is.na)) == 0) iZZ <- mysubsub$variables[, mylist] ixx <- mysubsub$variables$lbxthg iyy <- mysubsub$variables$depr_bi2 iw <- mysubsub$variables$wtdrd14yr # ixx: continuous predictor # iyy: binary outcome # iZZ: covariates to be incorporated in the model # iw: The weighted parameter source("R/FunsForOptimalV2.r") myLBC <- maxLik( logLik = LogLikeFun, grad = ScoreFun, start = inits, ixx = ixx, iyy = iyy, iw = iw, iZZ = as.matrix(iZZ) ) # as.matrix returns all values of iZZ as a matrix myLBC2 <- maxLik( logLik = LogLikeFun2, grad = ScoreFun2, start = inits[-3], ixx = ixx, iyy = iyy, iw = iw, iZZ = as.matrix(iZZ) ) ixx.log <- log(mysubsub$variables$lbxthg) ixx.sqrt <- sqrt(mysubsub$variables$lbxthg) ixx.sq <- (mysubsub$variables$lbxthg) ^ 2
myLBC.log <- maxLik( logLik = LogLikeFun2, grad = ScoreFun2, start = inits[-3], ixx = ixx.log, iyy = iyy, iw = iw, iZZ = as.matrix(iZZ) ) myLBC.sqrt <- maxLik( logLik = LogLikeFun2, grad = ScoreFun2, start = inits[-3], ixx = ixx.sqrt, iyy = iyy, iw = iw, iZZ = as.matrix(iZZ) ) myLBC.ind <- maxLik( logLik = LogLikeFun2, grad = ScoreFun2, start = inits[-3], ixx = ixx, iyy = iyy, iw = iw, iZZ = as.matrix(iZZ) ) myLBC.sq <- maxLik( logLik = LogLikeFun2, grad = ScoreFun2, start = inits[-3], ixx = ixx.sq, iyy = iyy, iw = iw, iZZ = as.matrix(iZZ) ) # calculate the Slope myLBC.log$estimate[2] myLBC.sqrt$estimate[2] myLBC.ind$estimate[2] myLBC.sq$estimate[2] # Calculate AIC-2 * myLBC.log$maximum + 2 * 27-2 * myLBC.sqrt$maximum + 2 * 27-2 * myLBC.ind$maximum + 2 * 27-2 * myLBC.sq$maximum + 2 * 27 # calculate the SE sqrt(-ginv(myLBC.log$hessian)[2, 2] / length(ixx)) sqrt(-ginv(myLBC.sqrt$hessian)[2, 2] / length(ixx)) sqrt(-ginv(myLBC.ind$hessian)[2, 2] / length(ixx)) sqrt(-ginv(myLBC.sq$hessian)[2, 2] / length(ixx)) # calculate the ARD # absolute relative difference (ARD) to measure the difference between the large sample limit (myLBC.log$estimate[2] + 0.1546) / 0.1546 (myLBC.sqrt$estimate[2] + 0.1549) / 0.1549 (myLBC.ind$estimate[2] + 0.1551) / 0.1551 (myLBC.sq$estimate[2] + 0.1556) / 0.1556 #myLBC2$estimate #myLBC$estimate #23.33556/sqrt(dim(mysubsub)[1]) #5.68067/sqrt(dim(mysubsub)[1]) #31.58821/sqrt(dim(mysubsub)[1])
Note that the alternative linear models are the weighted logistic linear models on $X^{(q)}$ scale with the choices of $q$ = 0, 0.5 and 1. The various results are summarized. In comparing the four linear models, we found the square-root model ($q$ = 0.5) has the smallest AIC, which suggests it is the preferred model among these four possibilities. On the other hand, if we look at ARD, the smallest ARD occurs between the log model ($q$ = 0) and the LBC model, which is consistent with our previous finding. That is, the slope estimated from the simple linear model with log-transformed predictor is relatively close to the median effect. It is noteworthy that all four model choices support the apparent inverse association between the blood mercury level and the prevalence of depression in this sample.
beta0 <- myLBC$estimate[1] beta1 <- myLBC$estimate[2] lambda <- myLBC$estimate[3] myV <- (-1 / lambda) aa <- exp(beta0 + beta1 * myV + myLBC$estimate[4:28] %*% apply(iZZ, 2, mean)) aa / (1 + aa) bb <- exp(beta0 + myLBC$estimate[4:28] %*% apply(iZZ, 2, mean)) beta1 * bb / (1 + bb) ^ 2 mydelta <- beta1 * exp((lambda - 1) * myu) mymat <- (-ginv(myLBC$hessian)[1:3, 1:3] / dim(mysubsub)[1]) varbeta1 <- mymat[2, 2] covbeta1lambda <- mymat[2, 3] varlambda <- mymat[3, 3] sqrt(mydelta ^ 2 * ( varbeta1 / beta1 ^ 2 + 2 * myu * covbeta1lambda / beta1 + myu ^ 2 * varlambda )) mydelta + 2 * 0.06851505 myLBC$estimate[2] * exp((0.61789 - 0) * myu) myLBC$estimate[2] * exp((0.61789 - 2) * myu) # calculate the median effects beta1 * exp((lambda - 0) * myu) beta1 * exp((lambda - 0.5) * myu) beta1 * exp((lambda - 1) * myu) beta1 * exp((lambda - 2) * myu) (-0.0096180 - beta1 * exp((lambda - 0) * myu)) / (beta1 * exp((lambda - 0) * myu)) (-0.0072421 - beta1 * exp((lambda - 0.5) * myu)) / (beta1 * exp((lambda - 0.5) * myu)) (-0.0031768 - beta1 * exp((lambda - 1) * myu)) / (beta1 * exp((lambda - 1) * myu)) (-0.0001916 - beta1 * exp((lambda - 2) * myu)) / (beta1 * exp((lambda - 2) * myu)) # # Calculate the CI of the prevalence # Z0 <- as.vector(c(1, apply(iZZ, 2, mean))) Var.Beta <- ginv(-myLBC$hessian) / dim(mysubsub)[1] delta.h.beta <- as.vector(c(1 / lambda, -beta1 / lambda ^ 2)) # Multivariate Delta Method to calculate the variance of beta1/lambda part1 <- t(delta.h.beta) %*% Var.Beta[2:3, 2:3] %*% (delta.h.beta) part2 <- Z0 %*% Var.Beta[-c(2:3), -c(2:3)] %*% Z0 1.96 * sqrt(part1 + part2) aa1 <- exp(beta0 + beta1 * myV + myLBC$estimate[4:28] %*% apply(iZZ, 2, mean) - 1.96 * sqrt(part1 + part2)) aa1 / (1 + aa1) aa2 <- exp(beta0 + beta1 * myV + myLBC$estimate[4:28] %*% apply(iZZ, 2, mean) + 1.96 * sqrt(part1 + part2)) aa2 / (1 + aa2)
The model parameter estimates $$ \hat{\beta}{0}=-1.555(\mathrm{SE} 0.280), \hat{\beta}{1}=-0.155(\mathrm{SE} 0.068) \text { and } \hat{\lambda}=0.618(\mathrm{SE} 0.380) $$ The estimated prevalence $$ \begin{aligned} \hat{\operatorname{Pr}}\left(Y=1|X=0| Z=Z_{0}\right) &=\left.\operatorname{expit}\left(\hat{\beta}{0}+\hat{\beta}{1} X^{(\hat{\lambda})}+\hat{\eta}^{T} Z_{0}\right)\right|{X=0, Z=Z{0}} \ &=7.4 \% \end{aligned} $$ where $Z_{0}$ consists of the means of continuous covariates and population proportions of categorical covariates representing the average population level. The corresponding $95 \%$ CI is $(0.047,0.114)$. This prevalence is close to the reported overall depression rate, $0.081$, for American adults aged 20 and over having depression in a given 2 -week period during 2013-2016 (Brody, Pratt \& Hughes, 2018).
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.