knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The Bayesian Weighted Quantile Sum (BWQS) regression is a statistical model for accommodating mixtures. The algorithm provides a weighted index of the mixture components and its overall association with an outcome, which can be continuous or binary. The overall association of the mixture index (BWQS index) is estimated adjusting for a set of covariates. The prior distributions of all parameters are considered uninformative. Parameters mapped to the overall mixture, and the covariates follow uaninformative Normal distributions, while we assumed a Dirichlet prior density for the weights. The BWQS package has several dependencies, which are uploaded automatically with the BWQS package.
The algorithms are applied to a simulated dataset and all the steps are listed in order to make reproducible analyses. In this first part of the code the environment is set up with all the useful libraries.
library(BWQS) library(rstan) library(MASS) library(knitr) library(clusterGeneration) library(kableExtra)
The function bwqs
fits the Bayesian Quantile Weighted Sums model. The object family
allows the model to be used for both continuous or binary outcomes. The model is the following $$y_i = \beta_0 + \beta_1 \sum_{i=0}^5w_jq_{ij} + \sum_{k=0}^2 \delta_kc_{ik}$$ where $\beta_0$ is the intercept, $\beta_1$ is the overall association of the mixture with the outcome, $\sum_{i=0}^5w_jq_{ij}$ is the mixture composition made by the product between weights and quartiles (q = 4
in bwqs
function) of the metals concentrations.
Synthetic dataset made of 1000 observations with a mixture composed by 5 metals (Al, As, Cd, Pb, V). The coefficients are $\beta_0 = -2$, $\beta_1 = 0.8$, the set of weights $w = [0.50,0.20,0.20,0.05,0.05]$ and the overall $\sigma = 1$. Two additional covariates are included in the model with two coefficients $\delta_1 = 0.5$ and $\delta_2 = 1.1$. Here the dataset:
# fix the seed set.seed(1990) # Sample sizes N = 1000 # Mean & SD of variables mu = c(0,0,0,1,3) sd = c(1,1,1,3,1) sd2 = sd %*% t(sd) # Correlation Matrix rho = 0.65 corMat = cbind(c(1,rho,rho^2,rho^2,rho^2), c(rho,1,rho^2, rho^2, rho^2), c(rho^2,rho^2,1,rho^2,rho^2), c(rho^2,rho^2,rho^2,1, rho), c(rho^2,rho^2,rho^2,rho,1)) # Covariance Matrix Sigma = sd2*corMat # Simulating five correlated exposure variables (Metals) X = as.data.frame(mvrnorm(N, mu = mu, Sigma = Sigma, empirical = TRUE)) colnames(X) = c('Al','As','Cd','Pb','V') # Quantile trasformation Xq = as.matrix(quantile_split(X, mix_name = colnames(X), q=4)) # Intercept coefficient beta0 = -2 # Overall effect beta1 = -0.8 # Weights W = c(0.5,0.20,0.2,0.05,0.05) # sigma of the model sigma = 1 # Simulate covariates D = cbind(rnorm(N,1.3,1), rnorm(N,0.5,0.5)) colnames(D) = c("cov1","cov2") # coefficient of covariates delta = c(0.5, 1.1) # Outcome simulation (continuos) y = rnorm(N, beta0 + beta1*(Xq %*% W) + D %*% delta, sd = sigma) # Aggregate data in a data.frame Data = as.data.frame(cbind(y,X,D)) head(round(Data,3))
Now that we have the dataset we can fit the BWQS model with the following code:
# we run the model ans save results in "fit_bwqs" variable fit_bwqs = bwqs(y ~ cov1 + cov2, mix_name = c('Al','As','Cd','Pb','V'), data = Data, q = 4, family = "gaussian")
The first function object is the underlying model, i.e. formula y ~ cov1 + cov2
; here we have the outcome y
and two covariates. Additional covariates can be included in the formula separated by +
. If no covariates are included in the model, the formula has to be specified as y ~ NULL
. The mix_name
object contains a vector of strings with the name of the mixture components. Data is the name of the dataset, containing the outcome, the mixture components, and the covariates, note that the dataset should ve in data.frame
format. The q
object indicates how the mixture components are categorized (as quantiles), by default is set on 4 (quartiles), but also allows deciles (q = 10
), centiles (q = 100
) or continuous values (q = NULL
). Other parameters can be modified; please check the BWQS package documentation for details.
Results are saved in the fit_bwqs
object, which can be used to plot the results. In the results table: the first two columns are the mean and the standard deviation of the mean, the two following columns contain the values of the credible interval at 95%. The credible interval level can be chosen using the c_int
parameter. The last two columns report values to evaluate the fitness of parameters and the convergence of the algorithm.
fit_bwqs$summary_fit %>% kbl() %>% kable_classic(full_width = F, html_font = "Cambria")
The following plots summarize the results: estimates and 95% credible intervals. The first object of the function contains the results of the model. parms
is a vector with the name of the variables to be plotted: beta0
and beta1
for the main parameters of the model, W
for the weights and delta
for the covariates. The size
identifies the magnitude of the means dots.
# bwqs_plot(fit_bwqs, parms = "W", size = 2)
# bwqs_plot(fit_bwqs, parms = c("beta0","beta1",'delta',"sigma"), size = 2)
The bwqs_waic
function evaluates the goodness of fit for the BWQS model. Several measures are included in the function. For further details check here.
bwqs_waic(fit_bwqs$fit)
The bwqs
function allows the user to fit model even with different type of outcomes like binary or poisson. In order to fit model with different outcomes we should specify the family
parameter in the bwqs
function, this parameter can be "binomial"
or "poisson"
based on the outcome that we want to use to build our model. Note that binomial only allows ${0,1}$ values while poisson allows integer number; in both cases the misspecification of the values will lead to errors.
Another important feature of bwqs
function is the possibility to set the direction of the prior of the $\beta_1$ coefficient (the coefficient that explain the effect of the mixture). Developing model can happen that we have prior strong information on the constraint that the $\beta_1$ can have, in this case we only have to specify the parameter prior
. The default value is "None"
but it allows also "positive"
and "negative"
based on the assumptions that we have on the overall effect of the mixture.
Here an example for the regression model:
$$\begin{gather} Y_{i} \sim \mathcal{N}( \alpha_{j[i]} + \beta_{j[i]}BWQS_{i} + \gamma Z_{i}, \sigma_c) \quad where \quad BWQS_{i} = \sum_{k=1}^K x_{ik}w_k \ \alpha \sim \mathcal{N}( \mu_{\alpha},\sigma_{\alpha}), \quad \beta \sim \mathcal{N}( \mu_{\beta},\sigma_{\beta}) \quad where: \ \begin{pmatrix} \mu_{\alpha} \ \mu_{\beta} \end{pmatrix} \sim \mathcal{N}2\left( \begin{pmatrix} 0 \ 0 \end{pmatrix}, \begin{bmatrix} 100 && 0 \ 0 && 100 \end{bmatrix} \right) \quad and \quad \sigma{\alpha},\sigma_{\beta} \sim \mathcal{IG}(0.01,0.01) \ \sigma_c \sim \mathcal{IG}(0.01,0.01), \quad w \sim \mathcal{Dir}k((1,\ldots,1)) \quad and \quad \gamma \sim \mathcal{N}{z} \left( \begin{pmatrix} 0 \ \vdots \ 0 \end{pmatrix} ,\begin{bmatrix} 100 && \ & \ddots & \ && 100 \end{bmatrix} \right) \ \end{gather}$$
Notation:
The HBWQS frameworks is pretty similar to the BWQS. The main different is the flexibility on multiple cohorts. In order to specify the cohort we need to use and additional parameter cohort
which is a vector of integer that needs to point every observation to the specific cohort.
We generate a synthetic dataset to test the HBWQS model: we want 5 different cohorts with 250,200,150,100,100 people for each cohort. We choose 4 chemical for the mixture and we are using quartiles. Note that STAN doesn't allow factors so we have to re-map the categorical variables into dummy variables (excluding the reference value). The index of cohort should be an integer starting from 1 and we'll specify the column name in code. Here the synthetic dataset:
set.seed(1990) N = c(250,150,100,100,200) C = 4 names_ch = paste0("Ch_",1:C) # Mean and variance of each chemical mu <- rnorm(C) sd <- runif(C,0,0.2) sd2 <- sd %*% t(sd) # Correlation Matrix corMat = rcorrmatrix(C,alphad=1) # Simulating three correlated exposure varMatrix Sigma <- sd2*corMat X <- as.data.frame(mvrnorm(sum(N), mu=mu, Sigma = Sigma, empirical=TRUE)) colnames(X)<-names_ch # Quantile transformation Xq = matrix(NA,sum(N),C) for(i in 1:C) Xq[,i] = ecdf(X[[names_ch[i]]])(X[[names_ch[i]]])*4 beta0_1 = 0.65 # intercept 1st cohort beta1_1 = (-1.7) # slope 1st cohort beta0_2 = -0.38 # intercept 2nd cohort beta1_2 = -0.94 # slope 2nd cohort beta0_3 = 0.15 # intercept 3rd cohort beta1_3 = -0.11 # slope 3rd cohort beta0_4 = (-0.98) # intercept 4th cohort beta1_4 = -1.45 # slope 4th cohort beta0_5 = 1.13 # intercept 5th cohort beta1_5 = -1.88 # slope 5th cohort W = c(0.1,0.4,0.3,0.2) # weights sigma = 1 # sigma K = cbind(runif(sum(N),20,40), rbinom(sum(N),1,prob = 0.5), runif(sum(N),11,18)) colnames(K) = c("m_age","sex","education") d = c(-0.4,0.5,0.2) Nh = cumsum(N) y1 = rnorm(N[1], beta0_1 + beta1_1 * (Xq %*% W)[1:Nh[1]] + (K %*% d)[1:Nh[1]],sigma) y2 = rnorm(N[2], beta0_2 + beta1_2 * (Xq %*% W)[(Nh[1]+1):Nh[2]] + (K %*% d)[(Nh[1]+1):Nh[2]],sigma) y3 = rnorm(N[3], beta0_3 + beta1_3 * (Xq %*% W)[(Nh[2]+1):Nh[3]] + (K %*% d)[(Nh[2]+1):Nh[3]],sigma) y4 = rnorm(N[4], beta0_4 + beta1_4 * (Xq %*% W)[(Nh[3]+1):Nh[4]] + (K %*% d)[(Nh[3]+1):Nh[4]],sigma) y5 = rnorm(N[5], beta0_5 + beta1_5 * (Xq %*% W)[(Nh[4]+1):Nh[5]] + (K %*% d)[(Nh[4]+1):Nh[5]],sigma) dt = data.frame(y = c(y1,y2,y3,y4,y5), K, X, cohort = rep(c(1,2,3,4,5),N)) head(round(dt,3))
Now that we have the dataset we can fit the HBWQS model with the following code:
# we run the model ans save results in "fit_hbwqs" variable fit_hbwqs = bwqs_r(y ~ m_age + sex + education, mix_name = c("Ch_1","Ch_2","Ch_3","Ch_4"), cluster_name = "cohort", Dalp = rep(1,4), iter = 10000, thin = 3, data = dt, q = 4, family = "gaussian")
fit_hbwqs$summary_fit %>% kbl() %>% kable_classic(full_width = F, html_font = "Cambria")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.