# Estimating and Testing Direct Effects in Directed Acyclic Graphs using Estimating Equations" In CIEE: Estimating and Testing Direct Effects in Directed Acyclic Graphs using Estimating Equations

## Overview

In any association study, it is important to distinguish direct and indirect effects in order to build truly functional models. For this purpose, we consider a directed acyclic graph (DAG) setting with an exposure variable, primary and intermediate outcome variables, and confounding factors. In order to make valid statistical inference on the direct effect of the exposure on the primary outcome, it is necessary to consider all potential effects in the graph, and we propose to use the estimating equation method with robust Huber-White sandwich standard errors. Then, a large-sample Wald-type test statistic is computed for testing the absence of the direct effect. In this package, the proposed causal inference method based on estimating equations (CIEE) is implemented for both the analysis of continuous and time-to-event primary outcomes subject to censoring for the model in Figure 1. Additionally, standard multiple regression, regression of residuals, and the structural equation approach are implemented for fitting the same model.

Results from simulation studies (Konigorski et al., 2018) showed that CIEE successfully removes the effect of intermediate outcomes from the primary outcome and is robust against measured and unmeasured confounding of the indirect effect through observed factors. Also, an application in a genetic association study in the same study showed that CIEE can identify genetic variants that would be missed by traditional regression methods. Both multiple regression methods and the structural equation method fail in some scenarios where their corresponding test statistics lead to inflated type I errors. An alternative approach for the analysis of continuous traits is the sequential G-estimation method (Vansteelandt et al., 2009).

In this package, CIEE is implemented for the model described in the DAG in Figure 1, which includes the direct effect $\alpha_{XY}$ of an exposure X on the primary outcome Y and an indirect effect of X on Y through a secondary outcome K. The model further includes measured and unmeasured factors L and U, respectively, which potentially confound the effect of K on Y. CIEE can also be applied to different models with different error distributions. The goal is to estimate and test the direct effect $\alpha_{XY}$, while removing the indirect effect of X on Y through K, and with robustness against effects of L and U. Without restriction of generality, it is assumed that there aren't any factors affecting X and that any such factors are included as covariates in the analysis or have been dealt with using other approaches. Also, we generally assume that either $\alpha_{LY}=0$ (L is a factor influencing K) or $\alpha_{XL}=0$ (L is a measured confounder of $K \to Y$). Otherwise, the effect of L as intermediate outcome could be removed from Y in the analysis analogously to K.

## Alternative approaches

Two traditional methods for the aim to estimate and test $\alpha_{XY}$ are (i) to include the intermediate outcomes and factors as covariates in a multiple regression (MR) model of the primary outcome on the exposure, or (ii) to first regress the primary outcome on the intermediate outcome and factors, and then regress the extracted residuals on the exposure (regression of residuals, RR). In more detail, estimates of $\alpha_{XY}$ are obtained from fitting the following models in the quantitative outcome setting for a normally-distributed Y (GLM setting):

MR: Obtain the least squares (LS) estimate of $\alpha_{XY}$ by fitting $$Y_i = \alpha_0 + \alpha_{XY} x_i + \alpha_1 k_i + \alpha_2 l_i + \varepsilon_i, \ \varepsilon_i \sim N(0,\sigma_1^2)$$.

RR: First, obtain residuals $\hat{\epsilon}{1i} = y_i - \hat{\alpha}_0 - \hat{\alpha}_1 k_i - \hat{\alpha}_2 l_i$ by fitting
$$Y_i = \alpha_0 + \alpha_1 k_i + \alpha_2 l_i + \varepsilon {1i}, \ \varepsilon_{1i} \sim N(0,\sigma_1^2 )$$ using the LS estimation. Second, obtain the LS estimate of $\alpha_{XY}$ by fitting $$\hat{\varepsilon}{1i} = \alpha_3 + \alpha{XY} x_i + \varepsilon_{2i}, \ \varepsilon_{2i} \sim N(0, \sigma_2^2).$$

Then, $H_0: \alpha_{XY} = 0$ versus $H_A: \alpha_{XY} \neq 0$ is tested using the default t-test in the lm() function in R. For the analysis of a censored time-to-event primary trait Y (accelerated failure time setting; AFT), only the MR approach is implemented. Here, the equivalent censored log-linear regression model is fitted using the survreg() function in the survival R package to obtain the maximum likelihood estimate of $\alpha_{XY}$, and a Wald-type test is performed for testing the null hypothesis $H_0: \alpha_{XY} = 0$. Both approaches are implemented in the functions mult_reg() and res_reg() and can be used as follows. For this illustration, data is first generated using the generate_data() function, which generates data for the quantitative outcomes Y and K, a genetic marker X (single nucleotide polymorphism, SNP, taking values 0, 1, 2) as exposure, and observed as well as unobserved confounders L, U.

generate_data <- function(setting = "GLM", n = 1000, maf = 0.2, cens = 0.3,
a = NULL, b = NULL, aXK = 0.2, aXY = 0.1, aXL = 0,
aKY = 0.3, aLK = 0, aLY = 0, aUY = 0, aUL = 0,
mu_X = NULL, sd_X = NULL, X_orth_U = TRUE, mu_U = 0,
sd_U = 1, mu_K = 0, sd_K = 1, mu_L = 0, sd_L = 1,
mu_Y = 0, sd_Y = 1) {
U_out <- rnorm(n, mean = mu_U, sd = sd_U)
if (setting == "AFT" & (is.null(a) | is.null(b))) {
stop("a and b have to be specified under the AFT setting.")
}
if (X_orth_U == TRUE) {
X_out <- rbinom(n, size = 2, prob = maf)
}
if (X_orth_U == FALSE) {
X_out <- pnorm(U_out, mean = mu_X, sd = sd_X)
p <- 1 - maf
for (j in 1:length(X_out)) {
if (X_out[j] < p^2) {
X_out[j] <- 0
next
}
if (X_out[j] >= p^2 & X_out[j] < p^2 + 2 * p * (1 - p)) {
X_out[j] <- 1
next
}
if (X_out[j] >= p^2 + 2 * p * (1 - p)) {
X_out[j] <- 2
next
}
}
}
L_out <- aUL * U_out + aXL * X_out + rnorm(n, mean = mu_L, sd = sd_L)
K_out <- aXK * X_out + aLK * L_out + rnorm(n, mean = mu_K, sd = sd_K)
Y_out <- aUY * U_out + aKY * K_out + aXY * X_out + aLY * L_out +
rnorm(n, mean = mu_Y, sd = sd_Y)
data <- data.frame(Y = Y_out, K = K_out, X = X_out, L = L_out, U = U_out)
if (setting == "AFT") {
T_help <- exp(Y_out)
### Create censoring indicator and censored times no censoring
if (cens == 0) {
T_out <- T_help
C_out <- rep(1, n)  # C_out==0 is censored, C_out==1 is uncensored
}
if (!cens == 0) {
# there is censoring; cens is the percentage of censored data
T_cens <- runif(n, min = a, max = b)  # a, b for desired censoring rate
C_out <- as.numeric(T_help < T_cens)  # C==0 censored, C==1 uncensored
T_out <- pmin(T_help, T_cens)
Y_out <- log(T_out)
}
cens_out <- sum(abs(C_out - 1))/n
data <- data.frame(Y = Y_out, K = K_out, X = X_out, L = L_out, U = U_out,
T = T_out, C = C_out)
print(paste("The empirical censoring rate obtained through the specified parameters a=", a, " and b=", b, " is ", cens_out, ".", sep = ""))
if (abs(cens_out - cens) > 0.1) {
warning(paste("This obtained empirical censoring rate is quite different from the desired censoring rate cens=", cens, ". Please check and adapt values for a and b.",
sep = ""))
}
}
return(data)
}
mult_reg <- function(setting = "GLM", Y = NULL, X = NULL, K = NULL,
L = NULL, C = NULL) {
if (!requireNamespace("survival", quietly = TRUE)) {
stop("Pkg needed for this function to work. Please install it.", call. = FALSE)
}
if (is.null(setting)) {
stop("setting has to be supplied.")
}
if (is.null(Y) | is.null(X) | is.null(K) | is.null(L)) {
stop("Data of one or more variables are not supplied.")
}
if (setting == "AFT" & is.null(C)) {
stop("C has to be supplied for the AFT setting.")
}
if (setting == "GLM") {
fit_mult_reg <- lm(Y ~ K + X + L)
point_estimates <- c(summary(fit_mult_reg)$coefficients[1, 1], summary(fit_mult_reg)$coefficients[2, 1],
summary(fit_mult_reg)$coefficients[3, 1], summary(fit_mult_reg)$coefficients[4, 1])
SE_estimates <- c(summary(fit_mult_reg)$coefficients[1, 2], summary(fit_mult_reg)$coefficients[2, 2],
summary(fit_mult_reg)$coefficients[3, 2], summary(fit_mult_reg)$coefficients[4, 2])
pvalues <- c(summary(fit_mult_reg)$coefficients[1, 4], summary(fit_mult_reg)$coefficients[2, 4],
summary(fit_mult_reg)$coefficients[3, 4], summary(fit_mult_reg)$coefficients[4, 4])
names(point_estimates) <- names(SE_estimates) <- names(pvalues) <-
c("alpha_0", "alpha_1", "alpha_XY", "alpha_2")
}
if (setting == "AFT") {
fit_mult_reg <- survival::survreg(survival::Surv(Y, C) ~ K + X + L,
dist = "gaussian")
point_estimates <- c(summary(fit_mult_reg)$table[1, 1], summary(fit_mult_reg)$table[2, 1],
summary(fit_mult_reg)$table[3, 1], summary(fit_mult_reg)$table[4, 1])
SE_estimates <- c(summary(fit_mult_reg)$table[1, 2], summary(fit_mult_reg)$table[2, 2],
summary(fit_mult_reg)$table[3, 2], summary(fit_mult_reg)$table[4, 2])
pvalues <- c(summary(fit_mult_reg)$table[1, 4], summary(fit_mult_reg)$table[2, 4],
summary(fit_mult_reg)$table[3, 4], summary(fit_mult_reg)$table[4, 4])
names(point_estimates) <- names(SE_estimates) <- names(pvalues) <-
c("alpha_0", "alpha_1", "alpha_XY", "alpha_2")
}
return(list(point_estimates = point_estimates,
SE_estimates = SE_estimates, pvalues = pvalues))
}
res_reg <- function(Y = NULL, X = NULL, K = NULL, L = NULL) {
if (is.null(Y) | is.null(X) | is.null(K) | is.null(L)) {
stop("Data of one or more variables is not supplied.")
}
data_help <- data.frame(Y = Y, X = X, K = K, L = L)
data_help <- data_help[complete.cases(data_help), ]
fit_res_reg_1 <- lm(data_help$Y ~ data_help$K + data_help$L) res <- fit_res_reg_1$residuals
fit_res_reg_2 <- lm(res ~ data_help$X) point_estimates <- c(summary(fit_res_reg_1)$coefficients[1, 1],
summary(fit_res_reg_1)$coefficients[2, 1], summary(fit_res_reg_1)$coefficients[3, 1],
summary(fit_res_reg_2)$coefficients[1, 1], summary(fit_res_reg_2)$coefficients[2, 1])
SE_estimates <- c(summary(fit_res_reg_1)$coefficients[1, 2], summary(fit_res_reg_1)$coefficients[2, 2],
summary(fit_res_reg_1)$coefficients[3, 2], summary(fit_res_reg_2)$coefficients[1, 2],
summary(fit_res_reg_2)$coefficients[2, 2]) pvalues <- c(summary(fit_res_reg_1)$coefficients[1, 4],
summary(fit_res_reg_1)$coefficients[2, 4], summary(fit_res_reg_1)$coefficients[3, 4],
summary(fit_res_reg_2)$coefficients[1, 4], summary(fit_res_reg_2)$coefficients[2, 4])
names(point_estimates) <- names(SE_estimates) <- names(pvalues) <-
c("alpha_0", "alpha_1", "alpha_2", "alpha_3", "alpha_XY")
return(list(point_estimates = point_estimates,
SE_estimates = SE_estimates, pvalues = pvalues))
}
dat <- generate_data(setting="GLM", n = 1000, maf = 0.2, cens = 0.3, a = NULL, b = NULL,
aUL = 0, aXL = 0, aXK = 0.2, aLK = 0, aUY = 0, aKY = 0.3, aXY = 0.1,
aLY = 0, mu_U = 0, sd_U = 1, X_orth_U = TRUE, mu_X = NULL,
sd_X = NULL, mu_L = 0, sd_L = 1, mu_K = 0, sd_K = 1, mu_Y = 0,
sd_Y = 1)
mult_reg(setting = "GLM", Y = dat$Y, X = dat$X, K = dat$K, L = dat$L)
res_reg(Y = dat$Y, X = dat$X, K = dat$K, L = dat$L)

As another approach for modeling DAGs, the structural equation modeling method (SEM; Bollen, 1989) can be used. Among others, it is implemented in the sem() function of the lavaan package (Rosseel, 2012). For a comparison of the results, the function sem_appl() applies the SEM method to the DAG in Figure 1 based on the following model equations:

$$L_i = \alpha_0 + \alpha_1 x_i + \varepsilon_{1i}, \ \varepsilon_{1i} \sim N(0,\sigma_1^2 )$$ $$K_i = \alpha_2 + \alpha_3 x_i + \alpha_2 l_i + \varepsilon_{2i}, \ \varepsilon_{2i} \sim N(0,\sigma_2^2 )$$ $$Y_i = \alpha_5 + \alpha_6 k_i + \alpha_{XY} x_i + \varepsilon_{3i}, \ \varepsilon_{3i} \sim N(0,\sigma_3^2 )$$

sem_appl <- function(setting = "GLM", Y = NULL, X = NULL, K = NULL, L = NULL) {
if (!requireNamespace("lavaan", quietly = TRUE)) {
stop("Pkg needed for this function to work. Please install it.", call. = FALSE)
}
if (is.null(setting)) {
stop("setting has to be supplied.")
}
if (setting == "AFT") {
stop("Only GLM setting is implemented.")
}
if (is.null(Y) | is.null(X) | is.null(K) | is.null(L)) {
stop("Data of one or more variables are not supplied.")
}
data_help <- data.frame(Y = Y, X = X, K = K, L = L)
data_help <- data_help[complete.cases(data_help), ]
model <- "
L ~ X
K ~ X + L
Y ~ K + X
"
fit <- lavaan::sem(model, data = data_help)

point_estimates <- c(fit@Fit@est[1], fit@Fit@est[2], fit@Fit@est[3],
fit@Fit@est[4], fit@Fit@est[5])
SE_estimates <- c(fit@Fit@se[1], fit@Fit@se[2], fit@Fit@se[3],
fit@Fit@se[4], fit@Fit@se[5])
pvalues <- 2 * pnorm(-abs(point_estimates/SE_estimates))
names(point_estimates) <- names(SE_estimates) <- names(pvalues) <-
c("alpha_1", "alpha_3", "alpha_4", "alpha_6", "alpha_XY")
return(list(point_estimates = point_estimates,
SE_estimates = SE_estimates, pvalues = pvalues))
}
sem_appl(Y = dat$Y, X = dat$X, K = dat$K, L = dat$L)

Further proposed approaches for fitting the model in Figure 1 include the two-stage sequential G-estimation method (Vansteelandt et al., 2009). It first removes the effect of K from the primary outcome Y, and then tests the association of X with the adjusted primary outcome.

In more detail, for the analyis of a quantitative outcome Y with n independent observations, in the first stage, the effect of K on Y, $\alpha_1$, is estimated and $\hat{\alpha}_1$ is obtained using the LS estimation method under the model

$$Y_i = \alpha_0 + \alpha_1 k_i + \alpha_2 x_i + \alpha_3 l_i + \varepsilon_i, \ \varepsilon_i \sim N(0,\sigma_1^2), \quad i=1,…,n$$

Then, to block all indirect paths from X to Y, the adjusted outcome $\tilde{Y}$ is obtained by removing the effect of K on Y with

$$\tilde{y}_i = y_i - \bar{y} - \hat{\alpha}_1 (k_i - \bar{k})$$

where $\bar{y} = \sum_{i=1}^n y_i$ and $\bar{k} = \sum_{i=1}^n k_i$. In the second stage, the significance of the direct effect of X on Y, $\alpha_{XY}$, is tested under the model

\tilde{Y}i = \alpha_4 + \alpha{XY} x_i + \varepsilon_i, \ \varepsilon_i \sim N(0,\sigma_2^2 )

using the proposed test statistic in Vansteelandt and colleagues (2009). The approach is implemented in the CGene package, which can be obtained from cran.r-project.org/src/contrib/Archive/CGene/.

## Causal inference using estimating equations (CIEE)

CIEE follows the general idea of the two-stage sequential G-estimation method with the major difference that the approach is one-stage and obtains coefficient estimates of all parameters simultaneously by solving estimating equations. This also allows building on existing asymptotic properties of the estimator and obtaining robust sandwich standard error estimates considering the additional variability of the estimates from the outcome adjustment.

In more detail, for the analyis of a quantitative outcome Y, we formulate unbiased estimating equations $U(\theta)=0$ for a consistent estimation of the unknown parameter vector $\theta = (\alpha_0, \alpha_1, \alpha_2, \alpha_3, \sigma_1^2, \alpha_4, \alpha_{XY}, \sigma_2^2)^T$ where

$$U(\theta) = \left( \frac{\partial l_1(\theta)}{\partial \alpha_0}, \frac{\partial l_1(\theta)}{\partial \alpha_1}, \frac{\partial l_1(\theta)}{\partial \alpha_2}, \frac{\partial l_1(\theta)}{\partial \alpha_3}, \frac{\partial l_1(\theta)}{\partial \sigma_1^2}, \frac{\partial l_1(\theta)}{\partial \alpha_4}, \frac{\partial l_1(\theta)}{\partial \alpha_{XY}}, \frac{\partial l_1(\theta)}{\partial \sigma_2^2} \right)^T$$ with

$$l_1(\theta) = \sum_{i=1}^n \left[ -log(\sigma_1) + log\left( \phi\left( \frac{y_i - \alpha_0 - \alpha_1 k_i - \alpha_2 x_i - \alpha_3 l_i}{\sigma_1} \right) \right) \right]$$

and

$$l_2(\theta) = \sum_{i=1}^n \left[ -log(\sigma_2) + log\left( \phi\left( \frac{y_i - \bar{y} - \alpha_1 (k_i - \bar{k}) - \alpha_4 - \alpha_{XY} x_i}{\sigma_2} \right) \right) \right],$$

where $\phi$ is the probability density function of the standard normal distribution.

By solving the first five estimating equations based on $l_1(\theta)$, we are hence obtaining estimates of $\alpha_0, \alpha_1, \alpha_2, \alpha_3, \sigma_1^2$. Analogously, solving the last three estimating equations based on $l_2(\theta)$ yields estimates of $\alpha_4, \alpha_{XY}, \sigma_2^2$. To give an intuition on how these estimating equations are obtained, $l_1(\theta)$ is the log-likelihood function under the model in (1) and $l_2(\theta)$ is the log-likelihood function under the model in (3) given that $\alpha_1$ is known. All parameters in $\theta$ are estimated simultaneously and the additional variability obtained in the outcome adjustment in (2) is considered by using the robust Huber-White sandwich estimator of the standard error of $\hat{\theta}$. Then, the large sample Wald-type test statistic $\hat{\theta}/\widehat{SE}(\hat{\theta})$, which has an asymptotic standard normal distribution, is computed to test the absence of the exposure effect $\alpha_{XY}$.

For the analysis of a censored time-to-event primary outcome Y, the estimating equations can be constructed as described above for a quantitative primary phenotype (for the same parameters except for $\sigma_1$ instead of $\sigma_1^2$), but in order to remove the effect of K from Y, the true underlying log survival times $Y_{est}$ need to be estimated for censored survival times. For uncensored survival times, $Y_{est}$ equals the observed log-survival time Y. Then, the estimating equations are constructed accordingly, the robust Huber-White sandwich estimator of the standard error is obtained and the large-sample Wald-type tests are computed. For more statistical details, see Konigorski et al. (2018).

In the implementation of CIEE in this package, the est_funct_expr() function contains the estimating equations as an expression.

est_funct_expr <- function(setting = "GLM") {
if (is.null(setting)) {
stop("setting has to be supplied.")
}
if (setting == "GLM") {
logL1 <- expression(log((1/sqrt(sigma1sq)) * dnorm((y_i - alpha0 -
alpha1 * k_i - alpha2 * x_i - alpha3 * l_i)/
sqrt(sigma1sq), mean = 0, sd = 1)))
logL2 <- expression(log((1/sqrt(sigma2sq)) * dnorm((y_i - y_bar -
alpha1 * (k_i - k_bar) - alpha4 - alphaXY * x_i)/
sqrt(sigma2sq), mean = 0, sd = 1)))
}
if (setting == "AFT") {
logL1 <- expression(-c_i * log(sigma1) + c_i * log(dnorm((y_i -
alpha0 - alpha1 * k_i - alpha2 * x_i -
alpha3 * l_i)/sigma1, mean = 0, sd = 1)) +
(1 - c_i) * log(1 - pnorm((y_i - alpha0 -
alpha1 * k_i - alpha2 * x_i - alpha3 * l_i)/
sigma1, mean = 0, sd = 1)))
logL2 <- expression(log((1/sqrt(sigma2sq)) * dnorm(((c_i * y_i +
(1 - c_i) * ((alpha0 + alpha1 * k_i +
alpha2 * x_i + alpha3 * l_i) + (sigma1 *
dnorm((y_i - alpha0 - alpha1 * k_i -
alpha2 * x_i - alpha3 * l_i)/sigma1, mean = 0,
sd = 1)/(1 - pnorm((y_i - alpha0 - alpha1 * k_i -
alpha2 * x_i - alpha3 * l_i)/sigma1, mean = 0,
sd = 1))))) - y_adj_bar - alpha1 * (k_i - k_bar) -
alpha4 - alphaXY * x_i)/sqrt(sigma2sq),
mean = 0, sd = 1)))
}
return(list(logL1 = logL1, logL2 = logL2))
}
estfunct <- est_funct_expr(setting="GLM")
estfunct
est_funct_expr(setting = "AFT")

The function get_estimates() obtains estimates of the parameters in the models (1)-(3) by using the lm() and survreg() functions for computational purposes. The estimates are identical to estimates obtained by solving the estimating equations.

get_estimates <- function(setting = "GLM", Y = NULL, X = NULL, K = NULL,
L = NULL, C = NULL) {
if (!requireNamespace("survival", quietly = TRUE)) {
stop("Pkg needed for this function to work. Please install it.", call. = FALSE)
}
if (is.null(setting)) {
stop("setting has to be supplied.")
}
if (is.null(Y) | is.null(X) | is.null(K) | is.null(L)) {
stop("Data of one or more variables are not supplied.")
}
if (setting == "AFT" & is.null(C)) {
stop("C has to be supplied for the AFT setting.")
}
n <- length(Y)

if (setting == "GLM") {
data_help <- data.frame(Y = Y, X = X, K = K, L = L)
data_help <- data_help[complete.cases(data_help), ]
######### Stage 1 #########
fit_stage_1 <- lm(data_help$Y ~ data_help$K + data_help$X + data_help$L)

alpha_0_out <- summary(fit_stage_1)$coefficients[1, 1] alpha_1_out <- summary(fit_stage_1)$coefficients[2, 1]
alpha_2_out <- summary(fit_stage_1)$coefficients[3, 1] alpha_3_out <- summary(fit_stage_1)$coefficients[4, 1]
sigma_1_sq_out <- (n - 4)/n * summary(fit_stage_1)$sigma^2 ######### Stage 2 ######### Y_tilde <- data_help$Y - mean(data_help$Y) - alpha_1_out * (data_help$K - mean(data_help$K)) fit_stage_2 <- lm(Y_tilde ~ data_help$X)

alpha_4_out <- summary(fit_stage_2)$coefficients[1, 1] alpha_XY_out <- summary(fit_stage_2)$coefficients[2, 1]
sigma_2_sq_out <- (n - 2)/n * summary(fit_stage_2)$sigma^2 point_estimates <- c(alpha_0_out, alpha_1_out, alpha_2_out, alpha_3_out, sigma_1_sq_out, alpha_4_out, alpha_XY_out, sigma_2_sq_out) names(point_estimates) <- c("alpha_0", "alpha_1", "alpha_2", "alpha_3", "sigma_1_sq", "alpha_4", "alpha_XY", "sigma_2_sq") } if (setting == "AFT") { data_help <- data.frame(Y = Y, X = X, K = K, L = L, C = C) data_help <- data_help[complete.cases(data_help), ] ######### Stage 1 ######### fit_stage_1 <- survival::survreg(survival::Surv(data_help$Y, data_help$C) ~ data_help$K + data_help$X + data_help$L,
dist = "gaussian")

alpha_0_out <- summary(fit_stage_1)$table[1, 1] alpha_1_out <- summary(fit_stage_1)$table[2, 1]
alpha_2_out <- summary(fit_stage_1)$table[3, 1] alpha_3_out <- summary(fit_stage_1)$table[4, 1]
sigma_1_out <- fit_stage_1$scale ######### Stage 2 ######### mu <- fit_stage_1$linear.predictors
Y_adj <- data_help$C * data_help$Y + (1 - data_help$C) * (mu + (sigma_1_out * dnorm((data_help$Y - mu)/sigma_1_out,
mean = 0, sd = 1)/(1 - pnorm((data_help$Y - mu)/sigma_1_out, mean = 0, sd = 1)))) Y_tilde <- Y_adj - mean(Y_adj) - alpha_1_out * (data_help$K - mean(data_help$K)) fit_stage_2 <- lm(Y_tilde ~ data_help$X)

alpha_4_out <- summary(fit_stage_2)$coefficients[1, 1] alpha_XY_out <- summary(fit_stage_2)$coefficients[2, 1]
sigma_2_sq_out <- (n - 2)/n * summary(fit_stage_2)$sigma^2 point_estimates <- c(alpha_0_out, alpha_1_out, alpha_2_out, alpha_3_out, sigma_1_out, alpha_4_out, alpha_XY_out, sigma_2_sq_out, mean(Y_adj)) names(point_estimates) <- c("alpha_0", "alpha_1", "alpha_2", "alpha_3", "sigma_1", "alpha_4", "alpha_XY", "sigma_2_sq", "y_adj_bar") } return(point_estimates) } estimates <- get_estimates(setting = "GLM", Y = dat$Y, X = dat$X, K = dat$K, L = dat$L) estimates In order to compute the robust Huber-White sandwich estimator of the parameters, in a first step, the deriv_obj() function computes the expression of all first and second derivatives for the 8 parameters$\alpha_0, \alpha_1, \alpha_2, \alpha_3, \sigma_1^2, \alpha_4, \alpha_{XY}, \sigma_2^2$by using the expressions from the est_funct_expr() function as input. Then, the numerical values of all first and second derivatives are obtained for the observed data and parameter point estimates for all observed individuals, using the scores() and hessian() functions. deriv_obj <- function(setting = "GLM", logL1 = NULL, logL2 = NULL, Y = NULL, X = NULL, K = NULL, L = NULL, C = NULL, estimates = NULL) { if (is.null(setting) | is.null(logL1) | is.null(logL2) | is.null(estimates)) { stop("One or more arguments of the function are missing.") } if (is.null(Y) | is.null(X) | is.null(K) | is.null(L)) { stop("Data of one or more variables are not supplied.") } if (setting == "AFT" & is.null(C)) { stop("C has to be supplied for the AFT setting.") } n <- length(Y) if (setting == "GLM") { data_help <- data.frame(Y = Y, X = X, K = K, L = L) data_help <- data_help[complete.cases(data_help), ] U12345_i <- deriv(expr = logL1, namevec = c("alpha0", "alpha1", "alpha2", "alpha3", "sigma1sq", "alpha4", "alphaXY", "sigma2sq"), function.arg = c("y_i", "k_i", "x_i", "l_i", "alpha0", "alpha1", "alpha2", "alpha3", "sigma1sq"), func = T, hessian = T) U678_i <- deriv(expr = logL2, namevec = c("alpha0", "alpha1", "alpha2", "alpha3", "sigma1sq", "alpha4", "alphaXY", "sigma2sq"), function.arg = c("y_i", "k_i", "x_i", "y_bar", "k_bar", "alpha1", "alpha4", "alphaXY", "sigma2sq"), func = T, hessian = T) logL1_deriv <- attributes(U12345_i(y_i = data_help$Y, k_i = data_help$K, x_i = data_help$X, l_i = data_help$L, alpha0 = estimates[names(estimates) == "alpha_0"], alpha1 = estimates[names(estimates) == "alpha_1"], alpha2 = estimates[names(estimates) == "alpha_2"], alpha3 = estimates[names(estimates) == "alpha_3"], sigma1sq = estimates[names(estimates) == "sigma_1_sq"])) logL2_deriv <- attributes(U678_i(y_i = data_help$Y, k_i = data_help$K, x_i = data_help$X, y_bar = mean(data_help$Y), k_bar = mean(data_help$K),
alpha1 = estimates[names(estimates) == "alpha_1"],
alpha4 = estimates[names(estimates) == "alpha_4"],
alphaXY = estimates[names(estimates) == "alpha_XY"],
sigma2sq = estimates[names(estimates) == "sigma_2_sq"]))
deriv_obj <- list(logL1_deriv = logL1_deriv, logL2_deriv = logL2_deriv)
}
if (setting == "AFT") {
data_help <- data.frame(Y = Y, X = X, K = K, L = L, C = C)
data_help <- data_help[complete.cases(data_help), ]
U12345_i <- deriv(expr = logL1, namevec = c("alpha0", "alpha1", "alpha2",
"alpha3", "sigma1", "alpha4", "alphaXY", "sigma2sq"),
function.arg = c("y_i", "c_i", "k_i", "x_i", "l_i",
"alpha0", "alpha1", "alpha2", "alpha3", "sigma1"),
func = T, hessian = T)
U678_i <- deriv(expr = logL2, namevec = c("alpha0", "alpha1", "alpha2",
"alpha3", "sigma1", "alpha4", "alphaXY", "sigma2sq"),
function.arg = c("y_i", "c_i", "k_i", "x_i", "l_i",
"alpha3", "sigma1", "alpha4", "alphaXY", "sigma2sq"),
func = T, hessian = T)
logL1_deriv <- attributes(U12345_i(y_i = data_help$Y, c_i = data_help$C,
k_i = data_help$K, x_i = data_help$X, l_i = data_help$L, alpha0 = estimates[names(estimates) == "alpha_0"], alpha1 = estimates[names(estimates) == "alpha_1"], alpha2 = estimates[names(estimates) == "alpha_2"], alpha3 = estimates[names(estimates) == "alpha_3"], sigma1 = estimates[names(estimates) == "sigma_1"])) logL2_deriv <- attributes(U678_i(y_i = data_help$Y, c_i = data_help$C, k_i = data_help$K, x_i = data_help$X, l_i = data_help$L,
k_bar = mean(data_help$K), alpha0 = estimates[names(estimates) == "alpha_0"], alpha1 = estimates[names(estimates) == "alpha_1"], alpha2 = estimates[names(estimates) == "alpha_2"], alpha3 = estimates[names(estimates) == "alpha_3"], sigma1 = estimates[names(estimates) == "sigma_1"], alpha4 = estimates[names(estimates) == "alpha_4"], alphaXY = estimates[names(estimates) == "alpha_XY"], sigma2sq = estimates[names(estimates) == "sigma_2_sq"])) deriv_obj <- list(logL1_deriv = logL1_deriv, logL2_deriv = logL2_deriv) } return(deriv_obj) } scores <- function(derivobj = NULL) { if (is.null(derivobj)) { stop("derivobj has to be supplied.") } scores_out <- cbind(derivobj[[1]]$gradient[, 1:5],
derivobj[[2]]$gradient[, 6:8]) return(scores_out) } hessian <- function(derivobj = NULL) { if (is.null(derivobj)) { stop("derivobj has to be supplied.") } hessian_out <- derivobj[[1]]$hessian
hessian_out[, 6:8, ] <- derivobj[[2]]$hessian[, 6:8, ] return(hessian_out) } derivobj <- deriv_obj(setting = "GLM", logL1 = estfunct$logL1, logL2 = estfunct$logL2, Y = dat$Y, X = dat$X, K = dat$K, L = dat$L, estimates = estimates) names(derivobj) head(derivobj$logL1_deriv$gradient) score_matrix <- scores(derivobj) head(score_matrix) hessian_matrix <- hessian(derivobj) str(hessian_matrix) The robust Huber-White sandwich estimator of the standard error can then be obtained using the sandwich_se() function: sandwich_se <- function(setting = "GLM", scores = NULL, hessian = NULL) { if (is.null(scores) | is.null(hessian)) { stop("scores and hessian have to be supplied.") } n <- dim(scores)[1] ### A_n matrix ### A_n <- matrix(, nrow = 8, ncol = 8) for (A_n_i in 1:8) { for (A_n_j in 1:8) { A_n[A_n_i, A_n_j] <- sum(hessian[, A_n_i, A_n_j]) } } A_n <- -(1/n) * A_n ### B_n matrix ### B_n <- matrix(, nrow = 8, ncol = 8) for (B_n_i in 1:8) { for (B_n_j in 1:8) { B_n[B_n_i, B_n_j] <- sum(scores[, B_n_i] * scores[, B_n_j]) } } B_n <- (1/n) * B_n ### C_n matrix ### C_n <- solve(A_n) %*% B_n %*% t(solve(A_n)) ### Variance estimates of coefficients ### theta_EE_se <- (1/n) * diag(C_n) theta_EE_se <- sqrt(theta_EE_se) if (setting == "GLM") { names(theta_EE_se) <- c("alpha_0", "alpha_1", "alpha_2", "alpha_3", "sigma_1_sq", "alpha_4", "alpha_XY", "sigma_2_sq") } if (setting == "AFT") { names(theta_EE_se) <- c("alpha_0", "alpha_1", "alpha_2", "alpha_3", "sigma_1", "alpha_4", "alpha_XY", "sigma_2_sq") } return(theta_EE_se) } sandwich_se(scores = score_matrix, hessian = hessian_matrix) Alternatively, bootstrap standard error estimates can be computed using the bootstrap_se() function. Also, for comparison, the function naive_se() computes naive standard error estimates of the parameter estimates of$\alpha_0, \alpha_1, \alpha_2, \alpha_3, \alpha_4, \alpha_{XY}$without accounting for the additional variability due to the two stages in the model in (1)-(3): bootstrap_se <- function(setting = "GLM", BS_rep = 1000, Y = NULL, X = NULL, K = NULL, L = NULL, C = NULL) { if (is.null(setting)) { stop("setting has to be supplied.") } if (is.null(Y) | is.null(X) | is.null(K) | is.null(L)) { stop("Data of one or more variables are not supplied.") } if (setting == "AFT" & is.null(C)) { stop("C has to be supplied for the AFT setting.") } n <- length(Y) if (setting == "GLM") { alpha_0_SE_BS_help <- alpha_1_SE_BS_help <- alpha_2_SE_BS_help <- alpha_3_SE_BS_help <- sigma_1_sq_SE_BS_help <- alpha_4_SE_BS_help <- alpha_XY_SE_BS_help <- sigma_2_sq_SE_BS_help <- NULL } if (setting == "AFT") { alpha_0_SE_BS_help <- alpha_1_SE_BS_help <- alpha_2_SE_BS_help <- alpha_3_SE_BS_help <- sigma_1_SE_BS_help <- alpha_4_SE_BS_help <- alpha_XY_SE_BS_help <- sigma_2_sq_SE_BS_help <- NULL } for (rep in 1:BS_rep) { id <- sample(1:n, n, replace = TRUE) if (setting == "GLM") { data_id <- data.frame(X = X[id], L = L[id], K = K[id], Y = Y[id]) estimates <- get_estimates(setting = setting, Y = data_id$Y, X = data_id$X, K = data_id$K, L = data_id$L) } if (setting == "AFT") { data_id <- data.frame(X = X[id], L = L[id], K = K[id], Y = Y[id], T = T[id], C = C[id]) estimates <- get_estimates(setting = setting, Y = data_id$Y, X = data_id$X, K = data_id$K, L = data_id$L, C = data_id$C)
}
alpha_0_SE_BS_help[rep] <- estimates[names(estimates) == "alpha_0"]
alpha_1_SE_BS_help[rep] <- estimates[names(estimates) == "alpha_1"]
alpha_2_SE_BS_help[rep] <- estimates[names(estimates) == "alpha_2"]
alpha_3_SE_BS_help[rep] <- estimates[names(estimates) == "alpha_3"]
if (setting == "GLM") {
sigma_1_sq_SE_BS_help[rep] <- estimates[names(estimates) == "sigma_1_sq"]
}
if (setting == "AFT") {
sigma_1_SE_BS_help[rep] <- estimates[names(estimates) == "sigma_1"]
}
alpha_4_SE_BS_help[rep] <- estimates[names(estimates) == "alpha_4"]
alpha_XY_SE_BS_help[rep] <- estimates[names(estimates) == "alpha_XY"]
sigma_2_sq_SE_BS_help[rep] <- estimates[names(estimates) == "sigma_2_sq"]
}
if (setting == "GLM") {
theta_bootstrap_se <- c(sd(alpha_0_SE_BS_help,na.rm=T),
sd(alpha_1_SE_BS_help,na.rm=T),
sd(alpha_2_SE_BS_help,na.rm=T),
sd(alpha_3_SE_BS_help,na.rm=T),
sd(sigma_1_sq_SE_BS_help,na.rm=T),
sd(alpha_4_SE_BS_help,na.rm=T),
sd(alpha_XY_SE_BS_help,na.rm=T),
sd(sigma_2_sq_SE_BS_help,na.rm=T))
names(theta_bootstrap_se) <- c("alpha_0", "alpha_1", "alpha_2", "alpha_3",
"sigma_1_sq", "alpha_4", "alpha_XY", "sigma_2_sq")
}
if (setting == "AFT") {
theta_bootstrap_se <- c(sd(alpha_0_SE_BS_help,na.rm=T),
sd(alpha_1_SE_BS_help,na.rm=T),
sd(alpha_2_SE_BS_help,na.rm=T),
sd(alpha_3_SE_BS_help,na.rm=T),
sd(sigma_1_SE_BS_help,na.rm=T),
sd(alpha_4_SE_BS_help,na.rm=T),
sd(alpha_XY_SE_BS_help,na.rm=T),
sd(sigma_2_sq_SE_BS_help,na.rm=T))
names(theta_bootstrap_se) <- c("alpha_0", "alpha_1", "alpha_2", "alpha_3",
"sigma_1", "alpha_4", "alpha_XY", "sigma_2_sq")
}
return(theta_bootstrap_se)
}
naive_se <- function(setting = "GLM", Y = NULL, X = NULL,
K = NULL, L = NULL, C = NULL) {
if (!requireNamespace("survival", quietly = TRUE)) {
stop("Pkg needed for this function to work. Please install it.", call. = FALSE)
}
if (is.null(setting)) {
stop("setting has to be supplied.")
}
if (is.null(Y) | is.null(X) | is.null(K) | is.null(L)) {
stop("Data of one or more variables are not supplied.")
}
if (setting == "AFT" & is.null(C)) {
stop("C has to be supplied for the AFT setting.")
}
n <- length(Y)

if (setting == "GLM") {
data_help <- data.frame(Y = Y, X = X, K = K, L = L)
data_help <- data_help[complete.cases(data_help), ]
######### Stage 1 #########
fit_stage_1 <- lm(data_help$Y ~ data_help$K + data_help$X + data_help$L)

alpha_1_out <- summary(fit_stage_1)$coefficients[2, 1] alpha_0_SE_out <- summary(fit_stage_1)$coefficients[1, 2]
alpha_1_SE_out <- summary(fit_stage_1)$coefficients[2, 2] alpha_2_SE_out <- summary(fit_stage_1)$coefficients[3, 2]
alpha_3_SE_out <- summary(fit_stage_1)$coefficients[4, 2] ######### Stage 2 ######### Y_tilde <- data_help$Y - mean(data_help$Y) - alpha_1_out * (data_help$K - mean(data_help$K)) fit_stage_2 <- lm(Y_tilde ~ data_help$X)

alpha_4_SE_out <- summary(fit_stage_2)$coefficients[1, 2] alpha_XY_SE_out <- summary(fit_stage_2)$coefficients[2, 2]
}
if (setting == "AFT") {
data_help <- data.frame(Y = Y, X = X, K = K, L = L, C = C)
data_help <- data_help[complete.cases(data_help), ]
######### Stage 1 #########
fit_stage_1 <- survival::survreg(survival::Surv(data_help$Y, data_help$C) ~
data_help$K + data_help$X + data_help$L, dist = "gaussian") alpha_1_out <- summary(fit_stage_1)$table[2, 1]
sigma_1_out <- fit_stage_1$scale alpha_0_SE_out <- summary(fit_stage_1)$table[1, 2]
alpha_1_SE_out <- summary(fit_stage_1)$table[2, 2] alpha_2_SE_out <- summary(fit_stage_1)$table[3, 2]
alpha_3_SE_out <- summary(fit_stage_1)$table[4, 2] ######### Stage 2 ######### mu <- fit_stage_1$linear.predictors
Y_adj <- data_help$C * data_help$Y + (1 - data_help$C) * (mu + (sigma_1_out * dnorm((data_help$Y - mu)/sigma_1_out, mean = 0, sd = 1)/
(1 - pnorm((data_help$Y - mu)/sigma_1_out, mean = 0, sd = 1)))) Y_tilde <- Y_adj - mean(Y_adj) - alpha_1_out * (data_help$K - mean(data_help$K)) fit_stage_2 <- lm(Y_tilde ~ data_help$X)

alpha_4_SE_out <- summary(fit_stage_2)$coefficients[1, 2] alpha_XY_SE_out <- summary(fit_stage_2)$coefficients[2, 2]
}
SE_estimates <- c(alpha_0_SE_out, alpha_1_SE_out, alpha_2_SE_out, alpha_3_SE_out,
NA, alpha_4_SE_out, alpha_XY_SE_out, NA)
names(SE_estimates) <- c("alpha_0", "alpha_1", "alpha_2", "alpha_3", "sigma_1_sq",
"alpha_4", "alpha_XY", "sigma_2_sq")
return(SE_estimates)
}
bootstrap_se(setting = "GLM", BS_rep = 1000, Y = dat$Y, X = dat$X, K = dat$K, L = dat$L)
naive_se(setting = "GLM", Y = dat$Y, X = dat$X, K = dat$K, L = dat$L)

Finally, the functions ciee() and ciee_loop() allow an easy integrated use of all above functions and a simultaneous computation of the estimating equations approach using either standard error computation, the traditional regression-based approaches, and the SEM method. ciee() fits the model in equations (1)-(3) (e.g. the model in Figure 1) and yields parameter estimates, standard error estimates, and p-values for all parameters. ciee_loop() provides an extension of ciee() and allows the input of multiple exposure variables (e.g. multiple SNPs) which are tested sequentially. In the output of ciee_loop(), only the coefficient estimates, standard error estimates, and p-values with respect to the direct effect $\alpha_{XY}$ are provided.

ciee <- function(setting = "GLM", estimates = c("ee", "mult_reg", "res_reg", "sem"),
ee_se = c("sandwich"), BS_rep = NULL, Y = NULL, X = NULL, K = NULL,
L = NULL, C = NULL) {
if (is.null(setting)) {
stop("setting has to be supplied.")
}
if (is.null(estimates)) {
stop("At least one method has to be computed.")
}
if ((("ee" %in% estimates) & is.null(ee_se)) | (("ee" %in% estimates) &
length(ee_se) > 1)) {
stop("If the estimating equations approach is chosen, one approach has to be chosen for the computation of standard errors.")
}
if (("bootstrap" %in% estimates) & is.null(BS_rep)) {
stop("For the computation of bootstrap standard errors, the number of bootstrap samples has to be chosen.")
}
if (is.null(Y) | is.null(X) | is.null(K) | is.null(L)) {
stop("Data of one or more variables is not supplied.")
}
if (setting == "AFT" & is.null(C)) {
stop("C has to be supplied for the AFT setting.")
}
if (setting == "AFT" & ("sem" %in% estimates)) {
stop("The structural equations modeling approach is only implemented for the GLM setting.")
}
if (setting == "AFT" & ("res_reg" %in% estimates)) {
stop("The regression of residuals approach is only implemented for the GLM setting.")
}
if (setting == "GLM") {
data_help <- data.frame(Y = Y, X = X, K = K, L = L)
data_help <- data_help[complete.cases(data_help), ]
if ("sem" %in% estimates) {
results_sem <- sem_appl(Y = data_help$Y, X = data_help$X,
K = data_help$K, L = data_help$L)
}
if ("mult_reg" %in% estimates) {
results_mult_reg <- mult_reg(setting = setting, Y = data_help$Y, X = data_help$X, K = data_help$K, L = data_help$L)
}
if ("res_reg" %in% estimates) {
results_res_reg <- res_reg(Y = data_help$Y, X = data_help$X,
K = data_help$K, L = data_help$L)
}
if ("ee" %in% estimates) {
point_estimates_ee <- get_estimates(setting = setting, Y = data_help$Y, X = data_help$X, K = data_help$K, L = data_help$L)
if (ee_se == "sandwich") {
# Obtain estimating functions expressions
estfunct <- est_funct_expr(setting = "GLM")
# Obtain matrices with all first and second derivatives
derivobj <- deriv_obj(setting = setting, logL1 = estfunct$logL1, logL2 = estfunct$logL2, Y = data_help$Y, X = data_help$X, K = data_help$K, L = data_help$L,
estimates = point_estimates_ee)
# Obtain score and hessian matrices
results_scores <- scores(derivobj)
results_hessian <- hessian(derivobj)
# Obtain sandwich standard error estimates of the parameters
se_estimates_ee <- sandwich_se(setting = setting,
scores = results_scores,
hessian = results_hessian)
}
if (ee_se == "bootstrap") {
se_estimates_ee <- bootstrap_se(setting = setting, BS_rep = BS_rep,
Y = data_help$Y, X = data_help$X,
K = data_help$K, L = data_help$L)
}
if (ee_se == "naive") {
se_estimates_ee <- naive_se(setting = setting, Y = data_help$Y, X = data_help$X, K = data_help$K, L = data_help$L)
}
wald_test_stat_ee <- point_estimates_ee[1:8]/se_estimates_ee
pvalues_ee <- 2 * pnorm(-abs(wald_test_stat_ee))
results_ee <- list(point_estimates = point_estimates_ee[1:8],
SE_estimates = se_estimates_ee,
wald_test_stat = wald_test_stat_ee,
pvalues = pvalues_ee)
}
}
if (setting == "AFT") {
data_help <- data.frame(Y = Y, X = X, K = K, L = L, C = C)
data_help <- data_help[complete.cases(data_help), ]
if ("mult_reg" %in% estimates) {
results_mult_reg <- mult_reg(setting = setting, Y = data_help$Y, X = data_help$X, K = data_help$K, L = data_help$L, C = data_help$C) } if ("ee" %in% estimates) { point_estimates_ee <- get_estimates(setting = setting, Y = data_help$Y,
X = data_help$X, K = data_help$K,
L = data_help$L, C = data_help$C)
if (ee_se == "sandwich") {
# Obtain estimating functions expressions
estfunct <- est_funct_expr(setting = setting)
# Obtain matrices with all first and second derivatives
derivobj <- deriv_obj(setting = setting, logL1 = estfunct$logL1, logL2 = estfunct$logL2, Y = data_help$Y, X = data_help$X, K = data_help$K, L = data_help$L, C = data_help$C, estimates = point_estimates_ee) # Obtain score and hessian matrices results_scores <- scores(derivobj) results_hessian <- hessian(derivobj) # Obtain sandwich standard error estimates of the parameters se_estimates_ee <- sandwich_se(setting = setting, scores = results_scores, hessian = results_hessian) } if (ee_se == "bootstrap") { se_estimates_ee <- bootstrap_se(setting = setting, BS_rep = BS_rep, Y = data_help$Y, X = data_help$X, K = data_help$K, L = data_help$L, C = data_help$C)
}
if (ee_se == "naive") {
se_estimates_ee <- naive_se(setting = setting, Y = data_help$Y, X = data_help$X, K = data_help$K, L = data_help$L, C = data_help$C) } wald_test_stat_ee <- point_estimates_ee[1:8]/se_estimates_ee pvalues_ee <- 2 * pnorm(-abs(wald_test_stat_ee)) results_ee <- list(point_estimates = point_estimates_ee[1:8], SE_estimates = se_estimates_ee, wald_test_stat = wald_test_stat_ee, pvalues = pvalues_ee) } } output <- list() if ("ee" %in% estimates) { output$results_ee <- results_ee
}
if ("mult_reg" %in% estimates) {
output$results_mult_reg <- results_mult_reg } if ("res_reg" %in% estimates) { output$results_res_reg <- results_res_reg
}
if ("sem" %in% estimates) {
output$results_sem <- results_sem } class(output) <- "ciee" return(output) } ciee_loop <- function(setting = "GLM", estimates = c("ee", "mult_reg", "res_reg", "sem"), ee_se = c("sandwich"), BS_rep = NULL, Y = NULL, X = NULL, K = NULL, L = NULL, C = NULL) { if (is.null(setting)) { stop("setting has to be supplied.") } if (is.null(estimates)) { stop("At least one method has to be computed.") } if ((("ee" %in% estimates) & is.null(ee_se)) | (("ee" %in% estimates) & length(ee_se) > 1)) { stop("If the estimating equations approach is chosen, one approach has to be chosen for the computation of standard errors.") } if (("bootstrap" %in% estimates) & is.null(BS_rep)) { stop("For the computation of bootstrap standard errors, the number of bootstrap samples has to be chosen.") } if (is.null(Y) | is.null(X) | is.null(K) | is.null(L)) { stop("Data of one or more variables is not supplied.") } if (setting == "AFT" & is.null(C)) { stop("C has to be supplied for the AFT setting.") } if (setting == "AFT" & ("sem" %in% estimates)) { stop("The structural equations modeling approach is only implemented for the GLM setting.") } if (setting == "AFT" & ("res_reg" %in% estimates)) { stop("The regression of residuals approach is only implemented for the GLM setting.") } if ("sem" %in% estimates) { results_sem <- list(point_estimates = NULL, SE_estimates = NULL, pvalues = NULL) } if ("mult_reg" %in% estimates) { results_mult_reg <- list(point_estimates = NULL, SE_estimates = NULL, pvalues = NULL) } if ("res_reg" %in% estimates) { results_res_reg <- list(point_estimates = NULL, SE_estimates = NULL, pvalues = NULL) } if ("ee" %in% estimates) { results_ee <- list(point_estimates = NULL, SE_estimates = NULL, wald_test_stat = NULL, pvalues = NULL) } k <- dim(X)[2] for (i in 1:k) { Xi <- X[, i] if (setting == "GLM") { data_help <- data.frame(Y = Y, X = Xi, K = K, L = L) data_help <- data_help[complete.cases(data_help), ] if ("sem" %in% estimates) { sem_help <- sem_appl(Y = data_help$Y, X = data_help$X, K = data_help$K, L = data_help$L) results_sem$point_estimates[i] <- sem_help$point_estimates[5] results_sem$SE_estimates[i] <- sem_help$SE_estimates[5] results_sem$pvalues[i] <- sem_help$pvalues[5] names(results_sem$point_estimates)[i] <-
names(results_sem$SE_estimates)[i] <- names(results_sem$pvalues)[i] <- names(X)[i]
}
if ("mult_reg" %in% estimates) {
mult_reg_help <- mult_reg(setting = setting, Y = data_help$Y, X = data_help$X, K = data_help$K, L = data_help$L)
results_mult_reg$point_estimates[i] <- mult_reg_help$point_estimates[3]
results_mult_reg$SE_estimates[i] <- mult_reg_help$SE_estimates[3]
results_mult_reg$pvalues[i] <- mult_reg_help$pvalues[3]
names(results_mult_reg$point_estimates)[i] <- names(results_mult_reg$SE_estimates)[i] <-
names(results_mult_reg$pvalues)[i] <- names(X)[i] } if ("res_reg" %in% estimates) { res_reg_help <- res_reg(Y = data_help$Y, X = data_help$X, K = data_help$K, L = data_help$L) results_res_reg$point_estimates[i] <- res_reg_help$point_estimates[5] results_res_reg$SE_estimates[i] <- res_reg_help$SE_estimates[5] results_res_reg$pvalues[i] <- res_reg_help$pvalues[5] names(results_res_reg$point_estimates)[i] <-
names(results_res_reg$SE_estimates)[i] <- names(results_res_reg$pvalues)[i] <- names(X)[i]
}
if ("ee" %in% estimates) {
point_estimates_ee <- get_estimates(setting = setting,
Y = data_help$Y, X = data_help$X,
K = data_help$K, L = data_help$L)
if (ee_se == "sandwich") {
# Obtain estimating functions expressions
estfunct <- est_funct_expr(setting = "GLM")
# Obtain matrices with all first and second derivatives
derivobj <- deriv_obj(setting = setting,
logL1 = estfunct$logL1, logL2 = estfunct$logL2,
Y = data_help$Y, X = data_help$X,
K = data_help$K, L = data_help$L,
estimates = point_estimates_ee)
# Obtain score and hessian matrices
results_scores <- scores(derivobj)
results_hessian <- hessian(derivobj)
# Obtain sandwich standard error estimates of the parameters
se_estimates_ee <- sandwich_se(setting = setting,
scores = results_scores,
hessian = results_hessian)
}
if (ee_se == "bootstrap") {
se_estimates_ee <- bootstrap_se(setting = setting,
BS_rep = BS_rep,
Y = data_help$Y, X = data_help$X,
K = data_help$K, L = data_help$L)
}
if (ee_se == "naive") {
se_estimates_ee <- naive_se(setting = setting,
Y = data_help$Y, X = data_help$X,
K = data_help$K, L = data_help$L)
}
results_ee$point_estimates[i] <- point_estimates_ee[7] results_ee$SE_estimates[i] <- se_estimates_ee[7]
results_ee$wald_test_stat[i] <- point_estimates_ee[7]/ se_estimates_ee[7] results_ee$pvalues[i] <- 2 * pnorm(-abs(point_estimates_ee[7]/
se_estimates_ee[7]))
names(results_ee$point_estimates)[i] <- names(results_ee$SE_estimates)[i] <-
names(results_ee$wald_test_stat)[i] <- names(results_ee$pvalues)[i] <- names(X)[i]
}
}
if (setting == "AFT") {
data_help <- data.frame(Y = Y, X = Xi, K = K, L = L, C = C)
data_help <- data_help[complete.cases(data_help), ]
if ("mult_reg" %in% estimates) {
mult_reg_help <- mult_reg(setting = setting, Y = data_help$Y, X = data_help$X, K = data_help$K, L = data_help$L, C = data_help$C) results_mult_reg$point_estimates[i] <- mult_reg_help$point_estimates[3] results_mult_reg$SE_estimates[i] <- mult_reg_help$SE_estimates[3] results_mult_reg$pvalues[i] <- mult_reg_help$pvalues[3] names(results_mult_reg$point_estimates)[i] <-
names(results_mult_reg$SE_estimates)[i] <- names(results_mult_reg$pvalues)[i] <- names(X)[i]
}
if ("ee" %in% estimates) {
point_estimates_ee <- get_estimates(setting = setting,
Y = data_help$Y, X = data_help$X,
K = data_help$K, L = data_help$L,
C = data_help$C) if (ee_se == "sandwich") { # Obtain estimating functions expressions estfunct <- est_funct_expr(setting = setting) # Obtain matrices with all first and second derivatives derivobj <- deriv_obj(setting = setting, logL1 = estfunct$logL1,
logL2 = estfunct$logL2, Y = data_help$Y,
X = data_help$X, K = data_help$K,
L = data_help$L, C = data_help$C,
estimates = point_estimates_ee)
# Obtain score and hessian matrices
results_scores <- scores(derivobj)
results_hessian <- hessian(derivobj)
# Obtain sandwich standard error estimates of the parameters
se_estimates_ee <- sandwich_se(setting = setting,
scores = results_scores,
hessian = results_hessian)
}
if (ee_se == "bootstrap") {
se_estimates_ee <- bootstrap_se(setting = setting,
BS_rep = BS_rep,
Y = data_help$Y, X = data_help$X,
K = data_help$K, L = data_help$L,
C = data_help$C) } if (ee_se == "naive") { se_estimates_ee <- naive_se(setting = setting, Y = data_help$Y,
X = data_help$X, K = data_help$K,
L = data_help$L, C = data_help$C)
}
results_ee$point_estimates[i] <- point_estimates_ee[7] results_ee$SE_estimates[i] <- se_estimates_ee[7]
results_ee$wald_test_stat[i] <- point_estimates_ee[7]/ se_estimates_ee[7] results_ee$pvalues[i] <- 2 * pnorm(-abs(point_estimates_ee[7]/
se_estimates_ee[7]))
names(results_ee$point_estimates)[i] <- names(results_ee$SE_estimates)[i] <-
names(results_ee$wald_test_stat)[i] <- names(results_ee$pvalues)[i] <- names(X)[i]
}
}
}
output <- list()
if ("ee" %in% estimates) {
output$results_ee <- results_ee } if ("mult_reg" %in% estimates) { output$results_mult_reg <- results_mult_reg
}
if ("res_reg" %in% estimates) {
output$results_res_reg <- results_res_reg } if ("sem" %in% estimates) { output$results_sem <- results_sem
}
class(output) <- "ciee"
return(output)
}
results_ciee <- ciee(setting = "GLM", Y = dat$Y, X = dat$X, K = dat$K, L = dat$L,
estimates = c("ee", "mult_reg", "res_reg", "sem"),
ee_se = "sandwich")
results_ciee

maf <- 0.2
n <- 1000
dat <- generate_data(n = n, maf = maf)
datX <- data.frame(X = dat$X) names(datX)[1] <- "X1" for(i in 2:10){ X <- rbinom(n, size = 2, prob = maf) datX$X <- X
names(datX)[i] <- paste("X", i, sep="")
}
results_ciee_loop <- ciee_loop(setting = "GLM", Y = dat$Y, X = datX, K = dat$K, L = dat$L) results_ciee_loop Both ciee() and ciee_loop() return ciee objects as output, and the implemented summary.ciee() function can be used through the generic summary() to provide a reader-friendly formatted output of the results. summary.ciee <- function(results = NULL) { if (is.null(results)) { stop("ciee output has to be supplied.") } res_out <- NULL if ("results_ee" %in% names(results)) { res_ee_out <- data.frame(point_estimates = results$results_ee$point_estimates, SE_estimates = results$results_ee$SE_estimates, wald_test_stat = results$results_ee$wald_test_stat, pvalues = results$results_ee$pvalues) rownames(res_ee_out) <- paste("CIEE", rownames(res_ee_out), sep = "_") print(paste("Results based on estimating equations.")) print(res_ee_out) res_out <- res_ee_out[,c(1,2,4)] } if ("results_mult_reg" %in% names(results)) { res_mr_out <- data.frame(point_estimates = results$results_mult_reg$point_estimates, SE_estimates = results$results_mult_reg$SE_estimates, pvalues = results$results_mult_reg$pvalues) rownames(res_mr_out) <- paste("MR", rownames(res_mr_out), sep = "_") print(paste("Results based on traditional multiple regression.")) print(res_mr_out) res_out <- rbind(res_out, res_mr_out) } if ("results_res_reg" %in% names(results)) { res_rr_out <- data.frame(point_estimates = results$results_res_reg$point_estimates, SE_estimates = results$results_res_reg$SE_estimates, pvalues = results$results_res_reg$pvalues) rownames(res_rr_out) <- paste("RR", rownames(res_rr_out), sep = "_") print(paste("Results based on traditional regression of residuals.")) print(res_rr_out) res_out <- rbind(res_out, res_rr_out) } if ("results_sem" %in% names(results)) { res_sem_out <- data.frame(point_estimates = results$results_sem$point_estimates, SE_estimates = results$results_sem$SE_estimates, pvalues = results$results_sem\$pvalues)
rownames(res_sem_out) <- paste("SEM", rownames(res_sem_out), sep = "_")
print(paste("Results based on structural equation modeling."))
print(res_sem_out)
res_out <- rbind(res_out, res_sem_out)
}
invisible(res_out)
}
summary(results_ciee)
summary(results_ciee_loop)

## References

Bollen KA (1989). Structural equations with latent variables. New York: John Wiley & Sons.

Konigorski S, Wang Y, Cigsar C, Yilmaz YE (2018). Estimating and testing direct genetic effects in directed acyclic graphs using estimating equations. Genetic Epidemiology, 42: 174-186.

Rosseel Y (2012). lavaan: an R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36.

Vansteelandt S, Goetgeluk S, Lutz S, et al. (2009). On the adjustment for covariates in genetic association analysis: a novel, simple principle to infer direct causal effects. Genetic Epidemiology, 33, 394-405.

## Try the CIEE package in your browser

Any scripts or data that you put into this service are public.

CIEE documentation built on May 2, 2019, 6:39 a.m.