gComp: Estimate difference and ratio effects with 95% confidence...

gComp    R Documentation

Estimate difference and ratio effects with 95% confidence intervals.

Description

Obtain a point estimate and 95% confidence interval for difference and ratio effects comparing exposed and unexposed (or treatment and non-treatment) groups using g-computation.

Usage

gComp(
  data,
  outcome.type = c("binary", "count", "count_nb", "rate", "rate_nb", "continuous"),
  formula = NULL,
  Y = NULL,
  X = NULL,
  Z = NULL,
  subgroup = NULL,
  offset = NULL,
  rate.multiplier = 1,
  exposure.scalar = 1,
  R = 200,
  clusterID = NULL,
  parallel = "no",
  ncpus = getOption("boot.ncpus", 1L)
)

Arguments

data

(Required) A data.frame containing variables for Y, X, and Z or with variables matching the model variables specified in a user-supplied formula. Data set should also contain variables for the optional subgroup and offset, if they are specified.

outcome.type

(Required) Character argument to describe the outcome type. Acceptable responses, and the corresponding error distribution and link function used in the glm, include:

binary

(Default) A binomial distribution with link = 'logit' is used.

count

A Poisson distribution with link = 'log' is used.

count_nb

A negative binomial model with link = 'log' is used, where the theta parameter is estimated internally; ideal for over-dispersed count data.

rate

A Poisson distribution with link = 'log' is used; ideal for events/person-time outcomes.

rate_nb

A negative binomial model with link = 'log' is used, where the theta parameter is estimated internally; ideal for over-dispersed events/person-time outcomes.

continuous

A gaussian distribution with link = 'identity' is used.
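The correspondence between outcome.type and the error distribution and link can be sketched in base R. The helper name family_for is hypothetical and is not part of the package. The sketch returns NULL for the two negative binomial types because base R has no family object with an estimated theta, so those models need a dedicated fitting routine.

```r
# Hypothetical helper (not part of the package): map an outcome.type string
# to the stats family object described above.
family_for <- function(outcome.type) {
  switch(outcome.type,
    binary     = binomial(link = "logit"),
    count      = poisson(link = "log"),
    rate       = poisson(link = "log"),
    continuous = gaussian(link = "identity"),
    count_nb   = NULL,  # negative binomial: theta is estimated, no base family
    rate_nb    = NULL,
    stop("unknown outcome.type: ", outcome.type)
  )
}

fam <- family_for("rate")
fam$family
fam$link
```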

formula

(Optional) Default NULL. An object of class "formula" (or one that can be coerced to that class) which provides the complete model formula, similar to the formula for the glm function in R (e.g. 'Y ~ X + Z1 + Z2 + Z3'). Can be supplied as a character or formula object. If no formula is provided, Y and X must be provided.
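As a small illustration of what "can be coerced to that class" means, the base R sketch below turns a character string into a formula object. The variable names are the placeholders from the description above, not columns of a real data set.

```r
# A model formula supplied as a character string ...
f_chr <- "Y ~ X + Z1 + Z2 + Z3"

# ... can be coerced to a formula object with as.formula().
f <- as.formula(f_chr)
class(f)
all.vars(f)  # variable names referenced by the formula
```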

Y

(Optional) Default NULL. Character argument which specifies the outcome variable. Can optionally provide a formula instead of Y and X variables.

X

(Optional) Default NULL. Character argument which specifies the exposure variable (or treatment group assignment), which can be binary, categorical, or continuous. This variable can be supplied as a factor variable (for binary or categorical exposures) or a continuous variable. For binary/categorical exposures, X should be supplied as a factor with the lowest level set to the desired referent. Numeric variables are accepted, but will be centered (see Note). Character variables are not accepted and will throw an error. Can optionally provide a formula instead of Y and X variables.
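A character exposure can be converted to a factor with the desired referent as its lowest level before the data are passed to the function. The sketch below uses base R only; the vector and its labels are made up for illustration.

```r
# Toy exposure stored as character, which would not be accepted as-is.
exposure <- c("treated", "untreated", "treated", "untreated", "treated")

# Convert to a factor and list the desired referent first.
x <- factor(exposure, levels = c("untreated", "treated"))
levels(x)[1]  # the referent level

# An existing factor can be re-referenced with relevel().
x2 <- relevel(factor(exposure), ref = "untreated")
identical(levels(x), levels(x2))
```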

Z

(Optional) Default NULL. List or single character vector which specifies the names of covariates or other variables to adjust for in the glm function. All variables should either be factors, continuous, or coded 0/1 (i.e. not character variables). Does not allow interaction terms.

subgroup

(Optional) Default NULL. Character argument that indicates subgroups for stratified analysis. Effects will be reported for each category of the subgroup variable. Variable will be automatically converted to a factor if not already.

offset

(Optional, only applicable for rate/count outcomes) Default NULL. Character argument which specifies the variable name to be used as the person-time denominator for rate outcomes to be included as an offset in the Poisson regression model. Numeric variable should be on the linear scale; function will take natural log before including in the model.
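The following base R sketch shows what including person-time as an offset in a Poisson model looks like. The data are simulated and the variable names (x, persontime, events) are illustrative assumptions. The person-time column is kept on the linear scale and logged only inside the model formula, mirroring the description above.

```r
set.seed(1)
n <- 500
dat <- data.frame(
  x = rbinom(n, 1, 0.5),
  persontime = runif(n, 0.5, 5)  # follow-up on the linear scale, e.g. years
)
# Simulated event counts: the rate depends on x, the count on rate * person-time.
dat$events <- rpois(n, lambda = dat$persontime * exp(-1 + 0.5 * dat$x))

# The natural log of person-time enters the linear predictor as an offset.
fit <- glm(events ~ x + offset(log(persontime)),
           family = poisson(link = "log"), data = dat)
coef(fit)  # the coefficient for x is a log rate ratio
```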

rate.multiplier

(Optional, only applicable for rate/count outcomes). Default 1. Numeric variable signifying the person-time value to use in predictions; the offset variable will be set to this when predicting under the counterfactual conditions. This value should be set to the person-time denominator desired for the rate difference measure and must be inputted in the units of the original offset variable (e.g. if the offset variable is in days and the desired rate difference is the rate per 100 person-years, rate.multiplier should be inputted as 365.25*100).

exposure.scalar

(Optional, only applicable for continuous exposure) Default 1. Numeric value to scale effects with a continuous exposure. This option facilitates reporting effects for an interpretable contrast (i.e. magnitude of difference) within the continuous exposure. For example, if the continuous exposure is age in years, an exposure.scalar of 10 would result in estimates per 10-year increase in age rather than per 1-year increase in age.

R

(Optional) Default 200. The number of data resamples to be conducted to produce the bootstrap confidence interval of the estimate.

clusterID

(Optional) Default NULL. Character argument which specifies the variable name for the unique identifier for clusters. This option specifies that clustering should be accounted for in the calculation of confidence intervals. The clusterID will be used as the level for resampling in the bootstrap procedure.
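Resampling at the cluster level can be sketched in base R as follows. The helper resample_clusters and the toy data are hypothetical. The sketch only illustrates the idea of drawing whole clusters with replacement and is not the package's implementation.

```r
# Toy data: three clusters of unequal size (2, 3 and 5 rows).
dat <- data.frame(
  clinic = rep(c("a", "b", "c"), times = c(2, 3, 5)),
  y = 1:10
)

# Hypothetical helper: draw whole clusters with replacement.
resample_clusters <- function(data, cluster_var) {
  ids <- unique(data[[cluster_var]])
  drawn <- sample(ids, size = length(ids), replace = TRUE)
  # Keep every row of each drawn cluster; a cluster drawn twice appears twice.
  pieces <- lapply(drawn, function(id) {
    data[data[[cluster_var]] == id, , drop = FALSE]
  })
  do.call(rbind, pieces)
}

set.seed(538)
sizes <- replicate(5, nrow(resample_clusters(dat, "clinic")))
sizes  # three clusters are drawn each time, but the row count can vary
```

Because the clusters are unbalanced, each resample contains between 6 and 15 rows depending on which clusters are drawn.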

parallel

(Optional) Default "no". The type of parallel operation to be used. Available options (besides the default of no parallel processing) are "multicore" (not available on Windows) and "snow". This argument is passed directly to boot. See the Note below about setting seeds with parallel computing.

ncpus

(Optional, only used if parallel is set to "multicore" or "snow") Default 1. Integer argument for the number of CPUs available for parallel processing (i.e. the number of parallel operations to be used). This argument is passed directly to boot.

Details

The gComp function executes the following steps:

  1. Calls the pointEstimate function on the data to obtain the appropriate effect estimates (difference, ratio, etc.).

  2. Generates R bootstrap resamples of the data, with replacement. If the resampling is to be done at the cluster level (set using the clusterID argument), the number of clusters will remain constant but the total number of observations in each resampled data set might be different if clusters are not balanced.

  3. Calls the pointEstimate function on each of the resampled data sets.

  4. Calculates the 95% confidence interval of the difference and ratio estimates using the results obtained from the R resampled parameter estimates.
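The four steps above can be sketched in base R for a binary outcome and a binary exposure. This is a simplified illustration on simulated data. The function point_estimate is a stand-in written for this sketch, not the package's pointEstimate, and the percentile interval is an assumption based on the 2.5% and 97.5% limits described under Value.

```r
set.seed(538)
n <- 400
dat <- data.frame(z = rnorm(n), x = rbinom(n, 1, 0.5))
dat$y <- rbinom(n, 1, plogis(-1 + 0.8 * dat$x + 0.5 * dat$z))

# Stand-in for the point-estimate step: fit the outcome model, predict for
# everyone under x = 1 and under x = 0, and contrast the marginal means.
point_estimate <- function(d) {
  fit <- glm(y ~ x + z, family = binomial(link = "logit"), data = d)
  p1 <- mean(predict(fit, newdata = transform(d, x = 1), type = "response"))
  p0 <- mean(predict(fit, newdata = transform(d, x = 0), type = "response"))
  c(risk_difference = p1 - p0, risk_ratio = p1 / p0)
}

est <- point_estimate(dat)          # step 1: estimate on the original data
R <- 200
boot_est <- replicate(R, {          # steps 2 and 3: resample rows, re-estimate
  idx <- sample(nrow(dat), replace = TRUE)
  point_estimate(dat[idx, ])
})
# Step 4: percentile limits across the R resampled estimates.
ci <- apply(boot_est, 1, quantile, probs = c(0.025, 0.975))
est
ci
```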

As bootstrap resamples are generated with random sampling, users should set a seed (set.seed) for reproducible confidence intervals.
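A short base R check of why the seed matters: resetting the seed before resampling reproduces exactly the same draw, and therefore the same bootstrap result. The seed value is the one used in the Examples section.

```r
set.seed(538)
first <- sample(10, replace = TRUE)

set.seed(538)
second <- sample(10, replace = TRUE)

identical(first, second)  # TRUE: same seed, same resample
```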

While offsets are used to account for differences in follow-up time between individuals in the glm model, rate differences are calculated assuming equivalent follow-up of all individuals (i.e. predictions for each exposure are based on all observations having the same offset value). The default is 1 (specifying 1 unit of the original offset variable) or the user can specify an offset to be used in the predictions with the rate.multiplier argument.
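The base R sketch below illustrates the idea on simulated data: the model is fit with each person's own follow-up as the offset, but predictions under each exposure level set the offset to a common value. The variable names and the simulated rates are assumptions for illustration, with follow-up measured in days and the rate difference reported per 100 person-years.

```r
set.seed(2)
n <- 500
dat <- data.frame(x = rbinom(n, 1, 0.5), days = round(runif(n, 30, 1500)))
# Simulated rates: 0.0005 events per person-day if unexposed, doubled if exposed.
dat$events <- rpois(n, dat$days * 0.0005 * 2^dat$x)

# Fit with individual follow-up time as the offset.
fit <- glm(events ~ x + offset(log(days)), family = poisson, data = dat)

# Predict with everyone given the same follow-up: 100 person-years in days.
rate.multiplier <- 365.25 * 100
predict_rate <- function(x_value) {
  nd <- transform(dat, x = x_value, days = rate.multiplier)
  mean(predict(fit, newdata = nd, type = "response"))
}

rate_diff <- predict_rate(1) - predict_rate(0)
rate_diff  # events per 100 person-years
```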

Value

An object of class gComp which is a named list with components:

$summary

Summary providing parameter estimates and 95% confidence limits of the outcome difference and ratio (in a print-pretty format)

$results.df

Data.frame with parameter estimates, 2.5% confidence limit, and 97.5% confidence limit each as a column (which can be used for easy incorporation into tables for publication)

$n

Number of unique observations in the original dataset

$R

Number of bootstrap iterations

$boot.result

Data.frame containing the results of the R bootstrap iterations of the g-computation

$contrast

Contrast levels compared

$family

Error distribution used in the model

$formula

Model formula used to fit the glm

$predicted.outcome

A data.frame with the marginal mean predicted outcomes (with 95% confidence limits) for each exposure level (i.e. under both exposed and unexposed counterfactual predictions)

$glm.result

The glm class object returned from the fitted regression of the outcome on the exposure and relevant covariates.

Note

Note that for a protective exposure (risk difference less than 0), the 'Number needed to treat/harm' is interpreted as the number needed to treat, and for a harmful exposure (risk difference greater than 0), it is interpreted as the number needed to harm. Note also that confidence intervals are not reported for the number needed to treat/harm. If the confidence interval (CI) for the risk difference crosses the null, the construction of the CI for the number needed to treat/harm is not well defined. Challenges and options for reporting the number needed to treat/harm CI are reviewed extensively in Altman 1998, Hutton 2000, and Stang 2010, with a consensus that an appropriate interval would have two segments, one bounded at negative infinity and the other at positive infinity. Because the number needed to treat/harm is most useful as a communication tool and is directly derived from the risk difference, which has a CI that provides a more interpretable measure of precision, we do not report the CI for the number needed to treat/harm. If the CI of the risk difference does not cross the null, the number needed to treat/harm CI can be calculated straightforwardly by taking the inverse of each confidence bound of the risk difference.
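As a worked example of the last point, the sketch below derives the number needed to treat and its interval from a risk difference whose CI does not cross the null. The numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical risk difference and 95% CI for a protective exposure.
rd <- -0.10
rd_ci <- c(lower = -0.15, upper = -0.05)

# Number needed to treat: the inverse of the absolute risk difference.
nnt <- 1 / abs(rd)

# The interval is only well defined when the CI excludes the null (0).
crosses_null <- rd_ci["lower"] < 0 && rd_ci["upper"] > 0
nnt_ci <- if (crosses_null) NA else sort(unname(1 / abs(rd_ci)))

nnt
nnt_ci
```

Here the number needed to treat is 10, and inverting the two confidence bounds gives an interval from 1/0.15 to 1/0.05.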

For continuous exposure variables, the default effects are provided for a one unit difference in the exposure at the mean value of the exposure variable. Because the underlying parametric model for a binary outcome is logistic regression, the risks for a continuous exposure will be estimated to be linear on the log-odds (logit) scale, such that the odds ratio for any one unit increase in the continuous variable is constant. However, the risks will not be linear on the linear (risk difference) or log (risk ratio) scales, such that these parameters will not be constant across the range of the continuous exposure. Users should be aware that the risk difference, risk ratio, number needed to treat/harm (for a binary outcome) and the incidence rate difference (for a rate/count outcome) reported with a continuous exposure apply specifically at the mean of the continuous exposure. The effects do not necessarily apply across the entire range of the variable. However, variations in the effect are likely small, especially near the mean.
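The point about scales can be checked numerically. The sketch below assumes a logistic model with made-up coefficients for age (these values are not taken from the package or its data) and compares a 10-year contrast at three starting ages. The odds ratio is identical at every age, while the risk difference and risk ratio change.

```r
# Assumed logistic model: log-odds of the outcome = -3 + 0.08 * age.
risk <- function(age) plogis(-3 + 0.08 * age)
odds <- function(age) risk(age) / (1 - risk(age))

ages <- c(30, 50, 70)
exposure.scalar <- 10  # report effects per 10-year increase

contrasts <- data.frame(
  age = ages,
  odds_ratio = odds(ages + exposure.scalar) / odds(ages),
  risk_difference = risk(ages + exposure.scalar) - risk(ages),
  risk_ratio = risk(ages + exposure.scalar) / risk(ages)
)
contrasts
```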

Interaction terms are not allowed in the model formula. The subgroup argument affords interaction between the exposure variable and a single covariate (that is forced to categorical if supplied as numeric) to estimate effects of the exposure within subgroups defined by the interacting covariate. To include additional interaction terms with variables other than the exposure, we recommend that users create the interaction term as a cross-product of the two interaction variables in a data cleaning step prior to running the model.
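Creating such a cross-product during data cleaning might look like the following. The data frame and column names are hypothetical. The new column can then be listed alongside the other covariates like any ordinary variable.

```r
dat <- data.frame(
  age = c(45, 60, 52, 70),
  smoker = c(0, 1, 1, 0)
)

# Cross-product of the two interacting covariates, created before modelling.
dat$age_x_smoker <- dat$age * dat$smoker

# Covariate names to adjust for, including the new cross-product term.
Z <- c("age", "smoker", "age_x_smoker")
dat
```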

The documentation for boot includes details about reproducible seeds when using parallel computing.

References

Ahern J, Hubbard A, Galea S. Estimating the effects of potential public health interventions on population disease burden: a step-by-step illustration of causal inference methods. Am. J. Epidemiol. 2009;169(9):1140–1147. doi: 10.1093/aje/kwp015

Altman DG, Deeks JJ, Sackett DL. Odds ratios should be avoided when events are common. BMJ. 1998;317(7168):1318. doi: 10.1136/bmj.317.7168.1318

Hernán MA, Robins JM (2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.

Hutton JL. Number needed to treat: properties and problems. Journal of the Royal Statistical Society: Series A (Statistics in Society). 2000;163(3):381–402. doi: 10.1111/1467-985X.00175

Robins J. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical Modelling. 1986;7(9):1393–1512. doi: 10.1016/0270-0255(86)90088-6

Snowden JM, Rose S, Mortimer KM. Implementation of G-computation on a simulated data set: demonstration of a causal inference technique. Am. J. Epidemiol. 2011;173(7):731–738. doi: 10.1093/aje/kwq472

Stang A, Poole C, Bender R. Common problems related to the use of number needed to treat. Journal of Clinical Epidemiology. 2010;63(8):820–825. doi: 10.1016/j.jclinepi.2009.08.006

Westreich D, Cole SR, Young JG, et al. The parametric g-formula to estimate the effect of highly active antiretroviral therapy on incident AIDS or death. Stat Med. 2012;31(18):2000–2009. doi: 10.1002/sim.5316

See Also

pointEstimate, boot

Examples

## Obtain the risk difference and risk ratio for cardiovascular disease or death between
## patients with and without diabetes.
data(cvdd)
set.seed(538)
diabetes <- gComp(cvdd, formula = "cvd_dth ~ DIABETES + AGE + SEX + BMI + CURSMOKE + PREVHYP",
                  outcome.type = "binary", R = 20)


riskCommunicator documentation built on June 1, 2022, 1:07 a.m.