# load other algorithm packages to clean up the output for agecurveAb(), below
require(tmle)
require(SuperLearner)
require(randomForest)
require(arm)
require(gam)
require(polspline)
require(glmnet)
require(scales)

Overview

This vignette describes the main functions in the tmleAb package. The software is written in R and is developed on GitHub. We also recommend the free RStudio computing environment.

This package is based on work funded by the National Institute of Allergy and Infectius Diseases grant K01-AI119180.

tmleAb is first and foremost a convenience wrapper for two related R packages (both required): SuperLearner and tmle both on the CRAN repository. Gruber and van der Laan^[Gruber S, van der Laan M. tmle: An R Package for Targeted Maximum Likelihood Estimation. J Stat Softw. 2012;51: 1–35.] provide a helpful overview of the tmle package, and the SuperLearner GitHub page includes additional resources beyond the R package and the original article^[van der Laan MJ, Polley EC, Hubbard AE. Super Learner. Stat Appl Genet Mol Biol. 2007;6: 1544–6115.].

At this time, there are two main functions in the package that will be most useful (details below):

In addition to these functions, the package also includes a number of antibody datasets:

Please see the article Measuring population changes in infectious disease transmission from quantitative antibody levels (Arnold et al. in review) for details about the methodology that underlies this software, as well as more information about the included datasets.

Istallation and Updates

You can install this package from GitHub using the devtools package in R.

library(devtools)
devtools::install_github("ben-arnold/tmleAb")

The package is in semi-active development. If you use the package and have run into issues and/or have suggestions for improvement feel free to write and we will try our best to respond!

agecurveAb

The agecurveAb function provides a convenient way to estimate an age-dependent antibody curve -- that is, an average antibody level by age in an population. To provide an example of how to use the function, we'll use the publically available Garki project data, which includes IFA antibody titres to Plasmodium falciparum measured in individuals living in 6 intervention villages and 2 control villages^[Molineaux L, Gramiccia G. The Garki Project [Internet]. World Health Organization; 1980. Available: http://garkiproject.nd.edu/static/documents/garkiproject.pdf] ^[Cornille-Brögger R, Mathews HM, Storey J, Ashkar TS, Brögger S, Molineaux L. Changing patterns in the humoral immune response to malaria before, during, and after the application of control measures: a longitudinal study in the West African savanna. Bull World Health Organ. 1978;56: 579–600.] A formatted version of the data is included in the package (?garki_sero). To simplify the exposition, we will focus only on the data collected in the active intervention period: three measurements took place after 20, 50, and 70 weeks of intervention:

# load the package into R
library(tmleAb)
# load the Garki serology data (included in the package)
data(garki_sero)
# and subset it to measurement rounds 3-5, ages <20, non-missing IFA Pf titres
d <- garki_sero
d <- subset(d,serosvy>=3 & serosvy<=5 & ageyrs<=20 & !is.na(ifatpftitre))
# create a log-transformed version of the IFA titres (add 1 for 0 values)
d$logpftitre <- log10(d$ifatpftitre+1)
# further subset the data by intervention group for convenience
dc <- d[d$tr=="Control" ,]
di <- d[d$tr=="Intervention",]

arguments

The agecurveAb function has a number of arguments, but these are the most important:

fitting curves

The parameter of interest is $E(Y_{a,x}) = E_W[E(Y|A=a, X=x, W)]$, for age $A$, group $X$ and covariates $W$. Important: agecurveAb does not separately stratify by group ($X$), and so it actually estimates $E(Y_{a}) = E_W[E(Y|A=a, W)]$, with stratification by group ($X$) achieved by estimating the curve on the stratified datasets.

Here is an example of how to fit separate age-antibody curves for the intervention ($X=1$) and control ($X=0$) villages in the Garki Project. The code below uses an algorithm library that includes the simple mean, a main terms generalized linear model, an antibody acquistion model proposed by Yman et al^[Yman V, White MT, Rono J, Arcà B, Osier FH, Troye-Blomberg M, et al. Antibody acquisition models: A new tool for serological surveillance of malaria transmission intensity. Sci Rep.; 2016;6: 19472.], generalized additive models with natural splines, and locally weighted regression:

# list of models and algorithms to include:
mylib <- c("SL.mean","SL.glm","SL.Yman2016","SL.gam","SL.loess")

# Intervention villages (X=1), fitted curve
set.seed(625234)
fit_i <- agecurveAb(Y=di$logpftitre,Age=di$ageyrs,id=di$id,SL.library=mylib)

Since the SuperLearner algorithm splits the data randomly by id for cross-validation, setting a seed before estimation ensures that you obtain perfectly reproducible results. The final, printed output from agecurveAb is the cross-validated risk of each algorithm included in the library (column 1) along with the weight given to the algorithm in the final prediction (column 2). "Risk" in this sense is not the epidemiologic concept of risk but instead the loss function. For quantitative antibody measurements, the loss function is the mean squared error (MSE) of the predicted antibody levels compared with the actual antibody levels.

In this particular example, generalized additive models (GAM) wins the day and is the only algorithm used in the prediction. The weights represent the optimal combination of the algorithm predictions with respect to the cross-validated risk (MSE). Since the algorithm is data-adaptive, the weighting can change depending on the dataset used or the particular mix of models/algorithms included in the library. For example, add random forest into the library:

mylib2 <- c("SL.mean","SL.glm","SL.Yman2016","SL.gam","SL.loess","SL.randomForest")
set.seed(625234)
fit_i2 <- agecurveAb(Y=di$logpftitre,Age=di$ageyrs,id=di$id,SL.library=mylib2)

With the addition of the new member to the ensemble, GAM still contributes to the super learner prediction, but the optimal combination also includes information from random forest. We have found in practice that highly adaptive algorithms such as random forest or multivariate adaptive regression splines (MARS) tend to produce jagged age-antibody curves, so using flexible but smoother nonparametric algorithms like natural splines can provide consistent curves that are easier to compare across populations.

A similar curve can be fit in the control villages, $X=0$ (reverting to the original library that excludes random forest):

# Control villages (X=0), fitted curve
fit_c <- agecurveAb(Y=dc$logpftitre,Age=dc$ageyrs,id=dc$id,SL.library=mylib)

As with the intervention villages, GAM is given the full weight in the super learner prediction. This is not at all true in general -- oftentimes the optimal combination of models/algorithms in the library weights many of them in the overall prediction. As we will see below, the age-antibody curves in this population are highly nonlinear, and the only algorithm in this library that can accommodate that is GAM. The important point is that it is impossible to know ahead of time what the optimal combination would be -- the data-adaptive machine learning provides an automated solution and investigators can remain agnostic about which model/algorithm to rely on in the curve fitting.

returned values

agecurveAb returns a list of objects that includes the input (feature matrix) used for estimation, along with fitted results (pY). Note that the objects returned from agecurveAb exclude observations with missing values in Y, Age, W (if not NULL), and id (if specified). Also note that factors in W are converted to design-matrix-style indicator variables. The objects are sorted by Age to make it easier to plot results. If the call includes covariates W, then pY values are marginally adjusted values at each age $A=a$.

visualizing the results

At this time, there is no default plotting method for agecurveAb objects (it is on the to-do list). But, we designed the function output to make plots relatively painless using the returned results. Simple plots are easy -- for example, compare control and intervention curves (fit above):

# grab a color blind friendly palette
cbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
cols <- cbPalette[c(7,6)]
# plot the curves
plot(fit_c$Age,fit_c$pY,type="l",col=cols[1],
    main="Age-dependent antibody curves\nin control and intervention villages",
    xlim=c(0,20),xlab="Age, years",
    ylim=c(0,4),ylab="Log10 P. Falciparum IFA antibody titre",
    bty="n",las=1)
lines(fit_i$Age,fit_i$pY,col=cols[2],lty=2)
legend("topleft",legend=c("Control","Intervention"),col=cols,lty=c(1,2),bty="n",horiz=T)

Or compare the different ensemble library fits in the intervention villages, and show the underlying data

# plot the curves
cols <- cbPalette[c(1,7)]
plot(fit_i2$Age,fit_i2$pY,type="l",col=cols[1],
    main="Age-dependent antibody curves\nin intervention villages with different ensembles",
    xlim=c(0,20),xlab="Age, years",
    ylim=c(0,4),ylab="Log10 P. Falciparum IFA antibody titre",
    bty="n",las=1)
lines(fit_i$Age,fit_i$pY,col=cols[2],lty=1)
legend("topleft",legend=c("Incl. random forest","Excl. random forest"),col=cols,lty=c(1,1),bty="n",horiz=T)

# add the underlying raw data points as well
library(scales)
points(fit_i$Age,fit_i$Y,pch=16,cex=0.2,col=alpha("black",alpha=0.5))

In this example, the restricted library closely approximates the more jagged prediction using the larger library that included random forest.

marginally adjusted curves

In some settings, such as evaluating programs that were not randomly allocated, you might want to adjust the curves for potentially confounding covariates W. This is straightforward: simply pass a dataframe into the W argument to agecurveAb. For example, we could estimate a marginally adjusted age-antibody curve in the intervention villages, additionally adjusting for sex, season (wet vs. dry), and village:

# fit the age-antibody curve in intervention villages, adjusting for sex, season, and village
fit_iadj <- agecurveAb(Y=di$logpftitre,Age=di$ageyrs,id=di$id,
                       W=subset(di,select=c("sex","wetseason","vname")),
                       SL.library=mylib)

Although the cross-validated MSE ("risk") is slightly lower after adjusting for the additional covariates, the mean age-antibody relationship is very similar after adjustment.

# compare the results with the unadjusted curve
cols <- cbPalette[c(1,7)]
plot(fit_i$Age,fit_i$pY,type="l",col=cols[1],
    main="Unadjusted and marginally adjusted\nage-dependent antibody curves in intervention villages",
    xlim=c(0,20),xlab="Age, years",
    ylim=c(0,4),ylab="Log10 P. Falciparum IFA antibody titre",
    bty="n",las=1)
lines(fit_iadj$Age,fit_iadj$pY,col=cols[2],lty=1)
legend("topleft",legend=c("Unadjusted","Adjusted"),col=cols,lty=c(1,1),bty="n",horiz=T)

tmleAb

Fitting age-antibody curves provides a visual comparison between groups, and tmleAb provides statistical inference for differences between the curves. Group means are a smooth functional of the curve, which allows for asymptotically normally distributed and efficient estimators. One such estimator is targeted maximum likelihood estimation (TMLE)^[van der Laan MJ, Rubin D. Targeted Maximum Likelihood Learning. Int J Biostat. 2006;2: 1–38.] ^[van der Laan MJ, Rose S. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Series in Statistics; 2011.]. An important feature of TMLE is that it still provides correct inference while incorporating data-adaptive machine learning in the estimation step.

Here, the parameter of interest is the group mean or difference in means, marginally averaged over age and potentially other covariates: $E(Y_x) = E_{A,W}[E(Y|X=x,A,W)]$, for age $A$, group $X$ and covariates $W$. In fact, estimating $E(Y_x)$ is the same as calculating the area under the age-antibody curves estimated with agecurveAb.

arguments

The tmleAb function has (by design) almost identical arguments to agecurveAb. These are the most important arguments:

If X=NULL, then tmleAb estimates the marginal mean antibody level along with influence curve based standard errors. If X includes a binary indicator for groups, then tmleAb estimates the difference between groups and provides inference for the difference.

estimate group means

Above in the agecurveAb example, we used the P. falciparum IFA test results in control and intervention villages. Using the notation above, intervention is denoted $X=1$ and control is $X=0$. Here, we illustrate how to estimate group means and in the next section a difference in means.

# list of models and algorithms to include:
mylib <- c("SL.mean","SL.glm","SL.Yman2016","SL.gam","SL.loess")

# Intervention villages (X=1)
set.seed(625234)
Wi <- data.frame(Age=di$ageyrs)
mu_i <- tmleAb(Y=di$logpftitre,X=NULL,W=Wi,id=di$id,SL.library=mylib)

# Control villages (X=0)
Wc <- data.frame(Age=dc$ageyrs)
mu_c <- tmleAb(Y=dc$logpftitre,X=NULL,W=Wc,id=dc$id,SL.library=mylib)

estimate difference in means

The age-antibody curves in control and intervention villages look different, and their means (a summary of the curve) also look different. We can formally test for differences using tmleAb. Note that in the examples above, it was convenient to rely on stratified datasets (di, dc) to separately estimate parameters in control and intervention villages. But, to compare groups you need to have a single dataset, with a binary indicator to identify groups. If you wish to compare many different groups, you can do it by repeating the call to tmleAb using separate indicator variables that contrast each comparison.

# make a binary indicator variable for intervention vs. control
d$tr01 <- ifelse(d$tr=="Intervention",1,0)
W <- data.frame(Age=d$ageyrs)
# estimate the difference between groups
psi_diff <- tmleAb(Y=d$logpftitre,X=d$tr01,W=W,id=d$id,SL.library=mylib)

returned values

The tmleAb function returns a list that includes the following objects:

Additional functions

The tmleAb package includes some additional functions that are listed on the reference page. Most are internal functions that you can ignore, but some are useful. If you wish to evaluate the predictive performance of the Super Learner and its constituent algorithms, then the cvSLAb() function provides a convenient wrapper for the CV.SuperLearner() routine. If you wish to fit age-dependent antibody acquisition models, then the Yman2016() function enables this.

References



ben-arnold/tmleAb documentation built on May 12, 2019, 10:55 a.m.