In UW-MSDS-DATA-598-Reproducibility-WI20/miao-qian-wang-wei-yang-replication-project: Replication project

Introduction

Generalized linear model (GLM) is a widely used method in research. It is well-established with high flexibility and relatively straightforward interpretation. However, GLMs usually require making assumptions and having the underlying assumptions not hold could invalidate the whole model.

The application of GLM can be observed in many fields including public health. @quattrochi2019 introduced an GLM via penalized maximum likelihood to correct the under-5 mortality (U5M) in populations affected by HIV/AIDS. This method was developed since the U5M is commonly estimated using survey-based approaches in less developed regions and it is based on a fundamental assumption that the mother's survival is not related to her children's, which is often violated in populations affected by HIV/AIDS. Therefore, the author sought a method to correct this bias using GLM.

In our work, we present our endeavor to replicate the model building and regression analysis introduced in @quattrochi2019. In the meantime, we attempted using different parameters in both constructing the model and applying it to test how robust the original method was. We will in this paper report our method and results from the replication attempt.

knitr::opts_chunk$set(echo = TRUE)

Methods

if(!require(tidyverse)){
  install.packages('http://cran.r-project.org/src/contrib/Archive/tidyverse/tidyverse_1.2.1.tar.gz',repos=NULL, type="source")
}
packageVersion("tidyverse")
require(tidyverse)
# ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
# ✔ ggplot2 3.1.0       ✔ purrr   0.3.0  
# ✔ tibble  2.0.1       ✔ dplyr   0.8.0.1
# ✔ tidyr   0.8.2       ✔ stringr 1.4.0  
# ✔ readr   1.1.1       ✔ forcats 0.3.0  
# ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
# ✖ dplyr::filter() masks stats::filter()
# ✖ dplyr::lag()    masks stats::lag()

if(!require(iterators)){
  install.packages('https://cran.r-project.org/src/contrib/Archive/iterators/iterators_1.0.10.tar.gz',repos=NULL, type="source")
}
packageVersion("iterators")
# 1.0.10
require(iterators)

if(!require(foreach)){
  install.packages('https://cran.r-project.org/src/contrib/Archive/foreach/foreach_1.4.4.tar.gz',repos=NULL, type="source")
}
packageVersion("foreach")
# 1.4.4
require(foreach)

if(!require(bcaboot)){
  install.packages("bcaboot",dependencies = TRUE,repos='http://cran.us.r-project.org')
}
require(bcaboot)
packageVersion("bcaboot")
# 0.2.1

if(!require(data.table)){
  install.packages("data.table",dependencies = TRUE,repos='http://cran.us.r-project.org')
}
require(data.table)
packageVersion("data.table")
# 1.11.8

if(!require(pls)){
  install.packages("pls",dependencies = TRUE,repos='http://cran.us.r-project.org')
}
require(pls)
packageVersion("pls")
# 2.7.0

if(!require(glmnet)){
  install.packages('https://cran.r-project.org/src/contrib/Archive/glmnet/glmnet_2.0-16.tar.gz',repos=NULL, type="source")
}
require(glmnet)
packageVersion("glmnet")
# 2.0.16

if(!require(officer)){
  install.packages("officer",dependencies = TRUE,repos='http://cran.us.r-project.org')
}
require(officer)
packageVersion("officer")
# 0.3.2

if(!require(flextable)){
  install.packages("flextable",dependencies = TRUE,repos='http://cran.us.r-project.org')
}
require(flextable)
packageVersion("flextable")
# 0.5.1


if(!require(magrittr)){
  install.packages("magrittr",dependencies = TRUE,repos='http://cran.us.r-project.org')
}
require(magrittr)
packageVersion("magrittr")
# 1.5

Overview

The original model was implemented in R (v3.5.2) [team2013r] with web application (https://johnquattrochi.com/bias/) to show that indirect method such as surveying can underestimate U5M by 0-41% in populations where 0-40% of the people are affected by HIV/AIDS. The model was written using the R language [team2013r] and some of the main packages used were tidyverse (v1.2.1) and foreach (v1.4.4). The original implementation utilized the doReids package (v1.1.1) that launches a back-end server in Docker for parallel processing that was not adopted in our replication. The main limitation is the computing resource we had for this project within a short period of time. Based on experiments performed on a m4.xlarge EC2 instance on AWS, the original simulation settings in the target paper consumed over 63% CPU and 4GB of memory and kept running for over 3 days using Docker doRedis package. The operator estimated of finishing the entire simulation step would require an 8-core CPU machine more than 1 week. In order to complete the simulation for constructing the model in a reasonable time frame without parallel computing, we had to stop the simulation and reduced the number of populations used in the simulation input from 4480 to 1000. The code provided by the author on Github (https://github.com/jquattro/hiv-childmort-bias) was mostly well-organized and documented. However, we discovered some mistakes regarding output inconsistency and wrong variable names. Despite the debugging process, we were able to replicate the implementation.

Model Description

The goal of the large-scale simulation used in the original paper was to build a demographic history that accounted for the variability among actual populations to measure if biases such as the independence assumption of mother and children survival rate in a population influenced by HIV/AIDS existed. @quattrochi2019 used stochastic simulation methods to generate a set of populations with input parameters including demographic features, mortality rate for U5M, maternal mortality rate, HIV prevalence, etc. The input simulation parameters vary over a large interval for different age groups across different populations. In each time step of one year, each woman in a population has some probabilities to give birth (for women between 15 to 49 years old), being infected with HIV, initiating ART (if HIV-positive), and dying given the simulation parameters. Through the aging process in the simulation population, children born with infected HIV mothers will have a probability of getting infected while all children will have a probability of mortality.

For calculating relative and absolute bias, we used the formulas provided in the original paper. $$\mbox{Relative bias} = \frac{IE_{survivors} - IE_{survivors,HIV deaths}}{IE_{survivors,HIV deaths}}$$

$$\mbox{Absolute bias} = IE_{survivors} - IE_{survivors,HIV deaths}$$ According to [@quattrochi2019], $IE_{survivor}$ was the estimattion of U5M based on living women in 2010, and $IE_{survivors,HIV deaths}$ was the estimation of U5M based on living women in 2010 and women died from HIV/AIDS before 2010 given the condition that they would have been 15-49 in 2010 if they had survived. $a_{d}$

Simulation Construction

# This section of code is designed to be no executable as it takes a long time to generate simulations.
# Instead, input data was provided as .RData files

##### DEFINE FOLDERS #######

if(!dir.exists("results/figdata")){
  dir.create("results/figdata",recursive = TRUE)
}

if(!dir.exists("results/regdata")){
  dir.create("results/regdata",recursive = TRUE)
}

# Simulation results will be stored here

if(!dir.exists("results/models")){
  dir.create("results/models",recursive = TRUE)
}

##### SIMULATION #####

# Read data

hivhogan = read.csv("data/inc_curves.csv",head=TRUE)
mort_series = read.csv("data/IHME_female_mortSMALL.csv",head=TRUE)
adultmort = read.csv("data/MLTfemSMALL.csv",head=TRUE)
worldfert= read.csv("data/world_fert.csv",head=TRUE)
tfr_series= read.csv("data/tfr_gapminder_long.csv",head=TRUE)
art_series= read.csv("data/sampleART.csv",head=TRUE)
u5m_edit= read.csv("data/u5m_edit.csv",head=TRUE)
matmort = read.csv("data/matmort.csv",head=TRUE)

# Add HIV-free populations to hivhogan
hivhogan$Country <- factor(hivhogan$Country,levels=c(levels(hivhogan$Country),"zero"))
hivhogan[63,1] = "zero"
hivhogan[63,c(2:47)]=0

# Create parameters sets

fertcountry <- c("Botswana","Uganda")
cm_cntry <-  c("Mali","Morocco")
am_cntry <-  c("Madagascar","Sudan")
sexactive15 <-  c(30,70)
mmr0 <- c(0.0012,0.012)
mmr_dec <- c(0,0.073) # annual percent decline in MMR
curve <- c("BotswanaUrban","LesothoRural","MalawiRural","UgandaRural","CamerounRural","zero","BotUrb2x")
bfeed <- c(6,18)
art_col <- c("zero","Botswana","Cameroon","Malawi","Bot_dub")
growth <-  c(0.03)
yrend <- 2010
# Starting year
yrstart <- 1906

# CD4 threshold

threshold <- 200


# expand.grid() will create dataset with unique combinations in each row

inputs <- expand.grid(fertcountry, cm_cntry, am_cntry, sexactive15,mmr0,mmr_dec,curve,bfeed,art_col,growth,yrend,yrstart,threshold)
names(inputs) = c("fertcountry", "cm_cntry","am_cntry","sexactive15","mmr0","mmr_dec","curve","bfeed","art_col","growth","yrend","yrstart","threshold")

# Save the inputs set for later

save(inputs,file = "results/models/inputs.RData")

file_number_format <- paste0("%0",nchar(as.character(nrow(inputs))),"d")


# Set size of initial population

initial_populations<- c(7500,15000,22500)
years <- c(1906:2010)
ages <- c(0:120)

# Run the simulations for each initial population

for(ip in initial_populations){

  # Parameter sets already processed
  #ip = 150
  psets_already <- as.integer(gsub("models|\\.Rdata","",list.files(paste0("./results/models/p",ip),pattern = "models\\d+",full.names = FALSE)))

  psets_to_process <- setdiff(1:nrow(inputs),psets_already)

  # Run simulation for each set of parameters

  foreach(pset=psets_to_process,.packages = c("dplyr")) %dopar%{

    inp <- inputs[pset,]

    source("R/simulation_functions.R")

    source("R/indirect_estimates_functions.R")



    # Run simmulation
    set.seed(1)
    results <- bigsim(inp,initialpop=ip,years,ages,hivhogan,mort_series,adultmort,worldfert,tfr_series,art_series,u5m_edit,matmort)

    # Save simulation results
    save(results,file=file.path(paste0("./results/models/p",ip),paste("models",sprintf(file_number_format,pset),".Rdata",sep="")))
  }

}

# Collect hiv2000 differences among initial populations

# For each input row

hiv2000 <- foreach(pset=1:1000,.packages = c("dplyr")) %dopar%{

  # And each initial population
  paste0("p",initial_populations) %>% lapply(function(popini){

    # Load the results file for that initial population and input set

    load(file.path("results/models/",popini,paste("models",sprintf(file_number_format,pset),".Rdata",sep="")))
    print(paste(initial_populations,pset))
    # Extract the value of hiv2000
    r <- results[[1]]$hiv2000

    names(r) <- popini
    r
  }) %>% unlist() %>% t %>% data.frame(input_set=pset,.) # Put everything in one data.frame


} %>% bind_rows

# Compute difference

hiv2000 <- hiv2000 %>% mutate(p7500_minus_p15000=p7500-p15000,p15000_minus_p22500=p15000-p22500)

save(hiv2000,file = "results/hiv2000.RData")



##### INDIRECT ESTIMATES FOR INITAL POPULATION 22500 #####

# See if we have something not processed

psets_already <- intersect(as.integer(gsub("figdata.|\\.Rdata","",list.files("./results/figdata/p22500",pattern = "figdata.\\d+",full.names = FALSE))),
                           as.integer(gsub("regdata.|\\.Rdata","",list.files("./results/regdata/p22500",pattern = "regdata.\\d+",full.names = FALSE))))

psets_to_process <- setdiff(1:nrow(inputs),psets_already)

# For each input set not processed

foreach(pset=1:1000,.packages = c("dplyr")) %dopar%{

  source("R/indirect_estimates_functions_revised.R")

  load(file.path("results/models/p22500/",paste("models",sprintf(file_number_format,pset),".Rdata",sep="")))
  print(pset)

  w <- results[[7]]

  # Are babies of HIV moms getting HIV?
  # What percent of HIV moms babies have HIV?
  # How many born with HIV? 
  # df = as.data.frame(w)
  # y <- subset(df, hiv_date==2009)
  # plot(df$dob, df$hiv_date)  
  ##################################################  

  # format data
  momkidsclean <- as.data.frame(w)

  # Using all surviving women
  isf <-ind_est(momkidsclean)

  # Using all women
  isf_all <-ind_est_all(momkidsclean)

  #Using surviving women and women who died from HIV
  isf_hiv <- ind_est_hiv(momkidsclean)

  ie.three <- cbind(isf,isf_all,isf_hiv,pset)

  ###################################################################
  # Save results for regressions and for figures
  ###################################################################


  inp.plus <- as.data.frame(results[1])    
  regdata <- cbind(ie.three,inp.plus[1,], row.names = NULL) # Data for regressions
  figdata <- results[c(2:6,8:9)] # Data for figures


  save(regdata,file=file.path("results/regdata/p22500/",paste("regdata.",sprintf(file_number_format,pset),".Rdata",sep="")))
  save(figdata,file=file.path("results/figdata/p22500/",paste("figdata.",sprintf(file_number_format,pset),".Rdata",sep="")))
  print('Saved')
  rm(results,regdata,figdata,momkidsclean,isf,isf_all,isf_hiv,ie.three)
}

##### COLLECT DATA FROM ALL SIMULATIONS FOR INITIAL POPULATION 22500 #####


# Merge regdata

psets_to_process <- as.integer(gsub("regdata.|\\.Rdata","",list.files("./results/regdata/p22500",pattern = "regdata.\\d+",full.names = FALSE))) %>% sort

nbd2k <- psets_to_process %>% lapply(function(pset){

  filename <- file.path("results/regdata/p22500",paste("regdata.",sprintf(file_number_format,pset),".Rdata",sep=""))
  tryCatch({
    load(filename)

    regdata %>% mutate(i=pset) %>% select(i,everything())

  }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
}) %>% bind_rows() %>% as.data.frame()


save(nbd2k,file="results/regdata/p22500/regdata_all.Rdata")

# Merge figdata


h <- list()
tf<- list()
ar<- list()
hd<- list()
u5m <- list()
inps <- list()
arp <- list()
hivinc <- list()

psets_to_process <- as.integer(gsub("figdata.|\\.Rdata","",list.files("results/figdata/p22500",pattern = "figdata.\\d+",full.names = FALSE))) %>% sort

for(pset in psets_to_process){

  filename <- file.path("results/figdata/p22500",paste("figdata.",sprintf(file_number_format,pset),".Rdata",sep=""))
  tryCatch({
    load(filename)


    h[[pset]] <- figdata[[3]]
    tf[[pset]] <- figdata[[1]]
    ar[[pset]] <- figdata[[2]]
    hd[[pset]] <- figdata[[4]]
    inps[[pset]] <- figdata[[5]]
    arp[[pset]] <- figdata[[6]]
    hivinc[[pset]] <- figdata[[7]]
    rm(figdata)

  }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
}

save(h,tf,ar,hd,u5m,inps,arp,file="results/figdata/p22500/figdata_all.Rdata")

Range of Inputs

Since the primary goal in @quattrochi2019 was to measure bias in indirect estimates across a set of populations that have experienced different rates of fertility, mortality, HIV infection, and ART (antiretroviral therapy) initiation. We adopted the same method to generate a set of populations using the following inputs: fertility, adult mortality, U5M, percent of 15–19 year olds who are sexually active, maternal mortality in 1990, percent annual decline in the maternal mortality rate, HIV incidence, duration of breastfeeding, and ART coverage. The original paper simulated one population for each combination of inputs resulting in a total of 4480 populations. Since the original paper did not provide information on the amount of time required to complete the simulation, we started our process with the same 4480 populations but without parallel computing, we could not complete the simulation within the timeframe of our study and stopped the simulation when it reached 1000 populations. For the purpose of our replication study, we set the seed for simulation to be 1, which was different from the seed value 1000 in the original paper.

Simulation Parameters

Initially, we attempted to some parameters in the simulation construction process. We tried to reduce the year interval of our study from [1906, 2010] to [1906, 1966] as well as the population age interval from [0,120] to [0,75]. This turned out to be unsuccessful since the the above parameters were not only used in the simulation process but also in the following model selection and regression application. For instance, to estimate HIV prevalence, the model divides the population into 7 age groups, which cannot be achieved if the year or age parameter was reduced. As a result, we used the same size and date for the initial population in the simulation. We initiated with 22,500 women who were aged 15 years in 1906, and ran the simulation through 2010.

For other parameters in simulation, we followed the original paper and used the same data sources. Annual probability of birth was defined as a function of calendar year and mother’s age and set to zero for women younger than 15 years and older than 49 years. Estimates of age-specific fertility rates (ASFR) from the United Nations Population Division’s World Fertility Data [@unitednations] was used and for years when ASFR were not available, the adjusted the nearest available ASFR using the interpolated estimates of the total fertility rate (TFR) from the United Nations Population Division’s World Population Prospects [@unitednations_2013] was used.

As mentioned in the original paper, according to [@chen2010fertility], women between 15–19 years old who were HIV-positive experienced higher ASFRs compared to HIV-negative women, depending if they were sexually active. Therefore the ASFR parameter was adjusted using the ratios estimated by [@chen2010fertility]. ASFR was also adjusted based on the studies that have found an increase in pregnancy following ART [@myer2010impact][@makumbi2011associations][@homsy2009reproductive]. The effect of ART on fertility likely depends on a variety of factors including age, cluster of differentiation 4 (CD4) count at initiation according to the original paper. For the simulation model, the assumption that the ASFR for 15–19 year olds is not affected by ART was made to further simplify the simulation. Inputs relating to maternal mortality was calculated using MMR (maternal deaths per 100,000 births) in 1990 and the annual decline in MMR since 1990. The initial value of the MMR according to [@hogan2010maternal] was either 0.0012 or 0.012 and the annual rate of decline was set to 0 or 7.3%.

For other inputs including the annual probability of HIV infection[@hogan2012spline][@heuveline2003hiv], the parameters governing CD4[@hallett2008estimating], ART coverage[@databank], the annual probability of death[@worldhealthorganization_2013] and the probability of mother-to-child transmission of HIV[@stover2008spectrum], we also used the same data sources and derivation methods.

Model Fitting and Model Selection

Newly-generated simulation data were then imported and fitted to various models to select for a best-performing model and the selection process used least errors to represent model performances.

The simulated data were first divided into training set (80%) and test set (20%). The training set was fitted using eight different models, which were further evaluated by calculating several error metrics (root mean square error, root median square error, mean relative error and median relative error) on both training set (in-sample) and test set (out-of-sample) shown in Table 1(Fig. \@ref(fig:table1)).

Compared to the estimation errors calculated in the original paper, we had very similar results of the root mean square errors and root median square errors for all models. However, our in-sample mean relative errors were consistently greater compared to the original paper and we suspect this could be due to the reduction in population size during the simulation. Consistent with the original paper, no single model in the table showed clear advantages for all the metrics over others we selected the generalized linear regression with alpha equal to 1 as our predictive model since it showed the best performance for most of the metrics.

knitr::include_graphics("table1_re.png")

# Get data from simulation
load("results/regdata_all.Rdata")
load("results/inputs.RData")

# add agegroup variable
nbd2k %<>% mutate(agegroup=rep(c(1:7),nrow(nbd2k)/7))

# Generate abs error and rel error
nbd2k %<>% mutate(abs.err=fiveq0_surv-fiveq0_hiv,
                  rel.err=(fiveq0_surv-fiveq0_hiv)/fiveq0_hiv
)

# new dep var 
nbd2k %<>% mutate(corr=fiveq0_hiv-fiveq0_surv,
                  corr2=corr+2.100105e-05)

##################################################################
##### MODEL FITTING AND MODEL SELECTION #####
##################################################################

compute_error_measures <- function(y,prediction,method,sample){

  # Absolute error
  abs.error <- abs(y-prediction)

  # Relative error
  relative.error <- abs.error/abs(y)

  # Root mean square error

  rmse <- sqrt(mean(abs.error^2))

  # Root median square error

  rmedse <- sqrt(median(abs.error^2))

  # Mean relative error

  mre <- mean(Filter(function(x) !is.infinite(x),relative.error),na.rm=TRUE)

  # Median relative error

  medre <- median(relative.error)

  data.frame(method=method,sample=sample,rmse=rmse,rmedse=rmedse,mre=mre,medre=medre,stringsAsFactors = FALSE)

}

# Prepare data
to.model <- nbd2k %>% mutate(
  agegroup=as.factor(agegroup)) 

full.model.formula <- corr ~ fiveq0_surv + agegroup + agegroup*hiv1990 + agegroup*hiv2000+  agegroup*hiv2010 + 
  agegroup*art_prev2005+ agegroup*art_prev2007+ agegroup*art_prev2009 + tfr2000 + tfr2010

minimum.model.formula <- corr ~ fiveq0_surv

# Out of sample data
fullsamp <- unique(to.model$i)
set.seed(400)

# dataset with 80% of full sample
ns1 <- sample(fullsamp, length(fullsamp)*.8, replace=FALSE)

# Select those i's from nbd2k
train.sample <- to.model %>% filter(i %in% ns1)

# dataset with 20% of full sample
tw1 <- setdiff(fullsamp,ns1)
test.sample <- to.model %>% filter(i %in% tw1)

error_table <- data.frame()

##### Full linear model #####

fit <- lm(full.model.formula,data=train.sample)
prediction <- predict(fit,newx = train.sample)
error_table %<>% bind_rows(compute_error_measures(train.sample$corr,prediction,"Full Linear Model","in-sample"))

prediction <- predict(fit,newx = test.sample)
error_table %<>% bind_rows(compute_error_measures(test.sample$corr,prediction,"Full Linear Model","out-of-sample"))

##### Forward and Backward selection #####

# maximum model

model.max <- glm(formula=full.model.formula,
                 family = "gaussian"(link="identity"), data=train.sample)

# minimum model

model.min <- glm(formula=minimum.model.formula, family = "gaussian"(link="identity"), data=train.sample)

# Forward BIC

fit <- step(object=model.min, scope=list(upper=model.max,lower=~1), direction="forward",k=log(nrow(train.sample)))

prediction <- predict(fit,newx = train.sample)

error_table %<>% bind_rows(compute_error_measures(train.sample$corr,prediction,"Forward Sel. BIC","in-sample"))

prediction <- predict(fit,newx = test.sample)

error_table %<>% bind_rows(compute_error_measures(test.sample$corr,prediction,"Forward Sel. BIC","out-of-sample"))

# Forward AIC

fit <- step(object=model.min, scope=list(upper=model.max,lower=~1), direction="forward",k=2)

prediction <- predict(fit,newx = train.sample)

error_table %<>% bind_rows(compute_error_measures(train.sample$corr,prediction,"Forward Sel. AIC","in-sample"))

prediction <- predict(fit,newx = test.sample)

error_table %<>% bind_rows(compute_error_measures(test.sample$corr,prediction,"Forward Sel. AIC","out-of-sample"))

# Backward BIC

fit <- step(model.max, direction="backward",k=log(nrow(to.model)))

prediction <- predict(fit,newx = train.sample)

error_table %<>% bind_rows(compute_error_measures(train.sample$corr,prediction,"Backward Sel. BIC","in-sample"))

prediction <- predict(fit,newx = test.sample)

error_table %<>% bind_rows(compute_error_measures(test.sample$corr,prediction,"Backward Sel. BIC","out-of-sample"))

# Backward AIC

fit <- step(model.max, direction="backward",k=2)

prediction <- predict(fit,newx = train.sample)

error_table %<>% bind_rows(compute_error_measures(train.sample$corr,prediction,"Backward Sel. AIC","in-sample"))

prediction <- predict(fit,newx = test.sample)

error_table %<>% bind_rows(compute_error_measures(test.sample$corr,prediction,"Backward Sel. AIC","out-of-sample"))

##### Glmnet #####
set.seed(500)
glmnet.errors <- lapply(c(0,0.5,1.0),function(alpha){

  # Get output vector for glmnet
  y <- train.sample %>% pull(all.vars(full.model.formula)[1])

  # Get predictors for glmnet
  x <- model.matrix(full.model.formula,train.sample)[,-1]

  set.seed(500)
  fit <- cv.glmnet(x,y,family="gaussian",alpha=alpha)

  prediction <- predict(fit,newx = x,s = "lambda.min")

  train.error <- compute_error_measures(train.sample$corr,prediction,paste0("glmnet, alpha=",alpha),"in-sample")

  xnew <- model.matrix(full.model.formula,test.sample)[,-1]

  prediction <- predict(fit,newx = xnew,s = "lambda.min")

  test.error <- compute_error_measures(test.sample$corr,prediction,paste0("glmnet, alpha=",alpha),"out-of-sample")

  bind_rows(train.error,test.error)

}) %>% bind_rows()

error_table %<>% bind_rows(glmnet.errors)

##### PCR #####

set.seed(100)
fit <- pcr(full.model.formula,data=train.sample,scale=TRUE,validation="CV", jackknife=TRUE)

ncomps <- c(20,30,selectNcomp(fit,method="onesigma",plot=TRUE)) %>% sort

pcr_errors <- lapply(ncomps,function(ncomp){

  prediction <- predict(fit,newdata = train.sample, type = "response",ncomp = ncomp)

  train.error <- compute_error_measures(train.sample$corr,prediction,paste0("PCR, ncomp=",ncomp),"in-sample")

  prediction <- predict(fit,newx = test.sample,ncomp = ncomp)

  test.error <- compute_error_measures(test.sample$corr,prediction,paste0("PCR, ncomp=",ncomp),"out-of-sample")

  bind_rows(train.error,test.error)

}) %>% bind_rows()

error_table %<>% bind_rows(pcr_errors)

##### PLS #####

set.seed(100)
fit <- plsr(full.model.formula,data=train.sample,scale=TRUE,validation="CV", jackknife=TRUE)
summary(fit)
validationplot(fit)

# 32 components are enough to account for at least 90% of variaiblity in both X and Y. Also use optimal number of components

ncomps <- c(32,selectNcomp(fit,method="onesigma",plot=TRUE)) %>% sort

pls_errors <- lapply(ncomps,function(ncomp){

  prediction <- predict(fit,newdata = train.sample, type = "response",ncomp = ncomp)

  train.error <- compute_error_measures(train.sample$corr,prediction,paste0("PLS, ncomp=",ncomp),"in-sample")

  prediction <- predict(fit,newx = test.sample,ncomp = ncomp)

  test.error <- compute_error_measures(test.sample$corr,prediction,paste0("PLS, ncomp=",ncomp),"out-of-sample")

  bind_rows(train.error,test.error)

}) %>% bind_rows()

error_table %<>% bind_rows(pls_errors)

###### Table 4 #####

# Prediction errors from models to correct bias in indirect estimates of U5M

# Save error table

ft <- error_table  %>% 
  mutate_at(vars(one_of("rmse","rmedse")),funs(sprintf("%0.6f",.))) %>%
  flextable() %>% set_header_labels(rmse="Root mean\nsquare error",
                                    rmedse="Root median\nsquare error",
                                    mre="Mean relative\nerror",
                                    medre="Median relative\nerror") %>%
  merge_v(j=1)

doc <- read_docx() %>%
  body_add_par(value = "Model selection.", style = "table title") %>% 
  body_add_flextable(ft)

doc %>% print("tables/table4.docx")

# glmnet with alpha = 1 was chosen to be the best-fitting-model

##### Best fitting model #####

# Get output vector for glmnet

y <- to.model %>% pull(all.vars(full.model.formula)[1])

# Get predictors for glmnet

x <- model.matrix(full.model.formula,to.model)[,-1]

alpha <- 1.0

set.seed(500)
best.fitting.model <- cv.glmnet(x,y,family="gaussian",alpha=alpha)

lambda <-best.fitting.model$lambda.min

# To use for predictions later.

best.fitting.model <- glmnet(x,y,family="gaussian",alpha=alpha,lambda = lambda)

save(x,best.fitting.model,full.model.formula,file="results/best_fitting_model.RData")

Replication of Figures

knitr::include_graphics("figure4-ori.png")

knitr::include_graphics("figure5-ori.png")

Figure 4 (Fig. \@ref(fig:ori-fig-4)) and 5 (Fig. \@ref(fig:ori-fig-5)) in the original article were created by applying the best-performing model as well as an independent model [@ward2008effect] to the empirical data from 2010 DHS for Malawi and Tanzania [@nationalbureauofstatistics-nbs_tanzania_icf_macro_2011][@nationalstatisticaloffice_icf_macro_2011] respectively to estimate U5M of different age groups of population. By comparing differences in U5M estimates of unadjusted data and adjusted data by two models, the authors suggested that the new model was able to correct the bias from indirect estimates.

With the new simulation data and best-fitting model, we re-generated both figures by using R scripts provided (Fig. \@ref(fig:rep-fig-4) and Fig. \@ref(fig:rep-fig-5)).

To prepare data for plotting, we created three functions with R code provided by the authors to estimate U5M using the indirect method, the Ward-Zaba model and our best-performing model. These three functions were then further integrated into a function called plot_fig to create our replicated figures. We also included a file with necessary data for calculating the confidence interval by bootstrapping for prediction with our best-fitting model, since this step usually takes more than 40 minutes to finish.

indirect_estimate <- function(dataset, fig_country){

  # extract data of the plotting country
  cntry <- dataset %>% filter(country==fig_country)  

  # indirect estimates with empirical data
  source("R/indirect_estimates_functions_for_reg.R")
  ind_surv <- ind_est_computations(cntry,2010) %>% as.data.frame() %>% mutate(agegroup=1:7)
  ind_surv %<>% 
    select(agegroup,fiveq0=fiveq0, t_ref,refdate)

  # merge indirect estimates based on empirical data from Malawi or Tanzania with
  # data on HIV, ART and fertility from that same country
  # ind_est contains the results of indirect_estimates_functions applied to the data from Malawi or Tanzania
  # cntry contains the data from Malawi or Tanzania (from facevalidity.csv)
  iep <- cntry %>% 
    left_join(ind_surv,by="agegroup") %>% 
    rename(fiveq0_surv=fiveq0) %>% 
    mutate(agegroup=factor(agegroup)) %>% 
    select(-cd)
  colnames(iep) <- c("country","agegroup", "ceb","cs","tfr2010","tfr2000","hiv1990", "hiv2000", "hiv2010","art2005","art2006","art2007","art2008","art2009","art2010","art2011","art_prev2005","art_prev2006","art_prev2007","art_prev2008","art_prev2009","prop15to19_2010","prev15to19_2010","hiv2005","hiv2006","hiv2007","hiv2008","hiv2009","fiveq0_surv", "t_ref","refdate")

  return(iep) 
}

wz_model <- function(dataset, fig_country, iep){

  # extract data of the plotting country
  cntry <- dataset %>% filter(country==fig_country)

  # Ward & Zaba (2008) coefficients
  a=c(.1134,.1202,.1107,.1720,.1749,.1504,.2552)
  b=c(-.0226,-.0438,-.0882,-.1412,-.1758,-.1849,-.3269)
  c=c(-.1112,-.0978,-.0373,.0163,.042,.1807,.5594)

  # W&Z adjustment factor from applying coefficients to empirical data
  nz = a*iep$hiv2000+b*(iep$hiv2000^2)+c*iep$prev15to19_2010

  #Convert each n q 0 + n(z) into 5q0

  # Standard indirect estimates, adding Ward&Zaba adj factors
  # n = c(1,2,3,5,10,15,20)
  # UN General Female LT, e(0)=60, for above n
  l = c(.92688,.91025,.90151,.89193,.89534,.89002,.88195)
  q = 1-l
  logitMLT=.5*log(q/(1-q))
  alpha = NA
  fiveq0WZ = NA
  Y5 = NA

  nq0 <- ind_est_computations_core(cntry,2010)$nq0

  for(i in 1:7){
    Y5[i] = .5*log((nq0[i]+nz[i])/(1-(nq0[i]+nz[i])))
    alpha[i] = Y5[i] - logitMLT[i]
    fiveq0WZ[i] = exp(2*(alpha[i]+logitMLT[4]))/(1+exp(2*(alpha[i]+logitMLT[4])))
  }

  # fiveq0WZ are the Ward & Zaba estimates of U5M
  return(fiveq0WZ)
}

best_model <- function(iep, fig_name){

  # Load saved best fit model
  # Contains best.fitting.model and full.model.formula
  load("results/best_fitting_model.RData")

  # Load bootstrap estimates for model predictions
  if(file.exists("results/boot_outs.RData")){
    load("results/boot_outs.RData")
  }else{
    boot_outs <- list()  
  }

  # Get the new predictions in matrix format
  nx <- model.matrix(full.model.formula,iep %>% mutate(corr=0))[,-1]

  # Make predictions
  prediction <- predict(best.fitting.model,newx = nx,se=TRUE,s="lambda.min") %>% 
    as.data.frame %>% 
    set_colnames("PredictedAdj")

  # Bootstrap confidence intervals for predictions
  if(is.null(boot_outs[[fig_name]])){
    # Get output vector for glmnet
    y <- to.model %>% pull(all.vars(full.model.formula)[1])

    # Get predictors for glmnet
    x <- model.matrix(full.model.formula,to.model)[,-1]
    Xy <-  cbind(y,x)
    boot_models <- list(length=5)
    t0 <- predict(best.fitting.model,newx = x,s="lambda.min") %>% as.vector
    tt <- boot_models[1:5] %>%purrr::map(~predict(best.fitting.model,newx = x,s="lambda.min") %>% as.vector) %>% unlist()
    #lambda <- best.fitting.model$lambda.min
    lambda <- best.fitting.model$lambda
    theta <- as.matrix(coef(best.fitting.model, s = "lambda.min"))
    pi_hat <- predict(best.fitting.model, newx = x, s = "lambda.min", type = "response")
    n <- length(pi_hat)
    y_star <- sapply(seq_len(length(tt)), function(i) ifelse(runif(n) <= pi_hat, 1, 0))
    suff_stat <- t(y_star) %*% x

    #bcapar(t0=t0,tt=tt,bb=suff_stat)
    # This is the function that we will use for each bootstrap sample
    theta <- function(xy,nx,glmnet_alpha,lambda){
      X <- xy[,2:(dim(xy)[2])]
      Y <- xy[,1]

      # Fit the model for the sample and using the same lambda as in the best model fitted
      m <- glmnet(X,Y,family="gaussian",alpha=glmnet_alpha,lambda=lambda)

      # Keep the models computed for each bootstrap sample in order to compute prediction intervals faster
      # boot_models <<- c(boot_models,list(m))

      # Prediction
      predict(m,newx = nx,s="lambda.min") %>% as.vector

    }

    # Keep the models computed for each bootstrap sample in order to compute prediction intervals faster
    # boot_models <- list()

    # Compute bootstrapped confedence intervals for each prediction point
    boot_outs[[fig_name]] <- lapply(1:nrow(nx),function(i){
      # i <- 1
      new_data <- nx[i,,drop=FALSE]

      set.seed(600)

      bcajack(x=Xy,B=500,func=theta,new_data,1.0,lambda)
    })
  }

  # Get two tailed 95% CI values 
  ci <- boot_outs[[fig_name]] %>% lapply(function(r){

    r$lims %>% as.data.frame() %>% mutate(quantile=as.numeric(rownames(.))) %>% filter(quantile %in% c(0.025,0.975)) %>% select(quantile,bca) %>% mutate(quantile=case_when(quantile==0.025 ~ "ci.lo",TRUE~ "ci.hi")) %>% spread(quantile,bca)
  }) %>% bind_rows()

  prediction %<>% bind_cols(ci)
  return(prediction)
}

plot_fig <- function(fig_name, fig_country){
  # Load empirical data
  fv <- fread("data/facevalidity.csv") %>% set_colnames(c("country","agegroup", "ceb","cs","tfr2010","tfr2000","hiv1990", "hiv2000", "hiv2010","art2005","art2006","art2007","art2008","art2009","art2010","art2011","art_prev2005","art_prev2006","art_prev2007","art_prev2008","art_prev2009","prop15to19_2010","prev15to19_2010","hiv2005","hiv2006","hiv2007","hiv2008","hiv2009")) %>% mutate(cd=ceb-cs)


  # indirect estimates with empirical data
  iep <- indirect_estimate(fv, fig_country)

  # estimate from Ward & Zaba (2008) model
  fiveq0WZ <- wz_model(fv, fig_country, iep)

  # make prediction with best-performing model
  prediction <- best_model(iep, fig_name)

  # Prepare data for plot
  to.plot <- bind_cols(iep, prediction,data.frame(fiveq0WZ=fiveq0WZ)) %>% 
    mutate(fit=fiveq0_surv+PredictedAdj,lwr=fiveq0_surv+ci.lo,upr=fiveq0_surv+ci.hi)%>% # Compute adjusted values and prediction interval upper and lower limits
    select(refdate,upr,lwr,Adjusted=fit,Unadjusted=fiveq0_surv,`Ward & Zaba`=fiveq0WZ) %>%   # Transform data for plot
    gather(type,value,-refdate,-upr,-lwr)

  # Plot
  ggplot(to.plot,aes(x=refdate,y=value,shape=type)) +
    geom_point(color="black", size=3) +
    geom_errorbar(aes(ymin=lwr, ymax=upr))+ 
    xlab("Year") +
    ylab("Under-five mortality (deaths per live birth)") +
    scale_shape_discrete(name="Estimate type", # Legend label, use darker colors
                         labels=c("Adjusted", "Unadjusted","Ward & Zaba"))+
    labs(title= fig_country) + 
    theme_classic() +
    theme(axis.text = element_text(color="black"))

}

# Figure 4
# Under-five mortality estimates using CEB/CS from 2010 DHS for Malawi:
# unadjusted (crude), adjusted (corrected) using best-performing model, and using Ward &
# Zaba (2008) model

plot_fig('figure4','Malawi')

# Figure 5
# Under-five mortality estimates using CEB/CS from 2010 DHS for Tanzania:
# unadjusted (crude), adjusted (corrected) using best-performing model, and using Ward &
# Zaba (2008) model

plot_fig('figure5','Tanzania')

We revised some parts of the regression model and corrected inconsistent variable names in the original code during the debugging process. For instance, caught corrected the variable name “lambda.min” where it should have been “lambda”.

 boot_models <- list(length=5)
    t0 <- predict(best.fitting.model,newx = x,s="lambda.min") %>% as.vector
    tt <- boot_models[1:5] %>%purrr::map(~predict(best.fitting.model,newx = x,s="lambda.min") %>% as.vector) %>% unlist()
    #lambda <- best.fitting.model$lambda.min
    lambda <- best.fitting.model$lambda
    theta <- as.matrix(coef(best.fitting.model, s = "lambda.min"))
    pi_hat <- predict(best.fitting.model, newx = x, s = "lambda.min", type = "response")
    n <- length(pi_hat)
    y_star <- sapply(seq_len(length(tt)), function(i) ifelse(runif(n) <= pi_hat, 1, 0))
    suff_stat <- t(y_star) %*% x

Result

Our implementation of the model to estimate the bias in U5M in HIV/AIDS affected populations from @quattrochi2019 was able to replicate the process and output target tables and figures. Consistently, indirect methods underestimate U5M in older populations, while little difference was shown in younger age groups between unadjusted and adjusted estimates. Applying our model to 2010 survey data from Malawi and Tanzania, we show that indirect methods would underestimate U5M by up to 7%.

Table 2 (Fig. \@ref(fig:table2)) shows our estimation by applying the best performing model on 1,000 simulated populations and consistent with the original implementation, the mean HIV prevalence we calculated among women aged 15–49 across populations was 7% in 1990, 15% in 2000, and 11% in 2010. The highest HIV prevalence in any simulation was 40% in 2000. Our results related to ART showed the same trend as in the original paper that both prevalence and coverage increase with time, although the exact numbers varied. The mean ART coverage across our simulations was 0% in 2004 and 8% in 2010. The mean ART prevalence was < 0.1% in 2005 and 0.19% in 2009. The highest ART prevalence in any simulation was 2.3% in 2009 and this was 1.7% lower compared to the original model. The mean TFR across simulations was 4.88 in 2000 and 4.28 in 2010, also consistent with the original model.

knitr::include_graphics("table2_re.png")

Table 3 (Fig. \@ref(fig:table3)) shows the bias in indirect estimates across age groups. According to the original paper, negative numbers indicate that unadjusted estimates were lower than adjusted estimates. The mean absolute bias was largest for estimates from women aged 35–39 and 40–44 (−0.021) and smallest for estimates from women aged 15–19 (0.000). Our results showed the same age group with a small variance in the exact numbers. The original paper also emphasized on the largest absolute bias recorded, which was −0.071 for estimates from women 35–39 in our calculation and -0.069 for them. The paper pointed out that this means that the estimated U5M was 71 deaths per 1000 live births lower when using only reports from surviving women compared to reports from surviving women and women who died from HIV/AIDS.

knitr::include_graphics("table3_re.png")

Based on our results, we believe that our effort in replicating @quattrochi2019 was successful despite some inconsistency. We argue that the differences could be due to the smaller population size we used in our simulation though further investigation would be needed to discover the exact causes.

Conclusion

Our goal of this project was to replicate a published research paper and reproduce figures in it. We were able to accomplish this goal by replicating the key figures in the original paper using R and publishing well-structured source code along with data source in a licensed Github repository. This paper.Rmd R Markdown file is fully executable therefore can be used as the main entry point to our project. However, we intentionally set a couple code chunks with evaluation as False since simulating those data sources does not fit within the timeframe for any group trying to reproduce them. Instead, we created .RData file as a battery-included solution for future reproducers.

We leveraged some useful reproducibility techniques introduced in the class such as Tidyverse libraries and R Markdown syntax. The in-file executable R chunks come extremely handy during our reproduction process.

There are, however, challenges throughout our explorations. The first major issue we encountered was the lengthy source data generation step which required extensive research and deep dive into the source code. Given the nature of the HIV/AIDS estimation research, the data samples required had to be comprehensive enough to cover multiple geological locations and a wide range of populations. Therefore, the original authors designed the raw data to be generated on-the-go rather than archived in open source. Our team initially underestimated the challenge of reproducing all sources and this was also due to the lack of information on completion time in the original paper. We had to reduce population size during simulation to ensure the quality of our research and a timely deliverable. Looking back, we can identify this as an area that we could improve (focusing on verification of data source before replication). Another challenge was using R in an unfamiliar field of research. Given none of us had previous domain knowledge regarding HIV/AIDS research, we spent a lot of time to understand the terminologies while exploring the rationales behind the R code used in the original paper. We had issues trying to understnad specific failures in R givern that multiple team members had different R environment. For example, we still couldn’t fully understand some of the “knit” failures which had been resolved by cleaning up cache in R Studio. We strongly believe that using Docker or a shared development environment could drastically improve our developer experience where most errors could be reproduced easily no matter where the Docker image is running.

In general, we gained hands-on experience from reproducing a research project. After rounds of reviews and merging conflicts, we developed our own Github usage convention and code styles. We would like to carry these invaluable skills to our future Data Science journey.