Generalized linear model (GLM) is a widely used method in research. It is well-established with high flexibility and relatively straightforward interpretation. However, GLMs usually require making assumptions and having the underlying assumptions not hold could invalidate the whole model.
The application of GLM can be observed in many fields including public health. @quattrochi2019 introduced an GLM via penalized maximum likelihood to correct the under-5 mortality (U5M) in populations affected by HIV/AIDS. This method was developed since the U5M is commonly estimated using survey-based approaches in less developed regions and it is based on a fundamental assumption that the mother's survival is not related to her children's, which is often violated in populations affected by HIV/AIDS. Therefore, the author sought a method to correct this bias using GLM.
In our work, we present our endeavor to replicate the model building and regression analysis introduced in @quattrochi2019. In the meantime, we attempted using different parameters in both constructing the model and applying it to test how robust the original method was. We will in this paper report our method and results from the replication attempt.
knitr::opts_chunk$set(echo = TRUE)
if(!require(tidyverse)){ install.packages('http://cran.r-project.org/src/contrib/Archive/tidyverse/tidyverse_1.2.1.tar.gz',repos=NULL, type="source") } packageVersion("tidyverse") require(tidyverse) # ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ── # ✔ ggplot2 3.1.0 ✔ purrr 0.3.0 # ✔ tibble 2.0.1 ✔ dplyr 0.8.0.1 # ✔ tidyr 0.8.2 ✔ stringr 1.4.0 # ✔ readr 1.1.1 ✔ forcats 0.3.0 # ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ── # ✖ dplyr::filter() masks stats::filter() # ✖ dplyr::lag() masks stats::lag() if(!require(iterators)){ install.packages('https://cran.r-project.org/src/contrib/Archive/iterators/iterators_1.0.10.tar.gz',repos=NULL, type="source") } packageVersion("iterators") # 1.0.10 require(iterators) if(!require(foreach)){ install.packages('https://cran.r-project.org/src/contrib/Archive/foreach/foreach_1.4.4.tar.gz',repos=NULL, type="source") } packageVersion("foreach") # 1.4.4 require(foreach) if(!require(bcaboot)){ install.packages("bcaboot",dependencies = TRUE,repos='http://cran.us.r-project.org') } require(bcaboot) packageVersion("bcaboot") # 0.2.1 if(!require(data.table)){ install.packages("data.table",dependencies = TRUE,repos='http://cran.us.r-project.org') } require(data.table) packageVersion("data.table") # 1.11.8 if(!require(pls)){ install.packages("pls",dependencies = TRUE,repos='http://cran.us.r-project.org') } require(pls) packageVersion("pls") # 2.7.0 if(!require(glmnet)){ install.packages('https://cran.r-project.org/src/contrib/Archive/glmnet/glmnet_2.0-16.tar.gz',repos=NULL, type="source") } require(glmnet) packageVersion("glmnet") # 2.0.16 if(!require(officer)){ install.packages("officer",dependencies = TRUE,repos='http://cran.us.r-project.org') } require(officer) packageVersion("officer") # 0.3.2 if(!require(flextable)){ install.packages("flextable",dependencies = TRUE,repos='http://cran.us.r-project.org') } require(flextable) packageVersion("flextable") # 0.5.1 if(!require(magrittr)){ install.packages("magrittr",dependencies = TRUE,repos='http://cran.us.r-project.org') } require(magrittr) packageVersion("magrittr") # 1.5
The original model was implemented in R (v3.5.2) [team2013r] with web application (https://johnquattrochi.com/bias/) to show that indirect method such as surveying can underestimate U5M by 0-41% in populations where 0-40% of the people are affected by HIV/AIDS. The model was written using the R language [team2013r] and some of the main packages used were tidyverse (v1.2.1) and foreach (v1.4.4). The original implementation utilized the doReids package (v1.1.1) that launches a back-end server in Docker for parallel processing that was not adopted in our replication. The main limitation is the computing resource we had for this project within a short period of time. Based on experiments performed on a m4.xlarge EC2 instance on AWS, the original simulation settings in the target paper consumed over 63% CPU and 4GB of memory and kept running for over 3 days using Docker doRedis package. The operator estimated of finishing the entire simulation step would require an 8-core CPU machine more than 1 week. In order to complete the simulation for constructing the model in a reasonable time frame without parallel computing, we had to stop the simulation and reduced the number of populations used in the simulation input from 4480 to 1000. The code provided by the author on Github (https://github.com/jquattro/hiv-childmort-bias) was mostly well-organized and documented. However, we discovered some mistakes regarding output inconsistency and wrong variable names. Despite the debugging process, we were able to replicate the implementation.
The goal of the large-scale simulation used in the original paper was to build a demographic history that accounted for the variability among actual populations to measure if biases such as the independence assumption of mother and children survival rate in a population influenced by HIV/AIDS existed. @quattrochi2019 used stochastic simulation methods to generate a set of populations with input parameters including demographic features, mortality rate for U5M, maternal mortality rate, HIV prevalence, etc. The input simulation parameters vary over a large interval for different age groups across different populations. In each time step of one year, each woman in a population has some probabilities to give birth (for women between 15 to 49 years old), being infected with HIV, initiating ART (if HIV-positive), and dying given the simulation parameters. Through the aging process in the simulation population, children born with infected HIV mothers will have a probability of getting infected while all children will have a probability of mortality.
For calculating relative and absolute bias, we used the formulas provided in the original paper. $$\mbox{Relative bias} = \frac{IE_{survivors} - IE_{survivors,HIV deaths}}{IE_{survivors,HIV deaths}}$$
$$\mbox{Absolute bias} = IE_{survivors} - IE_{survivors,HIV deaths}$$ According to [@quattrochi2019], $IE_{survivor}$ was the estimattion of U5M based on living women in 2010, and $IE_{survivors,HIV deaths}$ was the estimation of U5M based on living women in 2010 and women died from HIV/AIDS before 2010 given the condition that they would have been 15-49 in 2010 if they had survived. $a_{d}$
# This section of code is designed to be no executable as it takes a long time to generate simulations. # Instead, input data was provided as .RData files ##### DEFINE FOLDERS ####### if(!dir.exists("results/figdata")){ dir.create("results/figdata",recursive = TRUE) } if(!dir.exists("results/regdata")){ dir.create("results/regdata",recursive = TRUE) } # Simulation results will be stored here if(!dir.exists("results/models")){ dir.create("results/models",recursive = TRUE) } ##### SIMULATION ##### # Read data hivhogan = read.csv("data/inc_curves.csv",head=TRUE) mort_series = read.csv("data/IHME_female_mortSMALL.csv",head=TRUE) adultmort = read.csv("data/MLTfemSMALL.csv",head=TRUE) worldfert= read.csv("data/world_fert.csv",head=TRUE) tfr_series= read.csv("data/tfr_gapminder_long.csv",head=TRUE) art_series= read.csv("data/sampleART.csv",head=TRUE) u5m_edit= read.csv("data/u5m_edit.csv",head=TRUE) matmort = read.csv("data/matmort.csv",head=TRUE) # Add HIV-free populations to hivhogan hivhogan$Country <- factor(hivhogan$Country,levels=c(levels(hivhogan$Country),"zero")) hivhogan[63,1] = "zero" hivhogan[63,c(2:47)]=0 # Create parameters sets fertcountry <- c("Botswana","Uganda") cm_cntry <- c("Mali","Morocco") am_cntry <- c("Madagascar","Sudan") sexactive15 <- c(30,70) mmr0 <- c(0.0012,0.012) mmr_dec <- c(0,0.073) # annual percent decline in MMR curve <- c("BotswanaUrban","LesothoRural","MalawiRural","UgandaRural","CamerounRural","zero","BotUrb2x") bfeed <- c(6,18) art_col <- c("zero","Botswana","Cameroon","Malawi","Bot_dub") growth <- c(0.03) yrend <- 2010 # Starting year yrstart <- 1906 # CD4 threshold threshold <- 200 # expand.grid() will create dataset with unique combinations in each row inputs <- expand.grid(fertcountry, cm_cntry, am_cntry, sexactive15,mmr0,mmr_dec,curve,bfeed,art_col,growth,yrend,yrstart,threshold) names(inputs) = c("fertcountry", "cm_cntry","am_cntry","sexactive15","mmr0","mmr_dec","curve","bfeed","art_col","growth","yrend","yrstart","threshold") # Save the inputs set for later save(inputs,file = "results/models/inputs.RData") file_number_format <- paste0("%0",nchar(as.character(nrow(inputs))),"d") # Set size of initial population initial_populations<- c(7500,15000,22500) years <- c(1906:2010) ages <- c(0:120) # Run the simulations for each initial population for(ip in initial_populations){ # Parameter sets already processed #ip = 150 psets_already <- as.integer(gsub("models|\\.Rdata","",list.files(paste0("./results/models/p",ip),pattern = "models\\d+",full.names = FALSE))) psets_to_process <- setdiff(1:nrow(inputs),psets_already) # Run simulation for each set of parameters foreach(pset=psets_to_process,.packages = c("dplyr")) %dopar%{ inp <- inputs[pset,] source("R/simulation_functions.R") source("R/indirect_estimates_functions.R") # Run simmulation set.seed(1) results <- bigsim(inp,initialpop=ip,years,ages,hivhogan,mort_series,adultmort,worldfert,tfr_series,art_series,u5m_edit,matmort) # Save simulation results save(results,file=file.path(paste0("./results/models/p",ip),paste("models",sprintf(file_number_format,pset),".Rdata",sep=""))) } } # Collect hiv2000 differences among initial populations # For each input row hiv2000 <- foreach(pset=1:1000,.packages = c("dplyr")) %dopar%{ # And each initial population paste0("p",initial_populations) %>% lapply(function(popini){ # Load the results file for that initial population and input set load(file.path("results/models/",popini,paste("models",sprintf(file_number_format,pset),".Rdata",sep=""))) print(paste(initial_populations,pset)) # Extract the value of hiv2000 r <- results[[1]]$hiv2000 names(r) <- popini r }) %>% unlist() %>% t %>% data.frame(input_set=pset,.) # Put everything in one data.frame } %>% bind_rows # Compute difference hiv2000 <- hiv2000 %>% mutate(p7500_minus_p15000=p7500-p15000,p15000_minus_p22500=p15000-p22500) save(hiv2000,file = "results/hiv2000.RData") ##### INDIRECT ESTIMATES FOR INITAL POPULATION 22500 ##### # See if we have something not processed psets_already <- intersect(as.integer(gsub("figdata.|\\.Rdata","",list.files("./results/figdata/p22500",pattern = "figdata.\\d+",full.names = FALSE))), as.integer(gsub("regdata.|\\.Rdata","",list.files("./results/regdata/p22500",pattern = "regdata.\\d+",full.names = FALSE)))) psets_to_process <- setdiff(1:nrow(inputs),psets_already) # For each input set not processed foreach(pset=1:1000,.packages = c("dplyr")) %dopar%{ source("R/indirect_estimates_functions_revised.R") load(file.path("results/models/p22500/",paste("models",sprintf(file_number_format,pset),".Rdata",sep=""))) print(pset) w <- results[[7]] # Are babies of HIV moms getting HIV? # What percent of HIV moms babies have HIV? # How many born with HIV? # df = as.data.frame(w) # y <- subset(df, hiv_date==2009) # plot(df$dob, df$hiv_date) ################################################## # format data momkidsclean <- as.data.frame(w) # Using all surviving women isf <-ind_est(momkidsclean) # Using all women isf_all <-ind_est_all(momkidsclean) #Using surviving women and women who died from HIV isf_hiv <- ind_est_hiv(momkidsclean) ie.three <- cbind(isf,isf_all,isf_hiv,pset) ################################################################### # Save results for regressions and for figures ################################################################### inp.plus <- as.data.frame(results[1]) regdata <- cbind(ie.three,inp.plus[1,], row.names = NULL) # Data for regressions figdata <- results[c(2:6,8:9)] # Data for figures save(regdata,file=file.path("results/regdata/p22500/",paste("regdata.",sprintf(file_number_format,pset),".Rdata",sep=""))) save(figdata,file=file.path("results/figdata/p22500/",paste("figdata.",sprintf(file_number_format,pset),".Rdata",sep=""))) print('Saved') rm(results,regdata,figdata,momkidsclean,isf,isf_all,isf_hiv,ie.three) } ##### COLLECT DATA FROM ALL SIMULATIONS FOR INITIAL POPULATION 22500 ##### # Merge regdata psets_to_process <- as.integer(gsub("regdata.|\\.Rdata","",list.files("./results/regdata/p22500",pattern = "regdata.\\d+",full.names = FALSE))) %>% sort nbd2k <- psets_to_process %>% lapply(function(pset){ filename <- file.path("results/regdata/p22500",paste("regdata.",sprintf(file_number_format,pset),".Rdata",sep="")) tryCatch({ load(filename) regdata %>% mutate(i=pset) %>% select(i,everything()) }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")}) }) %>% bind_rows() %>% as.data.frame() save(nbd2k,file="results/regdata/p22500/regdata_all.Rdata") # Merge figdata h <- list() tf<- list() ar<- list() hd<- list() u5m <- list() inps <- list() arp <- list() hivinc <- list() psets_to_process <- as.integer(gsub("figdata.|\\.Rdata","",list.files("results/figdata/p22500",pattern = "figdata.\\d+",full.names = FALSE))) %>% sort for(pset in psets_to_process){ filename <- file.path("results/figdata/p22500",paste("figdata.",sprintf(file_number_format,pset),".Rdata",sep="")) tryCatch({ load(filename) h[[pset]] <- figdata[[3]] tf[[pset]] <- figdata[[1]] ar[[pset]] <- figdata[[2]] hd[[pset]] <- figdata[[4]] inps[[pset]] <- figdata[[5]] arp[[pset]] <- figdata[[6]] hivinc[[pset]] <- figdata[[7]] rm(figdata) }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")}) } save(h,tf,ar,hd,u5m,inps,arp,file="results/figdata/p22500/figdata_all.Rdata")
Since the primary goal in @quattrochi2019 was to measure bias in indirect estimates across a set of populations that have experienced different rates of fertility, mortality, HIV infection, and ART (antiretroviral therapy) initiation. We adopted the same method to generate a set of populations using the following inputs: fertility, adult mortality, U5M, percent of 15–19 year olds who are sexually active, maternal mortality in 1990, percent annual decline in the maternal mortality rate, HIV incidence, duration of breastfeeding, and ART coverage. The original paper simulated one population for each combination of inputs resulting in a total of 4480 populations. Since the original paper did not provide information on the amount of time required to complete the simulation, we started our process with the same 4480 populations but without parallel computing, we could not complete the simulation within the timeframe of our study and stopped the simulation when it reached 1000 populations. For the purpose of our replication study, we set the seed for simulation to be 1, which was different from the seed value 1000 in the original paper.
Initially, we attempted to some parameters in the simulation construction process. We tried to reduce the year interval of our study from [1906, 2010] to [1906, 1966] as well as the population age interval from [0,120] to [0,75]. This turned out to be unsuccessful since the the above parameters were not only used in the simulation process but also in the following model selection and regression application. For instance, to estimate HIV prevalence, the model divides the population into 7 age groups, which cannot be achieved if the year or age parameter was reduced. As a result, we used the same size and date for the initial population in the simulation. We initiated with 22,500 women who were aged 15 years in 1906, and ran the simulation through 2010.
For other parameters in simulation, we followed the original paper and used the same data sources. Annual probability of birth was defined as a function of calendar year and mother’s age and set to zero for women younger than 15 years and older than 49 years. Estimates of age-specific fertility rates (ASFR) from the United Nations Population Division’s World Fertility Data [@unitednations] was used and for years when ASFR were not available, the adjusted the nearest available ASFR using the interpolated estimates of the total fertility rate (TFR) from the United Nations Population Division’s World Population Prospects [@unitednations_2013] was used.
As mentioned in the original paper, according to [@chen2010fertility], women between 15–19 years old who were HIV-positive experienced higher ASFRs compared to HIV-negative women, depending if they were sexually active. Therefore the ASFR parameter was adjusted using the ratios estimated by [@chen2010fertility]. ASFR was also adjusted based on the studies that have found an increase in pregnancy following ART [@myer2010impact][@makumbi2011associations][@homsy2009reproductive]. The effect of ART on fertility likely depends on a variety of factors including age, cluster of differentiation 4 (CD4) count at initiation according to the original paper. For the simulation model, the assumption that the ASFR for 15–19 year olds is not affected by ART was made to further simplify the simulation. Inputs relating to maternal mortality was calculated using MMR (maternal deaths per 100,000 births) in 1990 and the annual decline in MMR since 1990. The initial value of the MMR according to [@hogan2010maternal] was either 0.0012 or 0.012 and the annual rate of decline was set to 0 or 7.3%.
For other inputs including the annual probability of HIV infection[@hogan2012spline][@heuveline2003hiv], the parameters governing CD4[@hallett2008estimating], ART coverage[@databank], the annual probability of death[@worldhealthorganization_2013] and the probability of mother-to-child transmission of HIV[@stover2008spectrum], we also used the same data sources and derivation methods.
Newly-generated simulation data were then imported and fitted to various models to select for a best-performing model and the selection process used least errors to represent model performances.
The simulated data were first divided into training set (80%) and test set (20%). The training set was fitted using eight different models, which were further evaluated by calculating several error metrics (root mean square error, root median square error, mean relative error and median relative error) on both training set (in-sample) and test set (out-of-sample) shown in Table 1(Fig. \@ref(fig:table1)).
Compared to the estimation errors calculated in the original paper, we had very similar results of the root mean square errors and root median square errors for all models. However, our in-sample mean relative errors were consistently greater compared to the original paper and we suspect this could be due to the reduction in population size during the simulation. Consistent with the original paper, no single model in the table showed clear advantages for all the metrics over others we selected the generalized linear regression with alpha equal to 1 as our predictive model since it showed the best performance for most of the metrics.
knitr::include_graphics("table1_re.png")
# Get data from simulation load("results/regdata_all.Rdata") load("results/inputs.RData") # add agegroup variable nbd2k %<>% mutate(agegroup=rep(c(1:7),nrow(nbd2k)/7)) # Generate abs error and rel error nbd2k %<>% mutate(abs.err=fiveq0_surv-fiveq0_hiv, rel.err=(fiveq0_surv-fiveq0_hiv)/fiveq0_hiv ) # new dep var nbd2k %<>% mutate(corr=fiveq0_hiv-fiveq0_surv, corr2=corr+2.100105e-05) ################################################################## ##### MODEL FITTING AND MODEL SELECTION ##### ################################################################## compute_error_measures <- function(y,prediction,method,sample){ # Absolute error abs.error <- abs(y-prediction) # Relative error relative.error <- abs.error/abs(y) # Root mean square error rmse <- sqrt(mean(abs.error^2)) # Root median square error rmedse <- sqrt(median(abs.error^2)) # Mean relative error mre <- mean(Filter(function(x) !is.infinite(x),relative.error),na.rm=TRUE) # Median relative error medre <- median(relative.error) data.frame(method=method,sample=sample,rmse=rmse,rmedse=rmedse,mre=mre,medre=medre,stringsAsFactors = FALSE) } # Prepare data to.model <- nbd2k %>% mutate( agegroup=as.factor(agegroup)) full.model.formula <- corr ~ fiveq0_surv + agegroup + agegroup*hiv1990 + agegroup*hiv2000+ agegroup*hiv2010 + agegroup*art_prev2005+ agegroup*art_prev2007+ agegroup*art_prev2009 + tfr2000 + tfr2010 minimum.model.formula <- corr ~ fiveq0_surv # Out of sample data fullsamp <- unique(to.model$i) set.seed(400) # dataset with 80% of full sample ns1 <- sample(fullsamp, length(fullsamp)*.8, replace=FALSE) # Select those i's from nbd2k train.sample <- to.model %>% filter(i %in% ns1) # dataset with 20% of full sample tw1 <- setdiff(fullsamp,ns1) test.sample <- to.model %>% filter(i %in% tw1) error_table <- data.frame() ##### Full linear model ##### fit <- lm(full.model.formula,data=train.sample) prediction <- predict(fit,newx = train.sample) error_table %<>% bind_rows(compute_error_measures(train.sample$corr,prediction,"Full Linear Model","in-sample")) prediction <- predict(fit,newx = test.sample) error_table %<>% bind_rows(compute_error_measures(test.sample$corr,prediction,"Full Linear Model","out-of-sample")) ##### Forward and Backward selection ##### # maximum model model.max <- glm(formula=full.model.formula, family = "gaussian"(link="identity"), data=train.sample) # minimum model model.min <- glm(formula=minimum.model.formula, family = "gaussian"(link="identity"), data=train.sample) # Forward BIC fit <- step(object=model.min, scope=list(upper=model.max,lower=~1), direction="forward",k=log(nrow(train.sample))) prediction <- predict(fit,newx = train.sample) error_table %<>% bind_rows(compute_error_measures(train.sample$corr,prediction,"Forward Sel. BIC","in-sample")) prediction <- predict(fit,newx = test.sample) error_table %<>% bind_rows(compute_error_measures(test.sample$corr,prediction,"Forward Sel. BIC","out-of-sample")) # Forward AIC fit <- step(object=model.min, scope=list(upper=model.max,lower=~1), direction="forward",k=2) prediction <- predict(fit,newx = train.sample) error_table %<>% bind_rows(compute_error_measures(train.sample$corr,prediction,"Forward Sel. AIC","in-sample")) prediction <- predict(fit,newx = test.sample) error_table %<>% bind_rows(compute_error_measures(test.sample$corr,prediction,"Forward Sel. AIC","out-of-sample")) # Backward BIC fit <- step(model.max, direction="backward",k=log(nrow(to.model))) prediction <- predict(fit,newx = train.sample) error_table %<>% bind_rows(compute_error_measures(train.sample$corr,prediction,"Backward Sel. BIC","in-sample")) prediction <- predict(fit,newx = test.sample) error_table %<>% bind_rows(compute_error_measures(test.sample$corr,prediction,"Backward Sel. BIC","out-of-sample")) # Backward AIC fit <- step(model.max, direction="backward",k=2) prediction <- predict(fit,newx = train.sample) error_table %<>% bind_rows(compute_error_measures(train.sample$corr,prediction,"Backward Sel. AIC","in-sample")) prediction <- predict(fit,newx = test.sample) error_table %<>% bind_rows(compute_error_measures(test.sample$corr,prediction,"Backward Sel. AIC","out-of-sample")) ##### Glmnet ##### set.seed(500) glmnet.errors <- lapply(c(0,0.5,1.0),function(alpha){ # Get output vector for glmnet y <- train.sample %>% pull(all.vars(full.model.formula)[1]) # Get predictors for glmnet x <- model.matrix(full.model.formula,train.sample)[,-1] set.seed(500) fit <- cv.glmnet(x,y,family="gaussian",alpha=alpha) prediction <- predict(fit,newx = x,s = "lambda.min") train.error <- compute_error_measures(train.sample$corr,prediction,paste0("glmnet, alpha=",alpha),"in-sample") xnew <- model.matrix(full.model.formula,test.sample)[,-1] prediction <- predict(fit,newx = xnew,s = "lambda.min") test.error <- compute_error_measures(test.sample$corr,prediction,paste0("glmnet, alpha=",alpha),"out-of-sample") bind_rows(train.error,test.error) }) %>% bind_rows() error_table %<>% bind_rows(glmnet.errors) ##### PCR ##### set.seed(100) fit <- pcr(full.model.formula,data=train.sample,scale=TRUE,validation="CV", jackknife=TRUE) ncomps <- c(20,30,selectNcomp(fit,method="onesigma",plot=TRUE)) %>% sort pcr_errors <- lapply(ncomps,function(ncomp){ prediction <- predict(fit,newdata = train.sample, type = "response",ncomp = ncomp) train.error <- compute_error_measures(train.sample$corr,prediction,paste0("PCR, ncomp=",ncomp),"in-sample") prediction <- predict(fit,newx = test.sample,ncomp = ncomp) test.error <- compute_error_measures(test.sample$corr,prediction,paste0("PCR, ncomp=",ncomp),"out-of-sample") bind_rows(train.error,test.error) }) %>% bind_rows() error_table %<>% bind_rows(pcr_errors) ##### PLS ##### set.seed(100) fit <- plsr(full.model.formula,data=train.sample,scale=TRUE,validation="CV", jackknife=TRUE) summary(fit) validationplot(fit) # 32 components are enough to account for at least 90% of variaiblity in both X and Y. Also use optimal number of components ncomps <- c(32,selectNcomp(fit,method="onesigma",plot=TRUE)) %>% sort pls_errors <- lapply(ncomps,function(ncomp){ prediction <- predict(fit,newdata = train.sample, type = "response",ncomp = ncomp) train.error <- compute_error_measures(train.sample$corr,prediction,paste0("PLS, ncomp=",ncomp),"in-sample") prediction <- predict(fit,newx = test.sample,ncomp = ncomp) test.error <- compute_error_measures(test.sample$corr,prediction,paste0("PLS, ncomp=",ncomp),"out-of-sample") bind_rows(train.error,test.error) }) %>% bind_rows() error_table %<>% bind_rows(pls_errors) ###### Table 4 ##### # Prediction errors from models to correct bias in indirect estimates of U5M # Save error table ft <- error_table %>% mutate_at(vars(one_of("rmse","rmedse")),funs(sprintf("%0.6f",.))) %>% flextable() %>% set_header_labels(rmse="Root mean\nsquare error", rmedse="Root median\nsquare error", mre="Mean relative\nerror", medre="Median relative\nerror") %>% merge_v(j=1) doc <- read_docx() %>% body_add_par(value = "Model selection.", style = "table title") %>% body_add_flextable(ft) doc %>% print("tables/table4.docx") # glmnet with alpha = 1 was chosen to be the best-fitting-model ##### Best fitting model ##### # Get output vector for glmnet y <- to.model %>% pull(all.vars(full.model.formula)[1]) # Get predictors for glmnet x <- model.matrix(full.model.formula,to.model)[,-1] alpha <- 1.0 set.seed(500) best.fitting.model <- cv.glmnet(x,y,family="gaussian",alpha=alpha) lambda <-best.fitting.model$lambda.min # To use for predictions later. best.fitting.model <- glmnet(x,y,family="gaussian",alpha=alpha,lambda = lambda) save(x,best.fitting.model,full.model.formula,file="results/best_fitting_model.RData")
knitr::include_graphics("figure4-ori.png")
knitr::include_graphics("figure5-ori.png")
Figure 4 (Fig. \@ref(fig:ori-fig-4)) and 5 (Fig. \@ref(fig:ori-fig-5)) in the original article were created by applying the best-performing model as well as an independent model [@ward2008effect] to the empirical data from 2010 DHS for Malawi and Tanzania [@nationalbureauofstatistics-nbs_tanzania_icf_macro_2011][@nationalstatisticaloffice_icf_macro_2011] respectively to estimate U5M of different age groups of population. By comparing differences in U5M estimates of unadjusted data and adjusted data by two models, the authors suggested that the new model was able to correct the bias from indirect estimates.
With the new simulation data and best-fitting model, we re-generated both figures by using R scripts provided (Fig. \@ref(fig:rep-fig-4) and Fig. \@ref(fig:rep-fig-5)).
To prepare data for plotting, we created three functions with R code provided by the authors to estimate U5M using the indirect method, the Ward-Zaba model and our best-performing model. These three functions were then further integrated into a function called plot_fig to create our replicated figures. We also included a file with necessary data for calculating the confidence interval by bootstrapping for prediction with our best-fitting model, since this step usually takes more than 40 minutes to finish.
indirect_estimate <- function(dataset, fig_country){ # extract data of the plotting country cntry <- dataset %>% filter(country==fig_country) # indirect estimates with empirical data source("R/indirect_estimates_functions_for_reg.R") ind_surv <- ind_est_computations(cntry,2010) %>% as.data.frame() %>% mutate(agegroup=1:7) ind_surv %<>% select(agegroup,fiveq0=fiveq0, t_ref,refdate) # merge indirect estimates based on empirical data from Malawi or Tanzania with # data on HIV, ART and fertility from that same country # ind_est contains the results of indirect_estimates_functions applied to the data from Malawi or Tanzania # cntry contains the data from Malawi or Tanzania (from facevalidity.csv) iep <- cntry %>% left_join(ind_surv,by="agegroup") %>% rename(fiveq0_surv=fiveq0) %>% mutate(agegroup=factor(agegroup)) %>% select(-cd) colnames(iep) <- c("country","agegroup", "ceb","cs","tfr2010","tfr2000","hiv1990", "hiv2000", "hiv2010","art2005","art2006","art2007","art2008","art2009","art2010","art2011","art_prev2005","art_prev2006","art_prev2007","art_prev2008","art_prev2009","prop15to19_2010","prev15to19_2010","hiv2005","hiv2006","hiv2007","hiv2008","hiv2009","fiveq0_surv", "t_ref","refdate") return(iep) }
wz_model <- function(dataset, fig_country, iep){ # extract data of the plotting country cntry <- dataset %>% filter(country==fig_country) # Ward & Zaba (2008) coefficients a=c(.1134,.1202,.1107,.1720,.1749,.1504,.2552) b=c(-.0226,-.0438,-.0882,-.1412,-.1758,-.1849,-.3269) c=c(-.1112,-.0978,-.0373,.0163,.042,.1807,.5594) # W&Z adjustment factor from applying coefficients to empirical data nz = a*iep$hiv2000+b*(iep$hiv2000^2)+c*iep$prev15to19_2010 #Convert each n q 0 + n(z) into 5q0 # Standard indirect estimates, adding Ward&Zaba adj factors # n = c(1,2,3,5,10,15,20) # UN General Female LT, e(0)=60, for above n l = c(.92688,.91025,.90151,.89193,.89534,.89002,.88195) q = 1-l logitMLT=.5*log(q/(1-q)) alpha = NA fiveq0WZ = NA Y5 = NA nq0 <- ind_est_computations_core(cntry,2010)$nq0 for(i in 1:7){ Y5[i] = .5*log((nq0[i]+nz[i])/(1-(nq0[i]+nz[i]))) alpha[i] = Y5[i] - logitMLT[i] fiveq0WZ[i] = exp(2*(alpha[i]+logitMLT[4]))/(1+exp(2*(alpha[i]+logitMLT[4]))) } # fiveq0WZ are the Ward & Zaba estimates of U5M return(fiveq0WZ) }
best_model <- function(iep, fig_name){ # Load saved best fit model # Contains best.fitting.model and full.model.formula load("results/best_fitting_model.RData") # Load bootstrap estimates for model predictions if(file.exists("results/boot_outs.RData")){ load("results/boot_outs.RData") }else{ boot_outs <- list() } # Get the new predictions in matrix format nx <- model.matrix(full.model.formula,iep %>% mutate(corr=0))[,-1] # Make predictions prediction <- predict(best.fitting.model,newx = nx,se=TRUE,s="lambda.min") %>% as.data.frame %>% set_colnames("PredictedAdj") # Bootstrap confidence intervals for predictions if(is.null(boot_outs[[fig_name]])){ # Get output vector for glmnet y <- to.model %>% pull(all.vars(full.model.formula)[1]) # Get predictors for glmnet x <- model.matrix(full.model.formula,to.model)[,-1] Xy <- cbind(y,x) boot_models <- list(length=5) t0 <- predict(best.fitting.model,newx = x,s="lambda.min") %>% as.vector tt <- boot_models[1:5] %>%purrr::map(~predict(best.fitting.model,newx = x,s="lambda.min") %>% as.vector) %>% unlist() #lambda <- best.fitting.model$lambda.min lambda <- best.fitting.model$lambda theta <- as.matrix(coef(best.fitting.model, s = "lambda.min")) pi_hat <- predict(best.fitting.model, newx = x, s = "lambda.min", type = "response") n <- length(pi_hat) y_star <- sapply(seq_len(length(tt)), function(i) ifelse(runif(n) <= pi_hat, 1, 0)) suff_stat <- t(y_star) %*% x #bcapar(t0=t0,tt=tt,bb=suff_stat) # This is the function that we will use for each bootstrap sample theta <- function(xy,nx,glmnet_alpha,lambda){ X <- xy[,2:(dim(xy)[2])] Y <- xy[,1] # Fit the model for the sample and using the same lambda as in the best model fitted m <- glmnet(X,Y,family="gaussian",alpha=glmnet_alpha,lambda=lambda) # Keep the models computed for each bootstrap sample in order to compute prediction intervals faster # boot_models <<- c(boot_models,list(m)) # Prediction predict(m,newx = nx,s="lambda.min") %>% as.vector } # Keep the models computed for each bootstrap sample in order to compute prediction intervals faster # boot_models <- list() # Compute bootstrapped confedence intervals for each prediction point boot_outs[[fig_name]] <- lapply(1:nrow(nx),function(i){ # i <- 1 new_data <- nx[i,,drop=FALSE] set.seed(600) bcajack(x=Xy,B=500,func=theta,new_data,1.0,lambda) }) } # Get two tailed 95% CI values ci <- boot_outs[[fig_name]] %>% lapply(function(r){ r$lims %>% as.data.frame() %>% mutate(quantile=as.numeric(rownames(.))) %>% filter(quantile %in% c(0.025,0.975)) %>% select(quantile,bca) %>% mutate(quantile=case_when(quantile==0.025 ~ "ci.lo",TRUE~ "ci.hi")) %>% spread(quantile,bca) }) %>% bind_rows() prediction %<>% bind_cols(ci) return(prediction) }
plot_fig <- function(fig_name, fig_country){ # Load empirical data fv <- fread("data/facevalidity.csv") %>% set_colnames(c("country","agegroup", "ceb","cs","tfr2010","tfr2000","hiv1990", "hiv2000", "hiv2010","art2005","art2006","art2007","art2008","art2009","art2010","art2011","art_prev2005","art_prev2006","art_prev2007","art_prev2008","art_prev2009","prop15to19_2010","prev15to19_2010","hiv2005","hiv2006","hiv2007","hiv2008","hiv2009")) %>% mutate(cd=ceb-cs) # indirect estimates with empirical data iep <- indirect_estimate(fv, fig_country) # estimate from Ward & Zaba (2008) model fiveq0WZ <- wz_model(fv, fig_country, iep) # make prediction with best-performing model prediction <- best_model(iep, fig_name) # Prepare data for plot to.plot <- bind_cols(iep, prediction,data.frame(fiveq0WZ=fiveq0WZ)) %>% mutate(fit=fiveq0_surv+PredictedAdj,lwr=fiveq0_surv+ci.lo,upr=fiveq0_surv+ci.hi)%>% # Compute adjusted values and prediction interval upper and lower limits select(refdate,upr,lwr,Adjusted=fit,Unadjusted=fiveq0_surv,`Ward & Zaba`=fiveq0WZ) %>% # Transform data for plot gather(type,value,-refdate,-upr,-lwr) # Plot ggplot(to.plot,aes(x=refdate,y=value,shape=type)) + geom_point(color="black", size=3) + geom_errorbar(aes(ymin=lwr, ymax=upr))+ xlab("Year") + ylab("Under-five mortality (deaths per live birth)") + scale_shape_discrete(name="Estimate type", # Legend label, use darker colors labels=c("Adjusted", "Unadjusted","Ward & Zaba"))+ labs(title= fig_country) + theme_classic() + theme(axis.text = element_text(color="black")) }
# Figure 4 # Under-five mortality estimates using CEB/CS from 2010 DHS for Malawi: # unadjusted (crude), adjusted (corrected) using best-performing model, and using Ward & # Zaba (2008) model plot_fig('figure4','Malawi')
# Figure 5 # Under-five mortality estimates using CEB/CS from 2010 DHS for Tanzania: # unadjusted (crude), adjusted (corrected) using best-performing model, and using Ward & # Zaba (2008) model plot_fig('figure5','Tanzania')
We revised some parts of the regression model and corrected inconsistent variable names in the original code during the debugging process. For instance, caught corrected the variable name “lambda.min” where it should have been “lambda”.
boot_models <- list(length=5) t0 <- predict(best.fitting.model,newx = x,s="lambda.min") %>% as.vector tt <- boot_models[1:5] %>%purrr::map(~predict(best.fitting.model,newx = x,s="lambda.min") %>% as.vector) %>% unlist() #lambda <- best.fitting.model$lambda.min lambda <- best.fitting.model$lambda theta <- as.matrix(coef(best.fitting.model, s = "lambda.min")) pi_hat <- predict(best.fitting.model, newx = x, s = "lambda.min", type = "response") n <- length(pi_hat) y_star <- sapply(seq_len(length(tt)), function(i) ifelse(runif(n) <= pi_hat, 1, 0)) suff_stat <- t(y_star) %*% x
Our implementation of the model to estimate the bias in U5M in HIV/AIDS affected populations from @quattrochi2019 was able to replicate the process and output target tables and figures. Consistently, indirect methods underestimate U5M in older populations, while little difference was shown in younger age groups between unadjusted and adjusted estimates. Applying our model to 2010 survey data from Malawi and Tanzania, we show that indirect methods would underestimate U5M by up to 7%.
Table 2 (Fig. \@ref(fig:table2)) shows our estimation by applying the best performing model on 1,000 simulated populations and consistent with the original implementation, the mean HIV prevalence we calculated among women aged 15–49 across populations was 7% in 1990, 15% in 2000, and 11% in 2010. The highest HIV prevalence in any simulation was 40% in 2000. Our results related to ART showed the same trend as in the original paper that both prevalence and coverage increase with time, although the exact numbers varied. The mean ART coverage across our simulations was 0% in 2004 and 8% in 2010. The mean ART prevalence was < 0.1% in 2005 and 0.19% in 2009. The highest ART prevalence in any simulation was 2.3% in 2009 and this was 1.7% lower compared to the original model. The mean TFR across simulations was 4.88 in 2000 and 4.28 in 2010, also consistent with the original model.
knitr::include_graphics("table2_re.png")
Table 3 (Fig. \@ref(fig:table3)) shows the bias in indirect estimates across age groups. According to the original paper, negative numbers indicate that unadjusted estimates were lower than adjusted estimates. The mean absolute bias was largest for estimates from women aged 35–39 and 40–44 (−0.021) and smallest for estimates from women aged 15–19 (0.000). Our results showed the same age group with a small variance in the exact numbers. The original paper also emphasized on the largest absolute bias recorded, which was −0.071 for estimates from women 35–39 in our calculation and -0.069 for them. The paper pointed out that this means that the estimated U5M was 71 deaths per 1000 live births lower when using only reports from surviving women compared to reports from surviving women and women who died from HIV/AIDS.
knitr::include_graphics("table3_re.png")
Based on our results, we believe that our effort in replicating @quattrochi2019 was successful despite some inconsistency. We argue that the differences could be due to the smaller population size we used in our simulation though further investigation would be needed to discover the exact causes.
Our goal of this project was to replicate a published research paper and reproduce figures in it. We were able to accomplish this goal by replicating the key figures in the original paper using R and publishing well-structured source code along with data source in a licensed Github repository. This paper.Rmd R Markdown file is fully executable therefore can be used as the main entry point to our project. However, we intentionally set a couple code chunks with evaluation as False since simulating those data sources does not fit within the timeframe for any group trying to reproduce them. Instead, we created .RData file as a battery-included solution for future reproducers.
We leveraged some useful reproducibility techniques introduced in the class such as Tidyverse libraries and R Markdown syntax. The in-file executable R chunks come extremely handy during our reproduction process.
There are, however, challenges throughout our explorations. The first major issue we encountered was the lengthy source data generation step which required extensive research and deep dive into the source code. Given the nature of the HIV/AIDS estimation research, the data samples required had to be comprehensive enough to cover multiple geological locations and a wide range of populations. Therefore, the original authors designed the raw data to be generated on-the-go rather than archived in open source. Our team initially underestimated the challenge of reproducing all sources and this was also due to the lack of information on completion time in the original paper. We had to reduce population size during simulation to ensure the quality of our research and a timely deliverable. Looking back, we can identify this as an area that we could improve (focusing on verification of data source before replication). Another challenge was using R in an unfamiliar field of research. Given none of us had previous domain knowledge regarding HIV/AIDS research, we spent a lot of time to understand the terminologies while exploring the rationales behind the R code used in the original paper. We had issues trying to understnad specific failures in R givern that multiple team members had different R environment. For example, we still couldn’t fully understand some of the “knit” failures which had been resolved by cleaning up cache in R Studio. We strongly believe that using Docker or a shared development environment could drastically improve our developer experience where most errors could be reproduced easily no matter where the Docker image is running.
In general, we gained hands-on experience from reproducing a research project. After rounds of reviews and merging conflicts, we developed our own Github usage convention and code styles. We would like to carry these invaluable skills to our future Data Science journey.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.