knitr::opts_chunk$set( collapse = TRUE )
# library(devtools)
# install_github('dereksonderegger/BurkPx')
library(BurkPx)
library(ggplot2)
library(dplyr)     # filter, group_by, select, summarise, sample_frac, ...
library(tidyr)     # spread, unite, complete, separate
library(stringr)   # str_c
Our initial research identified 21 different antigens that could be predictive of melioidosis. To create an assay with high sensitivity and specificity for future observations, we built a statistical model from our present data while protecting against overfitting by using the shrinkage methods LASSO and ridge regression. These methods modify a conventional linear model by continuously shrinking the model coefficients towards zero, with the amount of shrinkage selected via k-fold cross-validation. LASSO has the convenient property of pushing some regression coefficients exactly to zero, effectively removing those antigens from the regression model, while ridge regression typically just makes the coefficients small. As a result, LASSO tends to produce a more parsimonious set of antigens on which to base the assay.
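To illustrate the fitting step, the following is a minimal sketch of how a LASSO or ridge logistic regression with cross-validated shrinkage can be fit with the glmnet package. The data frame name `train_df` and its columns are placeholders; this is not the exact code used to build the models stored in this package.

# A minimal sketch, assuming a training data frame `train_df` with a binary
# `Status` column and one column per antigen (placeholder names).
library(glmnet)

X <- as.matrix( train_df %>% select(-Status) )   # antigen measurements
y <- train_df$Status                             # case/control indicator

# alpha = 1 gives LASSO; alpha = 0 gives ridge regression.
# cv.glmnet() picks the shrinkage parameter lambda by k-fold cross-validation.
lasso_fit <- cv.glmnet(X, y, family = 'binomial', alpha = 1, nfolds = 10)
ridge_fit <- cv.glmnet(X, y, family = 'binomial', alpha = 0, nfolds = 10)

# Coefficients at the cross-validation-selected shrinkage value;
# LASSO sets many of these exactly to zero, ridge only makes them small.
coef(lasso_fit, s = 'lambda.1se')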
We first split the human patients into test/training sets and then fit all the various models (IgG, IgM, IgGM, etc.) using only patients from the training set. Because we have 100 melioidosis patients, 50 of those patients were assigned to the test group and 50 to the training group. Likewise, of the 400 controls, 200 were assigned to the test set and 200 to the training set.
Once a patient has been assigned to either the test or training set, all of that patient's serologies are included in the set. This means that a single patient with many serologies might have an outsized effect, but we accepted this in order to keep our sample sizes as large as possible. We created statistical models from the training set using both LASSO and ridge regression, and then challenged those models by calculating Receiver Operating Characteristic (ROC) curves and their Area Under the Curve (AUC) on the testing set. Finally, we created LASSO and ridge regression models using the full set of data; these are the models we recommend for making predictions about future observations.
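The key point is that whole patients, not individual serologies, are assigned to the two sets. A minimal sketch of such a patient-level split is below; `Human_BurkPx` is a placeholder name for the long-format data, and this is not necessarily the code that produced the train/test sets shipped with the package.

set.seed(101)   # placeholder seed

# Assign whole patients to the test or training set, splitting cases and
# controls 50/50 within each Status group.
patient_split <- Human_BurkPx %>%
  distinct(PatientID, Status) %>%
  group_by(Status) %>%
  mutate( Set = sample( rep(c('Training','Test'), length.out = n()) ) ) %>%
  ungroup()

# All serologies from a patient follow that patient into its assigned set.
Human_train <- Human_BurkPx %>%
  semi_join( patient_split %>% filter(Set == 'Training'), by = 'PatientID' )
Human_test  <- Human_BurkPx %>%
  semi_join( patient_split %>% filter(Set == 'Test'),     by = 'PatientID' )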
For the human data, we generated AUC values for each week after the patient was admitted to the hospital. To do this by week, we split the testing data into Healthy, Week 1, Week 2, etc., and for each group we calculated the AUC using each of the three human models (IgG, IgM, IgGM).
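In the loops below, the test-set AUC for each combination of time group, serology type, and model is recalculated on 40 bootstrap resamples of the test data, which lets us report a quantile-based interval around the mean AUC. The following is a stripped-down sketch of that calculation for a single model and time group; `test_df` and `fit` are placeholder names.

# A minimal sketch of the bootstrap AUC calculation, assuming `test_df` is a
# wide test-set data frame with a Status column and `fit` is a fitted model.
boot_auc <- rep(NA, 40)
for( rep_i in 1:40 ){
  boot_df   <- test_df %>% sample_frac(1, replace = TRUE)   # resample rows with replacement
  boot_df$p <- predict(fit, newdata = boot_df, na.action = na.pass) %>% as.vector()
  boot_auc[rep_i] <- pROC::auc( pROC::roc(Status ~ p, data = boot_df) )
}
mean(boot_auc); quantile(boot_auc, c(.025, .95))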
#################################################
##           ROC for each model                ##
#################################################
# Scratch values used while developing the loop; overwritten by the loops below.
time   = 'Week1'
method = 'LASSO'
type   = 'IgG'

AUC_results <- NULL
for(time in c('Week1','Week2','Week3', 'Week4+')){
  for(method in c('LASSO', 'Ridge', 'HCP1')){
    for(type in c('IgM', 'IgG', 'IgGM')){
      for(rep in 1:40){     # 40 bootstrap resamples of the test set
        if( type %in% c('IgM', 'IgG') ){
          # Single-serology models: keep just this Type and spread antigens into columns.
          df <- Human_BurkPx_test %>%
            filter(Type == type) %>%
            filter(TimeGroup %in% c('Healthy', time)) %>%
            spread(Antigen, Value) %>%
            group_by() %>%
            select(-PatientID, -SerumID, -TimeGroup, -Type, -Rep) %>%
            complete() %>%
            sample_frac(1, replace = TRUE)
        }else{
          # IgGM models: combine IgG and IgM values by uniting the Type and Antigen names.
          df <- Human_BurkPx_test %>%
            group_by(PatientID, Type, Rep) %>%
            unite( 'Antigen', Type, Antigen ) %>%
            filter(TimeGroup %in% c('Healthy', time)) %>%
            spread(Antigen, Value) %>%
            group_by() %>%
            select(-PatientID, -TimeGroup, -Rep) %>%
            sample_frac(1, replace = TRUE)
        }
        # AUC on this resample for the model trained on the training set only
        df$p <- predict(models[[str_c('Human_Training_',type,'_',method)]],
                        newdata=df, na.action = na.pass) %>% as.vector()
        temp <- pROC::roc(Status ~ p, data=df)
        AUC_results <- AUC_results %>%
          rbind( data.frame(Type = type, TimeGroup=time, method=method,
                            Train = 'Training', rep=rep, AUC=pROC::auc(temp) ) )
        # ... and for the model trained on the full (training + test) dataset
        df$p <- predict(models[[str_c('Human_Full_',type,'_',method)]],
                        newdata=df, na.action=na.pass) %>% as.vector()
        temp <- pROC::roc(Status ~ p, data=df)
        AUC_results <- AUC_results %>%
          rbind( data.frame(Type = type, TimeGroup=time, method=method,
                            Train = 'Full', rep = rep, AUC=pROC::auc(temp) ) )
      }
    }
  }
}
AUC_results %>%
  group_by( Type, TimeGroup, method, Train ) %>%
  summarise( lwr = quantile(AUC, .025),
             upr = quantile(AUC, .95),
             AUC = mean(AUC) ) %>%
  filter(method != 'HCP1') %>%
  ggplot(., aes(x=TimeGroup, y=AUC, color=method)) +
  facet_grid(Train ~ Type) +
  geom_point() +
  geom_errorbar( aes(ymin=lwr, ymax=upr), width=.3) +
  geom_line(aes(x=as.numeric(TimeGroup))) +
  theme(axis.text.x = element_text(angle=-45, vjust=0, hjust=.5) ) +
  labs(title='AUC values: Human Models trained on all timepoints')
The graph columns denote the serum type and the graph rows denote which dataset the model was trained on. The Training row shows models trained on the training set and challenged with the test set. The Full row shows models built using both the training and test sets and then challenged with that same test set.
From this analysis, it is clear that for the first week we need both the IgG and IgM serologies. It also appears that LASSO works somewhat better than ridge regression and that IgG works better than IgM. However, the best performance comes from the IgGM models, which use both the IgG and IgM observation values.
Next we look at the human models that were trained on only the Week 1 and Week 2 data. We challenge these models with the same Version 1 test set as before so that the comparisons with the previous models are made on the same data.
#################################################
##           ROC for each model                ##
#################################################
# Scratch values used while developing the loop; overwritten by the loops below.
time   = 'Week1'
method = 'LASSO'
type   = 'IgG'

AUC_results2 <- NULL
for(time in c('Week1','Week2','Week3', 'Week4+')){
  for(method in c('LASSO', 'Ridge', 'HCP1')){
    for(type in c('IgM', 'IgG', 'IgGM')){
      for( rep in 1:40 ){    # 40 bootstrap resamples of the test set
        if( type %in% c('IgM', 'IgG') ){
          df <- Human_BurkPx_test %>%
            filter(Type == type) %>%
            filter(TimeGroup %in% c('Healthy', time)) %>%
            spread(Antigen, Value) %>%
            group_by() %>%
            select(-PatientID, -SerumID, -TimeGroup, -Type, -Rep) %>%
            complete() %>%
            sample_frac(1, replace = TRUE)
        }else{
          df <- Human_BurkPx_test %>%
            group_by(PatientID, Type, Rep) %>%
            unite( 'Antigen', Type, Antigen ) %>%
            filter(TimeGroup %in% c('Healthy', time)) %>%
            spread(Antigen, Value) %>%
            group_by() %>%
            select(-PatientID, -TimeGroup, -Rep) %>%
            sample_frac(1, replace = TRUE)
        }
        # Models trained on the Week 1 and 2 data from the training set only
        df$p <- predict(models[[str_c('Human_Training2_',type,'_',method)]],
                        newdata=df, na.action = na.pass) %>% as.vector()
        temp <- tryCatch({ pROC::roc(Status ~ p, data=df) %>% pROC::auc(.) },
                         error=function(err){NA}, warning=function(war){NA})
        AUC_results2 <- AUC_results2 %>%
          rbind( data.frame(Type = type, TimeGroup=time, method=method,
                            Train = 'Training', rep = rep, AUC=temp ) )
        # Models trained on the Week 1 and 2 data from the full dataset
        df$p <- predict(models[[str_c('Human_Full2_',type,'_',method)]],
                        newdata=df, na.action=na.pass) %>% as.vector()
        temp <- tryCatch({ pROC::roc(Status ~ p, data=df) %>% pROC::auc(.) },
                         error=function(err){NA}, warning=function(war){NA})
        AUC_results2 <- AUC_results2 %>%
          rbind( data.frame(Type = type, TimeGroup=time, method=method,
                            Train = 'Full', rep = rep, AUC=temp ) )
      }
    }
  }
}
AUC_results2 %>%
  group_by( Type, TimeGroup, method, Train ) %>%
  # na.rm = TRUE guards against resamples where roc()/auc() failed and AUC is NA
  summarise( lwr = quantile(AUC, .025, na.rm=TRUE),
             upr = quantile(AUC, .95,  na.rm=TRUE),
             AUC = mean(AUC, na.rm=TRUE) ) %>%
  filter(method != 'HCP1') %>%
  ggplot(., aes(x=TimeGroup, y=AUC, color=method)) +
  facet_grid(Train ~ Type) +
  geom_point() +
  geom_errorbar(aes(ymin=lwr, ymax=upr), width=.3) +
  geom_line(aes(x=as.numeric(TimeGroup))) +
  theme(axis.text.x = element_text(angle=-45, vjust=0, hjust=.5) ) +
  labs(title='AUC values: Human Models trained on Week 1 and 2')
For each of the full LASSO models, it would be helpful to know which antigens were selected for inclusion in the model.
# Which covariates are used in the IgG LASSO model?
temp <- coef( models$Human_Full2_IgG_LASSO )
data.frame(Antigen=rownames(temp), Coef=as.vector(temp)) %>%
  filter(Antigen != '(Intercept)', Coef != 0 ) %>%
  pander::pander()
# Which covariates are used in the IgM LASSO model?
temp <- coef( models$Human_Full2_IgM_LASSO )
data.frame(Antigen=rownames(temp), Coef=as.vector(temp)) %>%
  filter(Antigen != '(Intercept)', Coef != 0 ) %>%
  pander::pander()
# Which covariates are used in the IgGM LASSO model?
temp <- coef( models$Human_Full2_IgGM_LASSO )
data.frame(Antigen=rownames(temp), Coef=as.vector(temp)) %>%
  filter(Antigen != '(Intercept)', Coef != 0 ) %>%
  pander::pander()
# Of those, which are used in both IgG and IgM?
data.frame(Antigen=rownames(temp), Coef=as.vector(temp)) %>%
  filter(Antigen != '(Intercept)', Coef != 0 ) %>%
  separate(Antigen, into=c('Type','Antigen'), extra='merge') %>%
  group_by(Antigen) %>% arrange(Antigen) %>%
  count() %>%
  pander::pander()
For the LASSO models, as the shrinkage parameter $\lambda$ increases, the number of non-zero model coefficients decreases. We can examine how the AUC decreases as a function of $\lambda$, and therefore indirectly as a function of number of non-zero model coefficients. In the following graphs, the numbers along the top of the graph represent the number of non-zero coefficients. The dotted line on the right represents the selected shrinkage value we used for the Full LASSO models.
par(mfrow=c(3,1))
plot(models$Human_Full_IgGM_LASSO, main='IgGM \n')
plot(models$Human_Full_IgG_LASSO,  main='IgG \n')
plot(models$Human_Full_IgM_LASSO,  main='IgM \n')
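If the stored LASSO fits are cv.glmnet-style objects (an assumption on our part, though the plots above are consistent with that), the number of non-zero coefficients along the λ path can also be read off directly rather than from the graph.

# A minimal sketch, assuming models$Human_Full_IgGM_LASSO behaves like a glmnet::cv.glmnet fit.
fit <- models$Human_Full_IgGM_LASSO
data.frame( lambda     = fit$lambda,    # candidate shrinkage values
            nonzero    = fit$nzero,     # number of non-zero coefficients at each lambda
            cv_measure = fit$cvm ) %>%  # cross-validated performance at each lambda
  head()

# Coefficients at the larger "one standard error" lambda, i.e. the
# right-hand dotted line described in the text above.
coef(fit, s = 'lambda.1se')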
The nonhuman primate (NHP) data was collected from both the Batelle and Tulane studies. When we created the testing and training sets, we split the Batelle individual subjects equally into the test and training sets, and likewise split the Tulane negative individuals equally into the test and training sets. We created two NHP train/test sets. In the first version, the Batelle animals were selected randomly from among the Batelle animals that survived the full experiment. The second train/test set was created from the complete dataset.
out <- NHP_BurkPx_train1 %>%
  group_by(PatientID, Origin, Status) %>% count() %>%
  group_by(Origin, Status) %>% count() %>%
  mutate(Type = 'Training' )
out <- rbind(out,
  NHP_BurkPx_test1 %>%
    group_by(PatientID, Origin, Status) %>% count() %>%
    group_by(Origin, Status) %>% count() %>%
    mutate(Type = 'Test' ))
out %>% pander::pander()
We next look at the AUC values for the testing set using either the model trained on just the training set or the model trained on the full Version 1 dataset. Because we would expect a model whose training data includes the test set to perform well on that test set, we should only draw conclusions from the model built from the training set. The only reason the model built from the full dataset exists is so it can be used on future data; we show its results here simply to verify that the full model does not do worse than the training model.
#################################################
##           ROC for each model                ##
#################################################
# Scratch values used while developing the loop; overwritten by the loops below.
time   = 'Week1'
method = 'LASSO'
type   = 'IgG'

AUC_results <- NULL
for(time in c('Week1','Week2','Week3', 'Week4+')){
  for(method in c('LASSO', 'Ridge', 'HCP1')){
    for(type in c('IgM', 'IgG', 'IgGM')){
      for( rep in 1:40 ){    # 40 bootstrap resamples of the test set
        if( type %in% c('IgM', 'IgG') ){
          df <- NHP_BurkPx_test1 %>%
            filter(Type == type) %>%
            filter(TimeGroup %in% c('Healthy','Prior', time)) %>%
            spread(Antigen, Value) %>%
            group_by() %>%
            select(-PatientID, -SerumID, -TimeGroup, -Type, -Rep) %>%
            complete() %>%
            sample_frac(1, replace = TRUE)
        }else{
          # Note: unlike the IgM/IgG branch above, this branch does not restrict TimeGroup.
          df <- NHP_BurkPx_test1 %>%
            group_by(PatientID, Type, Rep) %>%
            unite( 'Antigen', Type, Antigen ) %>%
            spread(Antigen, Value) %>%
            group_by() %>%
            select(-PatientID, -TimeGroup, -Rep) %>%
            complete() %>%
            sample_frac(1, replace = TRUE)
        }
        # Model trained on the training set only
        df$p <- predict(models[[str_c('NHP_Training1_',type,'_',method)]],
                        newdata=df, na.action = na.pass) %>% as.vector()
        AUC <- tryCatch({ pROC::roc(Status ~ p, data=df) %>% pROC::auc(.) },
                        error=function(err){NA} )
        AUC_results <- AUC_results %>%
          rbind( data.frame(Type = type, TimeGroup = time, Training='Training',
                            method=method, rep=rep, AUC=AUC ) )
        # Model trained on the full Version 1 dataset
        df$p <- predict(models[[str_c('NHP_Full1_',type,'_',method)]],
                        newdata=df, na.action=na.pass) %>% as.vector()
        AUC <- tryCatch({ pROC::roc(Status ~ p, data=df) %>% pROC::auc(.) },
                        error=function(err){NA} )
        AUC_results <- AUC_results %>%
          rbind( data.frame(Type = type, TimeGroup = time, Training='Full',
                            method=method, rep=rep, AUC=AUC) )
      }
    }
  }
}
AUC_results %>%
  group_by( Type, TimeGroup, method, Training ) %>%
  summarise( lwr = quantile(AUC, .025, na.rm=TRUE),
             upr = quantile(AUC, .95,  na.rm=TRUE),
             AUC = mean(AUC, na.rm=TRUE) ) %>%
  filter(method != 'HCP1') %>%
  ggplot(., aes(x=TimeGroup, y=AUC, color=method)) +
  facet_grid( Training ~ Type ) +
  geom_point() +
  geom_errorbar(aes(ymin=lwr, ymax=upr), width=.3) +
  geom_line(aes(x=as.numeric(TimeGroup))) +
  theme(axis.text.x = element_text(angle=-45, vjust=0, hjust=.5) ) +
  labs(title='AUC values: NHP Models trained on all Time points')
For the non-human primates, we see that the models trained on only the training data still did quite well at predicting the testing set, with AUC values above 0.90. The models trained on both the test and training sets (the "Full" data) did a better job of predicting the testing set, as we would expect. Notably, there is not a huge difference in predictive capability and, unlike the human data, we see that the IgM serum is better at prediction than the IgG.
out <- NHP_BurkPx_train3 %>%
  group_by(PatientID, Origin, Status) %>% count() %>%
  group_by(Origin, Status) %>% count() %>%
  mutate(Type = 'Training' )
out <- rbind(out,
  NHP_BurkPx_test3 %>%
    group_by(PatientID, Origin, Status) %>% count() %>%
    group_by(Origin, Status) %>% count() %>%
    mutate(Type = 'Test' ))
out %>% pander::pander()
We next look at the AUC values for this testing set using either the model trained on just the corresponding training set or the model trained on the corresponding full dataset. As before, because we would expect a model whose training data includes the test set to perform well on that test set, we should only draw conclusions from the model built from the training set. Again, the model built from the full dataset exists only so it can be used on future data; we show its results here to verify that the full model does not do worse than the training model.
#################################################
##           ROC for each model                ##
#################################################
# Scratch values used while developing the loop; overwritten by the loops below.
time   = 'Week1'
method = 'LASSO'
type   = 'IgG'

AUC_results <- NULL
for(time in c('Week1','Week2','Week3', 'Week4+')){
  for(method in c('LASSO', 'Ridge', 'HCP1')){
    for(type in c('IgM', 'IgG', 'IgGM')){
      for( rep in 1:40 ){    # 40 bootstrap resamples of the test set
        if( type %in% c('IgM', 'IgG') ){
          df <- NHP_BurkPx_test2 %>%
            filter(Type == type) %>%
            filter(TimeGroup %in% c('Healthy','Prior', time)) %>%
            spread(Antigen, Value) %>%
            group_by() %>%
            select(-PatientID, -SerumID, -TimeGroup, -Type, -Rep) %>%
            complete() %>%
            sample_frac(1, replace = TRUE)
        }else{
          df <- NHP_BurkPx_test2 %>%
            group_by(PatientID, Type, Rep) %>%
            unite( 'Antigen', Type, Antigen ) %>%
            filter(TimeGroup %in% c('Healthy','Prior', time)) %>%
            spread(Antigen, Value) %>%
            group_by() %>%
            select(-PatientID, -TimeGroup, -Rep) %>%
            sample_frac(1, replace = TRUE)
        }
        # If roc()/auc() fails on a resample (e.g. only one Status level), skip that row.
        out <- tryCatch({
          df$p <- predict(models[[str_c('NHP_Training3_',type,'_',method)]],
                          newdata=df, na.action = na.pass) %>% as.vector()
          temp <- pROC::roc(Status ~ p, data=df)
          AUC_results <- AUC_results %>%
            rbind( data.frame(Type = type, TimeGroup = time, Training='Training',
                              method=method, AUC=pROC::auc(temp) ) )
        }, error=function(err){NULL}, warning=function(war){NULL} )
        out <- tryCatch({
          df$p <- predict(models[[str_c('NHP_Full3_',type,'_',method)]],
                          newdata=df, na.action=na.pass) %>% as.vector()
          temp <- pROC::roc(Status ~ p, data=df)
          AUC_results <- AUC_results %>%
            rbind( data.frame(Type = type, TimeGroup = time, Training='Full',
                              method=method, AUC=pROC::auc(temp) ) )
        }, error=function(err){NULL}, warning=function(war){NULL} )
      }
    }
  }
}
AUC_results %>%
  group_by( Type, TimeGroup, method, Training ) %>%
  summarise( lwr = quantile(AUC, .025, na.rm=TRUE),
             upr = quantile(AUC, .95,  na.rm=TRUE),
             AUC = mean(AUC, na.rm=TRUE) ) %>%
  filter(method != 'HCP1') %>%
  ggplot(., aes(x=TimeGroup, y=AUC, color=method)) +
  facet_grid( Training ~ Type ) +
  geom_point() +
  geom_errorbar(aes(ymin=lwr, ymax=upr), width=.3) +
  geom_line(aes(x=as.numeric(TimeGroup))) +
  theme(axis.text.x = element_text(angle=-45, vjust=0, hjust=.5) ) +
  labs(title='AUC values: NHP Models trained on week 1 and 2')
For the non-human primates, we again see that the models trained on only the training data did quite well at predicting the testing set, with AUC values above 0.90, while the models trained on both the test and training sets (the "Full" data) did better, as we would expect. Notably, there is not a huge difference in predictive capability and, unlike the human data, we see that the IgGM combination of sera is better at prediction than the IgG serum alone.
par(mfrow=c(3,1))
plot(models$NHP_Full2_IgGM_LASSO, main='IgGM \n')
plot(models$NHP_Full2_IgG_LASSO,  main='IgG \n')
plot(models$NHP_Full2_IgM_LASSO,  main='IgM \n')
Because of the reduced sample size compared to the human models, the NHP models were fit by minimizing the binomial deviance rather than by maximizing the AUC.
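This distinction only affects how the cross-validation chooses the shrinkage parameter. Below is a minimal sketch of the two options, assuming glmnet-style cross-validation is used under the hood (an assumption; the simulated data and object names are placeholders, not the package's fitting code).

library(glmnet)

# Simulated stand-in data, purely for illustration.
set.seed(1)
X <- matrix(rnorm(300 * 21), nrow = 300, ncol = 21)   # 21 antigen-like covariates
y <- rbinom(300, size = 1, prob = plogis(X[,1] - X[,2]))

# Larger datasets (human models): lambda chosen by cross-validated AUC.
fit_auc      <- cv.glmnet(X, y, family = 'binomial', alpha = 1, type.measure = 'auc')

# Smaller datasets (NHP models): lambda chosen by cross-validated binomial deviance.
fit_deviance <- cv.glmnet(X, y, family = 'binomial', alpha = 1, type.measure = 'deviance')

c(auc_lambda = fit_auc$lambda.min, deviance_lambda = fit_deviance$lambda.min)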