Confidence Intervals

#load various variable definitions that are the same for each app
source('startup_script.R')
sapply(files_to_source, source) #source some helper files defined in the files_to_source variable
currentrmdfile = knitr::current_input()  #get current file name
appsettings = get_settings(currentrmdfile,appdocdir,packagename) #get settings for current app

Overview {#shinytab1}

This app illustrates the method of bootstrapping data, which is one way to obtain confidence intervals for parameters when fitting mechanistic models to data.

Learning Objectives

The Model {#shinytab2}

Model Overview

The data and the underlying simulation model are the same as in the Basic Virus Model Fitting app. If you haven't worked your way through that app, do that first.

The underlying model that is being fit to the data is again the Basic Virus Model; check out that app for a more detailed description.

Model Diagram

The model is represented by this flow diagram:

knitr::include_graphics(path = paste0("../media/",appsettings$modelfigname))

Model Equations

The model is implemented as a set of ordinary differential equations:

\begin{align}
\dot U & = n - d_U U - bUV \\
\dot I & = bUV - d_I I \\
\dot V & = pI - d_V V - gb UV
\end{align}
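To make the structure of these equations concrete, here is a minimal sketch of the model coded as an ODE function for `deSolve::ode()`. This is an illustration only, not the DSAIRM simulator function itself, and the initial conditions and parameter values are arbitrary example values.

```r
library(deSolve)

# Minimal sketch of the model equations as a deSolve ODE function.
# Illustration only; the actual DSAIRM simulator differs in its details.
basicvirus_ode <- function(t, y, parms) {
  with(as.list(c(y, parms)), {
    dUdt <- n - dU * U - b * U * V          # uninfected cells
    dIdt <- b * U * V - dI * I              # infected cells
    dVdt <- p * I - dV * V - g * b * U * V  # free virus
    list(c(dUdt, dIdt, dVdt))
  })
}

# Arbitrary example initial conditions and parameter values
y0    <- c(U = 1e5, I = 0, V = 1)
parms <- c(n = 0, dU = 0, b = 1e-5, dI = 1, p = 10, dV = 2, g = 1)
out   <- ode(y = y0, times = seq(0, 30, by = 0.1), func = basicvirus_ode, parms = parms)
```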

Model Concepts

Fitting Approach

The basic fitting approach is the one described in the Basic Virus Model Fitting app, i.e., we minimize the sum of squares of the log differences between model predictions and virus load data, taking censored values into account. One minor difference is that we now only estimate the parameters _b_ and _d~V~_.
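As a rough illustration of this type of objective function (not the actual DSAIRM code), one could compute the sum of squares on the log scale with censored values set to the assay's limit of detection. The names `modelpred`, `virusdata`, and `LOD` below are hypothetical placeholders.

```r
# Rough sketch of a log-scale sum-of-squares objective (hypothetical names,
# not the actual DSAIRM code). Values below the limit of detection (LOD) are
# treated as censored by setting both data and prediction to the LOD.
log_ssr <- function(modelpred, virusdata, LOD) {
  logpred <- log10(pmax(modelpred, LOD))  # censor model predictions at the LOD
  logdata <- log10(pmax(virusdata, LOD))  # censor measurements at the LOD
  sum((logpred - logdata)^2)              # sum of squared log differences
}
```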

There are a few new components in this app. You can fit the parameters either in linear or log space. Fitting in log space can be useful if the parameter values you want to explore span a large range. The underlying scientific problem is the same no matter what scale you fit your parameters on, but sometimes one version or the other leads to better performance of your solver.
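As a small illustration of the idea (using `stats::optim` as a stand-in for the app's solver, and a hypothetical objective function `objfun` that expects parameters on their natural scale):

```r
# Illustrative only: fitting the same objective in linear vs. log space.
# 'objfun(pars)' is a hypothetical function returning the SSR for parameters
# on their natural (linear) scale; stats::optim stands in for the app's solver.
fit_linear <- optim(par = c(b = 1e-2, dV = 2),
                    fn  = objfun)

# Same problem, but the optimizer works on log10-transformed parameters and
# the objective back-transforms before evaluating the model.
objfun_log <- function(logpars) objfun(10^logpars)
fit_log <- optim(par = log10(c(b = 1e-2, dV = 2)),
                 fn  = objfun_log)
10^fit_log$par  # back-transform the estimates to the natural scale
```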

The fit is run multiple times. First, the model is fit to the actual data. Then, the model is fit again to each of a specified number of bootstrap samples generated by resampling the data.

Bootstrapping

The main new feature of this app is the inclusion of bootstrapping, a sampling procedure that allows one to obtain confidence intervals for the estimated parameters. The basic approach goes like this (a minimal code sketch follows the list):

  1. Resample your data (with replacement), to get a new dataset as large as the original one. For instance if you had 20 data points in the original dataset, you'll again have 20 data points. But now some of them might occur more than once, and others might be missing. This approach tries to mimic the idea that if you had done another experiment, you might have gotten slightly different data.
  2. Fit your model to this newly generated dataset you obtained through sampling. Record the best-fit estimates for your parameters.
  3. Repeat this resample-and-refit procedure (steps 1 and 2) many times (generally hundreds or more) to obtain distributions for your parameter estimates.
  4. From these distributions, extract measures of uncertainty, e.g., 95% confidence intervals.
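
The sketch below illustrates this resample-and-refit loop in a few lines of R. It assumes a hypothetical data frame `dat` of measurements and a hypothetical function `fit_model()` that returns best-fit values of _b_ and _d~V~_; it is not the code the app actually uses (which relies on the boot package, see the Further Information tab).

```r
# Minimal sketch of the resample-and-refit idea (not the app's actual code).
# 'dat' and 'fit_model()' are hypothetical placeholders.
set.seed(123)                              # seed so the resampling is reproducible
nsample <- 100                             # number of bootstrap samples
boot_estimates <- matrix(NA, nrow = nsample, ncol = 2,
                         dimnames = list(NULL, c("b", "dV")))
for (i in 1:nsample) {
  idx <- sample(nrow(dat), replace = TRUE)     # step 1: resample rows with replacement
  boot_estimates[i, ] <- fit_model(dat[idx, ]) # step 2: refit to the resampled data
}                                              # step 3: the loop repeats steps 1 and 2
# step 4: 95% percentile confidence intervals from the bootstrap distributions
apply(boot_estimates, 2, quantile, probs = c(0.025, 0.975))
```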

Notes

What To Do {#shinytab3}

The model is assumed to run in units of days.

#this is the running counter for the records which starts at 1 
rc=1

#empty object, will hold all outcomes
alloutcomes = NULL

#########################
# Task 1
#########################
tid = 1
tasktext = "Keep all inputs at their default values. Run the simulation. This will perform _nsample = 10_ fits on resampled data for _iter = 20_ iterations each (much too few for any real problem, but we'll keep it low to make things run reasonably quickly). Instead of a single best-fit estimate for the parameters, you now get 10 estimates from which one can compute uncertainty measures such as confidence intervals. You see the lower and upper bounds for a 95% confidence interval reported below the plot. For this example, you should find the best estimate (fitting to the actual data) for the parameter _b_ to be 0.017, while the bounds for this parameter, obtained through repeat-fitting to the bootstrap samples, are 0.011 and 0.018. 

\nSince bootstrapping involves random sampling, different random number seeds will lead to different results. First, confirm that if you keep the seed the same, the result will be the same by running the simulation again. Then, change the seed to _rngseed = 99_. The point estimate won't change, because it's again the same fit to the data, but the confidence intervals, which involve random sampling of the data, change somewhat. If the speed of your computer allows, increase the number of bootstrap samples (and iterations). Re-run with different random number seeds. As you use more samples, the impact of the random seed setting should diminish and eventually the results should be robust enough that you get essentially the same confidence intervals independent of random number seed.

\n(As a side note, some optimizers/solvers have a random component. In that case, setting a seed even if not resampling the data is important for reproducibility.)"
nrec = 1 # number of items to record
out_records = c("Upper bound for _b_ with _rngseed_ = 99 (round to 3 digits).")
out_types = rep("Rounded_Numeric",nrec)
out_notes = rep("Round to two significant digits",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 

#########################
# Task 2
#########################
tid = tid + 1
tasktext = "Fitting parameters in log space vs. linear space does not change the scientific assumptions, but it can sometimes help the optimizer perform faster/better. Let's explore this. 

\nReset all inputs and run the simulation. Then, set _parscale = 2_ and repeat. In this case, the results change somewhat. This is perhaps not surprising, given that we only take a few steps. In theory, if we ran the solver long enough to converge (and used enough samples), the results should be the same. However, this does not always happen. Sometimes a solver gets stuck when fitting parameters on one scale but not on another. This is often the case if a parameter varies over a wide range; in that case, fitting in log space is often advantageous. While we can't fully test this here (it would take too long), we can explore it a bit. 

\nReset all inputs. Then, set the number of iterations to 500 and the number of samples to 1 (i.e., no bootstrapping, just a single fit to the actual data). You should get an SSR of 2.53. Now, switch to _parscale = 2_ and redo the fit. You should find a somewhat higher SSR. Repeat with 1000 iterations (and, if your computer is fast enough, 2000). You should find that the SSR does not change. Thus, on the rescaled parameters the solver ended up at a different best fit, and the rescaling did in fact not help here. You can explore different starting values and see how that impacts the results. 

\nThe overall take-home message is that sometimes fitting parameters on their natural (linear) scale works best; other times, transforming them, e.g., to a log scale, improves results. Unfortunately, it is hard to know beforehand whether a transformation will help, so it is often best to try both ways." 

nrec = 1 # number of items to record
out_records = c("SSR for _parscale_ = 2 and 500 iterations (round to 2 digits).")
out_types = rep("Rounded_Numeric",nrec)
out_notes = rep("Round to two significant digits",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 

#########################
# Task 3
#########################
tid = tid + 1
tasktext = "Expore the app further. The main ideas you should get out of it are that bootstrapping the data is one way you can get confidence intervals (at the cost of longer run times), and that while fitting parameters on different scales does not change the scientific question, it can influence the performance of the optimizer."
nrec = 1 # number of items to record
out_records = c("Nothing")
out_types = rep("None",nrec)
out_notes = c("")
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 
#save the fully filled task table to a tsv file
alloutcomes$QuizID = paste0(packagename,"_",appsettings$appid)
alloutcomes$AppTitle = appsettings$apptitle
alloutcomes$AppID = appsettings$appid
#remove a few variables from the data frame
savedoutcomes <- dplyr::select(alloutcomes,QuizID,AppID,AppTitle,TaskID,TaskText,RecordID,Record,Type,Note)     
write.table(savedoutcomes, paste0(appsettings$appid,"_tasktable.tsv"), append = FALSE, sep = "\t", row.names = F, col.names = TRUE)
# Take all the text stored in the table and print the tasks and items to record
write_tasktext(alloutcomes)

Further Information {#shinytab4}

For this app, the underlying function running the simulation is called `r appsettings$simfunction`. That function repeatedly calls `r appsettings$underlying_function`.

This app (and all others) is structured such that the Shiny part (the graphical interface you are using) calls one or several underlying R functions, which run the simulation for the model of interest and return the results. You can call these functions directly, without going through the Shiny app. Use the `help()` command for more information on how to use them. If you go that route, you need to take the results returned by the function and produce useful output (such as a plot) yourself.

You can also download all simulator functions and modify them for your own purposes. Of course to modify these functions, you'll need to do some coding.

For examples of using the simulators directly and how to modify them, read the package vignette by typing `vignette('DSAIRM')` into the R console.

The bootstrapping routine is implemented with the `boot` package, which does the sampling and repeatedly calls the `nloptr` solver, which in turn runs the underlying simulation model. There are different flavors of bootstrapping, i.e., of data sampling. Some make assumptions about the distribution of the data (parametric bootstrap); others, such as block bootstrapping, are especially useful for time-series data. See, for instance, [@efron93] for more information.
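
As a rough sketch of how the `boot` package supports this kind of nonparametric case resampling (not the app's internal code; `dat` and `fit_model()` are hypothetical placeholders):

```r
library(boot)

# Rough sketch of case resampling with the boot package (not the app's
# internal code). 'dat' is a hypothetical data frame of measurements and
# 'fit_model()' a hypothetical function returning the best-fit value of b.
boot_stat <- function(data, indices) {
  fit_model(data[indices, ])        # refit the model to the resampled rows
}
boot_out <- boot(data = dat, statistic = boot_stat, R = 500)
boot.ci(boot_out, type = "perc")    # percentile 95% confidence interval
```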

References


