Basic Virus Model Fitting

#load various variable definitions that are the same for each app
source('startup_script.R')
sapply(files_to_source, source) #source some helper files defined in the files_to_source variable
currentrmdfile = knitr::current_input()  #get current file name
appsettings = get_settings(currentrmdfile,appdocdir,packagename) #get settings for current app

Overview {#shinytab1}

This app illustrates how to fit a mechanistic dynamical model to data and how to use simulated data to evaluate if it is possible to fit a specific model.

Learning Objectives

The Model {#shinytab2}

Model Overview

The underlying model that is being fit to the data is the Basic Virus Model used in the app of that name (and some of the others). See that app for a detailed description of the model.

Model Diagram

The model is represented by this flow diagram:

knitr::include_graphics(path = paste0("../media/",appsettings$modelfigname))

Model Equations

The model is implemented as a set of ordinary differential equations:

$$
\begin{aligned}
\dot U & = n - d_U U - bUV \\
\dot I & = bUV - d_I I \\
\dot V & = pI - d_V V - gbUV
\end{aligned}
$$

Model Concepts

Data

For this app, viral load data from patients infected with influenza is being fit. The data is average log viral titer on days 1-8 post infection. The data comes from [@hayden96], specifically the 'no treatment' group shown in Figure 2 of this paper.

Our simulation acts as another source of 'data' by producing artificial data.

Fitting Models

This app fits the virus kinetics produced by the model simulation to the log viral titer data. The fit is evaluated by computing the sum of square errors between data and model for all data points, i.e., $$ SSR= \sum_t (Vm_t - Vd_t)^2 $$ where $Vm_t$ is the virus load (in log units) predicted from the model simulation at days $t=1\cdots8$ and $Vd_t$ is the data reported in those same units (log10) at those same time points. The underlying code varies model parameters to try to get the predicted viral load from the model as close as possible to the data by minimizing the SSR. The app reports the final SSR for the fit.
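
Written out in R, the SSR computation is just one line. The vectors below are made-up illustrations, not the actual data from the paper:

#hypothetical log10 virus load values for days 1-8 (NOT the actual data)
Vd = c(3.0, 4.5, 5.1, 4.8, 4.0, 3.2, 2.5, 1.8) #data
Vm = c(2.8, 4.7, 5.0, 5.0, 3.9, 3.0, 2.3, 1.6) #model predictions
SSR = sum((Vm - Vd)^2)
SSR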

In general, with enough data, one could fit/estimate every parameter in the model and the initial conditions. However, with just the virus load data available, the data are not rich enough to allow estimation of all model parameters (even for a model as simple as this). The app is therefore implemented by assuming that most model parameters are known and fixed, and only 3 (the rate of virus production, p, the rate of infection of cells, b, and the rate of virus death/removal, d~V~) are estimated. In general, the choice of which parameters to fix needs to be based on what makes scientific sense. If you have good estimates for some parameters from outside sources, you can justify fixing them. If not, then you might need to get more data or simplify your model until you can estimate all parameters you want/need to estimate. See the tasks for more on that.

Computer Routines for Fitting

A computer routine does the minimization of the sum of squares. Many such routines, generally referred to as optimizers, exist. For simple problems, e.g., fitting a linear regression model to data, any of the standard routines work fine. For the kind of minimization problem we face here, which involves a differential equation, it often matters which numerical optimizer routine one uses. R has several packages for that purpose. In this app, we make use of the optimizer algorithms called COBYLA, Nelder-Mead, and Subplex from the nloptr package. This package provides access to a large number of optimizers and is a good choice for many optimization/fitting tasks. For more information, see the help files for the nloptr package and especially the nlopt website.
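
To make this concrete, here is a hedged sketch of what such a fit could look like outside the app, using deSolve for the ODE and nloptr for the optimization. The function and variable names (virus_ode, ssr_fn, fitdata, etc.) are illustrative assumptions, not the actual DSAIRM internals, and the data values are made up:

library(deSolve)
library(nloptr)

#ODE right-hand side for the basic virus model shown above
virus_ode = function(t, y, parms) {
  with(as.list(c(y, parms)), {
    dUdt = n - dU*U - b*U*V
    dIdt = b*U*V - dI*I
    dVdt = p*I - dV*V - g*b*U*V
    list(c(dUdt, dIdt, dVdt))
  })
}

#sum of squared residuals on the log10 scale; pars holds the 3 fitted parameters p, b, dV
ssr_fn = function(pars, fixedpars, y0, fitdata) {
  parms = c(p = pars[1], b = pars[2], dV = pars[3], fixedpars)
  out = ode(y = y0, times = c(0, fitdata$time), func = virus_ode, parms = parms)
  Vm = log10(pmax(out[-1, "V"], 1e-10)) #model prediction; guard against log(0)
  return(sum((Vm - fitdata$logV)^2))
}

#made-up data for illustration, NOT the values from the paper
fitdata = data.frame(time = 1:8, logV = c(3.0, 4.5, 5.1, 4.8, 4.0, 3.2, 2.5, 1.8))
y0 = c(U = 1e6, I = 0, V = 1) #initial conditions as in task 1
fixedpars = c(n = 0, dU = 0, dI = 2, g = 0) #dI = 2/day is a 12-hour infected cell lifespan

fit = nloptr(x0 = c(1e-3, 0.1, 1), eval_f = ssr_fn,
             lb = c(1e-4, 1e-3, 1e-2), ub = c(100, 10, 100),
             opts = list(algorithm = "NLOPT_LN_SBPLX", maxeval = 500),
             fixedpars = fixedpars, y0 = y0, fitdata = fitdata)
fit$solution  #best-fit values for p, b, dV
fit$objective #final SSR

Swapping "NLOPT_LN_SBPLX" for "NLOPT_LN_COBYLA" or "NLOPT_LN_NELDERMEAD" switches to the other two optimizers mentioned above.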

Notes

What To Do {#shinytab3}

The model is assumed to run in units of days.

#this is the running counter for the records which starts at 1 
rc=1

#empty object, will hold all outcomes
alloutcomes = NULL

#########################
# Task 1
#########################
tid = 1
tasktext = "Start with 1e6 uninfected cells, no infected cells, 1 virion (assumed to be in the same units of the data, TCID50/ml), no uninfected cell birth and deaths, lifespan of infected cells 12 hours (make sure to convert to a rate), and unit conversion factor to 0. Set virus production rate to 0.001 with lower and upper bounds of 0.0001 and 100; _psim_ can be anything for now. Set the infection rate to _b_ = 0.1 with lower/upper bounds of 0.001/10; _bsim_ can be anything. Set the virus decay rate to 1 with lower/upper bounds of 0.01/100; _dVsim_ can be anything.

The parameters *p*, *b* and *d~V~* are being fit; the values we specify here are the starting values for the optimizer, and the upper and lower bounds are the limits within which the parameters must remain. Note that each starting value must lie between its lower and upper bound (inclusive); otherwise, you will get an error message when you try to run the model. We can ignore the values for simulated data for now; to do so, set _usesimdata_ to 0. This also means the _noise_ variable is ignored, i.e., you can set it to anything. 

\nStart with a maximum of 1 iteration/fitting step for the optimizer and _solvertype_ = 1. Choose to plot the y-axis on a log scale using either ggplot or plotly. Run the simulation. 

\nSince you only do a single iteration, nothing is really optimized. We are just doing this so you can see the time-series produced with the starting conditions we picked for the parameters. Notice that the virus load predicted by the model and the data are already fairly close. Look at the SSR value reported underneath the plot. It should be 3.25. As we fit, if things work ok, this value will go down, indicating an improved fit.

\nNow choose _iter_ = 20, i.e., fit for 20 iterations. Run the simulation. Look at the results. The plot shows the final fit. The model-predicted virus curve will be closer to the data. Record the SSR value; it should have gone down, indicating a better fit. Also, printed below the figure are the values of the fitted parameters at the end of the fitting process. They should differ from the values you started with."
nrec = 1 # number of items to record
out_records = c("SSR after 20 interations (round to 2 digits).")
out_types = rep("Rounded_Numeric",nrec)
out_notes = rep("Round to two significant digits",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 
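
As an aside, the lifespan-to-rate conversion asked for in task 1 is a one-line computation (remember the model runs in units of days); this is just an illustration, not part of the app code:

#a 12-hour infected cell lifespan, converted to a death rate in units of 1/day
lifespan_days = 12/24
dI_rate = 1/lifespan_days #equals 2 per day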

#########################
# Task 2
#########################
tid = tid + 1
tasktext = "Repeat the same process, now fitting for 40 iterations. You should see some more improvement in the SSR, meaning a further improvement in fit. A fitting step or iteration is essentially a try of the underlying code to find the best possible model. Increasing the tries/iterations usually improves the fit until the solver reached a point where it can't improve further. This is the best fit, though see the next tasks for a big caveat. In practice, one should not specify a fixed number of iterations; that is just done here so things run reasonably fast. Instead, one should ask the solver to run as long as it takes until it can't find a way to further improve the fit (in our case can't further reduce the SSR). The technical expression for this is that the solver has converged to the solution. In this example, it happens quickly. Keep increasing iterations until you find no further reduction in SSR."
nrec = 1 # number of items to record
out_records = c("SSR after 40 interations (round to 2 digits).")
out_types = rep("Rounded_Numeric",nrec)
out_notes = rep("Round to two significant digits",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 

#########################
# Task 3
#########################
tid = tid + 1
tasktext = "The goal is to find the best fit, the one with the lowest SSR posssible (for a given model and data set). While in theory, there is always one (or maybe more than one) best fit, for the kind of models we are fitting here, it is often hard to numerically find this best fit (that's a big difference compared to many standard statistical models where the numeric routine is fairly straightforward). 

\nTo explore this, start with the same input values as for task 1, run the optimizer for 40 steps, but now use solver/optimizer type 2. Then, repeat for type 3. You should find that after 40 steps, solver 2 ends up at an SSR above that of solver 1, while solver 3 finds a value below what solver 1 is able to reach."
nrec = 1 # number of items to record
out_records = c("SSR after 40 interations, solver type 3 (round to 2 digits).")
out_types = rep("Rounded_Numeric",nrec)
out_notes = rep("Round to two significant digits",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 
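
If you want to reproduce the flavor of this task outside the app, the following self-contained sketch compares the three nloptr algorithms on a toy objective function (not the virus model); the specific numbers you get will differ from the app's:

library(nloptr)
#toy objective (the Rosenbrock function), used only to illustrate optimizer behavior
toyfun = function(x) { (1 - x[1])^2 + 100*(x[2] - x[1]^2)^2 }
algs = c("NLOPT_LN_COBYLA", "NLOPT_LN_NELDERMEAD", "NLOPT_LN_SBPLX")
for (a in algs) {
  fit = nloptr(x0 = c(-1, 1), eval_f = toyfun, lb = c(-5, -5), ub = c(5, 5),
               opts = list(algorithm = a, maxeval = 40))
  cat(a, ": objective after 40 evaluations =", fit$objective, "\n")
}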

#########################
# Task 4
#########################
tid = tid + 1
tasktext = "We just saw that after a few iterations, different numerical solvers/optimizers obtain different results. That's maybe not surprising and we would not care much about that (apart from trying to find the fastest one), as long as they can all find the same best fit given enough time/iterations. Unfortunately, this doesn't always happen. Let's, explore this. Run each solver for 1000 iterations, then 2000. Depending on the speed of your computer, this might take a while, so be patient. You should find that each solver produces the same SSR for 1000 and 2000 steps, so each solver has converged (reached what it thinks is the overall best fit). Unfortunately, the solvers don't agree on what the best fit is. In this example, solver 2 performs best, with an SSR slightly below 1. This illustrates the challenge when fitting the kind of models we have here, namely that we often have to explore different solvers (and starting conditions, we'll get there) to be reasonably sure we can find the overall best fit. This general problem gets worse the more parameters you try to fit."
nrec = 1 # number of items to record
out_records = c("SSR after 1000 interations, solver type 2 (round to 2 digits).")
out_types = rep("Rounded_Numeric",nrec)
out_notes = rep("Round to two significant digits",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 


#########################
# Task 5
#########################
tid = tid + 1
tasktext = "To explore the role that different starting values for the estimated parameters can have, set _p_=0.01, _b_=0.01, and _d~V~_=10. Run a single iteration with solver 1. SSR should be 3.54, similar but not the same as for the start of task 1. 

\nNow, run each of the 3 solvers for 1000 iterations. You'll find that all of them are able to find better fits than previously. Note that the scientific question has not changed: the model and the data are the same, and we are still trying to find the combination of parameter values that gives the best agreement between model and data. All we have changed is the starting position from which the optimizers search for the best fit. For many basic statistical models, you can start anywhere and always arrive at the same answer. This is not the case here. For the kind of fitting problem we usually encounter with our simulation models, optimizers might not find the overall best fit, even if you run them until they converge. What can happen is that a solver converges to a local optimum: a combination of parameter values that fits better than all nearby combinations, so every small change the solver tries makes the fit worse, and it concludes it has found the best fit. Unfortunately, a completely different combination of parameters might exist that gives an even better fit, but the solver can't find that solution. It is stuck in a local optimum. Many solvers, even so-called 'global' solvers, can get stuck. Unfortunately, we never know for certain if we found the best fit, and there is no one solver type that is best for all problems. So, in practice, what one needs to do is try different starting values and different solver/optimizer routines, and if many of them find a best fit with the lowest SSR, it's quite likely (though not guaranteed) that we found the overall best fit (lowest SSR)."
nrec = 1 # number of items to record
out_records = c("SSR after 1000 interations, solver type 2 (round to 2 digits).")
out_types = rep("Rounded_Numeric",nrec)
out_notes = rep("Round to two significant digits",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 
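
The standard safeguard against local optima is a multistart approach: run the optimizer from several starting points and compare the final objective values. A minimal sketch, again with a toy objective rather than the virus model:

library(nloptr)
set.seed(42)
toyfun = function(x) { (1 - x[1])^2 + 100*(x[2] - x[1]^2)^2 }
starts = replicate(5, runif(2, -2, 2), simplify = FALSE) #5 random starting points
fits = lapply(starts, function(x0) {
  nloptr(x0 = x0, eval_f = toyfun, lb = c(-5, -5), ub = c(5, 5),
         opts = list(algorithm = "NLOPT_LN_SBPLX", maxeval = 1000))
})
sapply(fits, function(f) f$objective) #similar values across starts suggest a robust optimum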




#########################
# Task 6
#########################
tid = tid + 1
tasktext = "One major consideration when fitting these kind of mechanistic models to data is the balance between data availability and model complexity. The more and richer data one has (e.g., measurements for multiple model variables) the more parameters one can relibably estimate and therefore the more detailed a model can be. If one tries to ask too much from the data, it leads to the problem of overfitting, i.e., trying to estimate more parameters than can be robustly estimated for a given dataset (this is also known as the identifiability problem). This can show itself in the problems we've seen above, the solvers getting stuck in various local optimums, or finding more or less the same SSR for large ranges of parameters. While there are mathematical ways to test if a model can be estimated given the data, this only works for simple problems and doesn't consider the fact that most data are noisy. The most general, and usually best approach to try and safeguard against overfitting is to test if our model can in principle recover estimates in a scenario where parameter values are known. To do so, we can use our model with specific parameter values and simulate data. We can then fit the model to these simulated data. If everything works, we expect that, ideally independent of the starting values for our solver, we end up with estimated best-fit parameter values that agree with the ones we used to simulate the artificial data. We'll try this now with the app.

Set everything as in task 1 (i.e., hit the _Reset Inputs_ button). Then set _psim_ = 0.001, _bsim_ = 0.1, and _dVsim_ = 1, i.e., to the same values as the values used for starting the fitting routine (_p_/_b_/_dV_). Set iteration to 1, solver type 1, y-axis on log scale. Run the simulation. It should be the same as in task 1. Take a close look at the data in the plot. Then set _usesimdata_ = 1, run again for 1 fitting step. You should now see that the data has changed. Instead of the real data, we now use simulated data. Since the parameter values for the simulated data and the starting values for the fitting routine are the same, the time-series is on top of the data and the SSR is (up to rounding errors) 0."

nrec = 1 # number of items to record
out_records = c("TRUE or FALSE: If you set the starting values for your model fitting routine to those used to create the data, you'll get a perfect fit.")
out_types = rep("Logical",nrec)
out_notes = rep("Report TRUE or FALSE",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 




#########################
# Task 7
#########################
tid = tid + 1
tasktext = "Now we can test if we have a problem with overfitting (trying to estimate too many parameters given the data) by choosing starting values for our fitting routine that are different from those we used to produce the simulated data. We hope that after enough iterations, the solver will reach a best fit with SSR close to 0 and estimates for the parameters that agree with the ones we used to produce the data. 

\nSet _p_ = 0.01, _b_ = 0.01, _d~V~_ = 2. Don't change the _psim_/_bsim_/_dVsim_. Run simulation for 1 iteration. You'll see, as expected, a mismatch between model and simulated data with an SSR = 2.2. 

\nNow, run each solver for 500 iterations. You should find that solver 2 finds a best fit with an SSR of approximately 0 and estimated parameters that are those we used to generate the data. Solvers 1 and 3 can't quite find it. (You can try more iterations, but it seems those solvers are stuck. They might find the best fit with different starting conditions.) This means that in this case, we seem to be able to estimate all parameters (as long as we play around with different solvers and starting conditions). You can play around with the values for the data simulation parameters and different starting values for the fitting routine. You'll likely find that sometimes your solver can accurately determine the parameter values, sometimes not. If it works for at least some starting values and some solvers, it means you should be able to estimate your model parameters, with the caveats described above, i.e., it might require a good bit of testing/searching.

\nSince real data are always noisy, there is an option to add noise to your simulated data by choosing a non-zero value for _noise_. Explore a bit what happens if you do that. The more noise you add, the more disagreement you will get between the best-fit estimates for the parameter values and the values you used to simulate the data. This is to be expected, since your data now does not come straight from the simulation but instead gets modified by noise."

nrec = 1 # number of items to record
out_records = c("SSR after 500 interations, solver type 1 (round to 2 digits).")
out_types = rep("Rounded_Numeric",nrec)
out_notes = rep("Round to two significant digits",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 
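
For reference, adding noise to simulated data usually amounts to something like the following; the variable names and the normal noise model are assumptions for illustration, not necessarily what the app does internally:

set.seed(123)
#hypothetical noise-free model-predicted log10 virus load for days 1-8
logV_clean = c(3.0, 4.5, 5.1, 4.8, 4.0, 3.2, 2.5, 1.8)
noiselevel = 0.5 #standard deviation of the added noise
logV_noisy = logV_clean + rnorm(length(logV_clean), mean = 0, sd = noiselevel)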


#########################
# Task 8
#########################
tid = tid + 1
tasktext = "So far, we haven't talked much about the parameter bounds. In theory, you shouldn't need any bounds. If your model is a good one, you just start with some parameter values and your optimizer will find the best fit. As you already saw, in practice it's not that easy. If you don't give the optimizer any guidance as to what parameter values are reasonable, it might try all kinds of unrealistic values. This can slow down the fitting, it can make it more likely you get stuck in a local optimum, or your underlying simulation function fails because the optimizer is asking it to run the model for nonsensical values. All of these problems suggest that giving bounds for parameters is a good idea. Let's explore the impact of bounds briefly. 

\nReset all inputs (that means we are switching back to the real data, i.e., _usesimdata_ = 0). Then set _p = b = d~V~ = 0.1_ as starting values. Run a single iteration to see that with these starting values, the model is far from the data (SSR = 91). Then run 1000 iterations of solver 2; you should find an okay fit with an SSR = 4.12. 

\nNow, change lower bounds to 1e-6 and upper bounds to 1e6 for all 3 parameters. Re-run the fitting. You'll find a much poorer fit with unreasonable parameter estimates. This is an example of how bounds help find the best fit. You can explore the impact of bounds further by changing their values, the starting values, the number of iterations and solver type. You'll find that sometimes wide bounds are okay and don't impact the fitting much, but, other times, things don't work with wide bounds.

\nOne problem can arise when the best-fit value reported from the optimizer is the same as the lower or upper bound for that parameter. This likely means that if you widen the bounds, the fit will get better. However, the parameters have biological meanings, and certain values do not make sense. For instance, a lower bound for the virus decay rate of 0.001/day would mean an average virus lifespan of 1000 days, or around 3 years, which is not reasonable for flu *in vivo*. That means if your best fit happens at the bounds, or in general at parameter values that make no biological sense, then your underlying model needs to be modified."

nrec = 1 # number of items to record
out_records = c("SSR after 1000 interations, solver 2, wide bounds (round to 2 digits).")
out_types = rep("Rounded_Numeric",nrec)
out_notes = rep("Round to two significant digits",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 


#########################
# Task 9
#########################
tid = tid + 1
tasktext = "The following point is not directly related to the whole fitting topic but worth addressing: without much comment, I asked you to set the unit conversion factor to 0. That essentially means that we think this process of virions being lost due to entering infected cells is negligible compared to clearance of virus due to other mechanisms at rate _d~V~_. 

While that unit conversion factor shows up in most apps, it is arguably not that important if we explore our model without trying to fit it to data. But here, for fitting purposes, it might be important. The experimental units are TCID50/mL, so in our model, virus load needs to have the same units. To make all units work, _g_ needs to have the units that convert from infectious virions at the site of infection to the experimental units. Unfortunately, how one relates to the other is not quite clear. (See [@handel07] for a discussion of that.) If you plan to fit models to data you collected, you need to pay attention to units and make sure what you simulate and the data you have are in agreement. In general, we would probably want to fit _g_ as well. Here, we briefly explore how things change if it's not 0. Reset all inputs, then set _g=1_. Run a single iteration. You'll find a very poor fit (SSR=786). 

Run solver 2 for 1000 iterations. You'll find something that looks better, but not great. Play around with the starting values for the fitted parameters (_p_/_b_/_dV_) to see if you can get an ok looking starting simulation. This is often required when you are trying to fit, since if you start with parameter values that place the model very far away from the data, the solvers might never be able to get close. Once you have a somewhat decent starting simulation, try the different solvers for different iterations and see if you can improve. One useful approach for fitting is to run for some iterations and use the reported best-fit values as new starting conditions, then do another fit with the same or a different solver. The best fit I was able to find was an SSR of a little bit above 4. You might be able to find something better. Since the fit is not good, it suggests the rest of the model (either the model structure, or choices for fixed parameters) is not good and requires tweaking."

nrec = 1 # number of items to record
out_records = c("SSR after 1000 interations, solver type 2 (round to 2 digits).")
out_types = rep("Rounded_Numeric",nrec)
out_notes = rep("Round to two significant digits",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 




#########################
# Task 10
#########################
tid = tid + 1
tasktext = "Keep exploring. Fitting these kind of models can be tricky at times, and you might find strange behavior in this app that you don't expect. Try to get to the bottom of what might be going on. This is an open-ended exploration, so I can't really give you much guidance, other than to re-emphasize that fitting mechanistic simulation models is much trickier than fitting more standard (e.g. generalized linear) models. Just try different things, try to understand as much as possible of what you observe."
nrec = 1 # number of items to record
out_records = c("Nothing")
out_types = rep("None",nrec)
out_notes = c("")
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 
#save the fully filled task table to a tsv file
alloutcomes$QuizID = paste0(packagename,"_",appsettings$appid)
alloutcomes$AppTitle = appsettings$apptitle
alloutcomes$AppID = appsettings$appid
#remove a few variables from the data frame
savedoutcomes <- dplyr::select(alloutcomes,QuizID,AppID,AppTitle,TaskID,TaskText,RecordID,Record,Type,Note)     
write.table(savedoutcomes, paste0(appsettings$appid,"_tasktable.tsv"), append = FALSE, sep = "\t", row.names = F, col.names = TRUE)
# Take all the text stored in the table and print the tasks and items to record
write_tasktext(alloutcomes)

Further Information {#shinytab4}

For this app, the underlying function running the simulation is called `r appsettings$simfunction`. That function repeatedly calls `r appsettings$underlying_function`.

This app (and all others) is structured such that the Shiny part (the graphical interface you are using) calls one or several underlying R functions, which run the simulation for the model of interest and return the results. You can call these functions directly, without going through the Shiny app. Use the help() command for more information on how to use them. If you go that route, you need to take the results returned from the function and produce useful output (such as a plot) yourself.
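
For instance, a session could start like the following; the function name simulate_basicvirus_fit is an assumption here, so substitute whichever name the app reports:

library(DSAIRM)
ls('package:DSAIRM') #list all functions the package exports
help('simulate_basicvirus_fit') #assumed name; use the one reported by the app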

You can also download all simulator functions and modify them for your own purposes. Of course to modify these functions, you'll need to do some coding.

For examples on using the simulators directly and how to modify them, read the package vignette by typing vignette('DSAIRM') into the R console.

A good source for fitting models in R is [@bolker08]; note, though, that the focus is on ecological data, and ODE-type models are barely (if at all) discussed. The book [@hilborn97] has nice explanations of data fitting, model comparison, etc., but is more theoretical. Lots of good online material exists on fitting/inference. Most of it is explained in the context of static, non-mechanistic, statistical or machine learning models, but a lot of the principles apply equally to ODEs. A discussion of overfitting (also called the 'identifiability problem') for ODEs can be found in [@miao11a]. Advanced functionality to fit stochastic models can be found in the pomp package in R. (If you don't know what stochastic models are, check out the stochastic apps in DSAIRM.)

References


