Uncertainty and Sensitivity Analysis

#load various variable definitions that are the same for each app
source('startup_script.R')
currentrmdfile = knitr::current_input() 
appsettings = get_settings(currentrmdfile,appdocdir,packagename)

Overview {#shinytab1}

This app allows exploration of the concept of uncertainty and sensitivity analysis. For this purpose, we use the SIR model with demographics (also used in the stochastic SIR app and model exploration app).

For this app, it is assumed that you've worked through all the ones in the Basics and Reproductive Number sections, as well as the Model Exploration app.

Learning Objectives

The Model {#shinytab2}

Model Overview

The model used here is the SIR model with births and deaths. It is also used and described in the stochastic SIR app. This model tracks susceptibles, infected/infectious and recovered hosts. The following compartments are included:

The included processes/mechanisms are the following:

Model Implementation

The flow diagram for the model implemented in this app is:

knitr::include_graphics(here::here('inst/media',appsettings$modelfigname))

The deterministic model implemented as set of differential equations is given by the following equations:

$$ \begin{aligned} \dot S & = n - bSI - mS + wR\ \dot I & = bSI - gI - mI \ \dot R & = gI - mR - wR \end{aligned} $$

This is almost the same model as the basic SIR model from the introductory app, with the only difference that this model also allows natural births and deaths.

Uncertainty and Sensitivity analysis

Often, for a given system we want to model, we only have rough estimates for the model parameters and starting values. Instead of specifying fixed values (which results in a single time-series), we can instead specify parameter ranges, choose sets of parameter values from these ranges, and run the model for multiple sets of parameters.

The simplest way of specifying parameter ranges is to set an upper and lower bound (based on what we know about the biology of the system) and randomly choose any value within those bounds. We can almost always set bounds even if we know very little about a system. Assume we want to model the duration of the infectious period for some disease in humans. We might not little, but we can still be fairly confident that it's longer than say 1 hour and less than 100 years. That's of course a wide range and we should and usually can narrow ranges further, based on biological knowledge of a given system.

If we are fairly certain that values are close to some quantity, instead of specifying a uniform distribution, we can choose one that is more peaked around the most likely value. Normal distributions are not ideal since they allow negative values, which doesn't make sense for our parameters. The gamma distribution is a better idea, since it leads to only positive values.

To run the model for this app, we need to specify values for the initial conditions and model parameters. Initial conditions and all parameters are sampled uniformly between the specified upper and lower bound, apart from the recovery rate, which is given by a gamma distribution, with user-specified mean and variance. For this teaching app, there is no biological reason for making this parameter different, I just picked one parameter and decided to make it non-uniformly distributed to show you different ways one can implement distributions from which to draw parameter samples.

The way the samples are drawn could be done completely randomly, but that would lead to inefficient sampling. A smarter method exists, known as Latin Hypercube sampling (LHS). It essentially ensures that we sample the full range of possible parameter combinations in an efficient manner. For more technical details, see e.g. [@saltelli2004]. For this app, we use LHS.

Once we specify the ranges for each parameter, the sampling method, and the number of samples, the simulation draws that many samples, runs the model for each sample, and records outcomes of interest. While the underlying simulation returns a time-series for each sample, we are usually not interested in the full time-series. Instead, we are interested in some summary quantity. For instance in this model, we might be interested in the maximum and final number of infected and final number of susceptible. This app records and reports those 3 quantities as I~peak~, I~final~ and S~final~.

Results from such simulations for multiple samples can be analyzed in different ways. The most basic one, called uncertainty analysis only asks what level of uncertainty we have in our outcomes of interest, given the amount of uncertainty in our model parameter values. This can be graphically represented with a boxplot, and is one of the plot options for this app.

In a next step, we can ask 'how sensitive is the outcome(s) of interest to variation in specific parameters' - that part is the sensitivity analysis. When you run the simulations, you essentially do both uncertainty and sensitivity analysis at the same time, it's just a question of how you further process the results. We can graphically inspect the relation between outcome and some parameter with scatterplots. If we find that there is a monotone up or down (or neither) trend between parameter and outcome, we can also summarize the finding using a correlation coefficient. For this type of analysis, using the Spearman rank correlation coefficient is useful, which is what the app produces below the figures.

A note on randomness in computer simulations

This simulation (as well as some of the others) involves sampling. This leads to some level of randomness. In science, we want to be as reproducible as possible. Fortunately, random numbers on a computer are not completely random, but can be reproduced. In practice, this is done by specifying a random number seed, in essence a starting position for the algorithm to produce pseudo-random numbers. As long as the seed is the same, the code should produce the same pseudo-random numbers each time, thus ensuring reproducibility.

What to do {#shinytab3}

First, familiarize yourself with the setup of the app, it looks different from most others. Parameters are not set to specific values. Instead, most parameters have a lower and upper bound. For each simulation that is run, random values for the parameter are chosen uniformly between those bounds. The parameter g does not have a uniform but instead a gamma distribution, you can specify its mean and variance to determine the distribution from which values are sampled. In general, you can choose distributions for parameters based on what makes biological sense. Practically, there is often very little difference between a uniform and some other distribution, I'm showing a different one for g here so you can see that in principle you can pick whatever you want (as long as it makes sense, e.g. you likely don't want a distribution that can produce negative parameter values.)

For the purpose of uncertainty and sensitivity analysis, starting values for variables can be treated like parameters. For this app you can vary the starting values for susceptibles and infected, the initial number of recovered are fixed at 0. For the tasks below, we set lower and upper bounds for the starting values to be the same, meaning we don't let those values vary.

The default outcome plots are boxplots, which show the distribution of the 3 outcomes of interest for the different parameter samples. You can set the number of samples you want to run. Samples are constructed using the latin hypercube sampling (LHS) method to efficiently span the space of possible parameter values. In general, more samples are better, but of course take longer to run.

For the tasks, we assume the model is run in units of years.

#this is the running counter for the records which starts at 1 
rc=1

#empty object, will hold all outcomes
alloutcomes = NULL

#########################
# Task 1
#########################
tid = 1
tasktext = "Start with the pre-set input values. Since sampling of parameters involves uncertainty/randomness, we need to use random numbers. We still want results to be reproducible. That's where the random number seed comes in. As long as the seed is the same, the code should produce the same (pseudo)-random numbers each time, thus ensuring reproducibility. Let's explore this. Leave all settings as they are, run the simulation twice with the same random number seed (value of 100), check to make sure you get exactly the same result twice. Now change the random number seed to 102, run again. You should see the results changed. 

If you have more samples, the results generally become more robust to changes in the underlying sample generation (determined by the random number seed). Try checking this by repeating simulations for the two random seeds, but now run 50 samples for each seed (this may take a while, depending on the speed of your computer).

Pay attention to variability in the different quantities reported below the plot. You should see less variability in the central quantities (mean, median) for the larger sample size, extremes (min/max) will continue to fluctuate more.

In general, the amount by which you change the random seed doesn't matter, the impact it has on outcomes is random. It just so happens that 101 gives a maybe confusing result for 50 samples - try if you want.  

Note that each sample means one simulation of the underlying dynamical model, so as sample numbers increase, things slow down. Also note the _system might not have reached steady state_ message. We are ignoring it for now, but if for too many of the samples steady state has not been reached, the results for _S~final~_ and _I~final~_ do not reflect steady-state values. Increasing the simulation time can help the system reach a steady state (if there is one). For some parameter combinations, that can take very long."

nrec = 4 # number of items to record
out_records = c("Mean number of peak infected, samples = 5, rngseed = 100",
                "Mean number of peak infected, samples = 5, rngseed = 102",
                "Mean number of peak infected, samples = 50, rngseed = 100",
                "Mean number of peak infected, samples = 50, rngseed = 102")
out_types = rep("Rounded_Integer",nrec)
out_notes = rep("Report the rounded integer",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 


#########################
# Task 2
#########################
tid = tid + 1
tasktext = "Recall the underlying SIR model and its behavior. If you don't remember, revisit the apps mentioned in the overview tab that discuss this model. Use your understanding of the model to predict what happens to the plots and the outcomes reported under the plots if you increase both lower and upper bound for the infection rate. Increase lower/upper bounds by a factor of 2. This means you are now sampling b in the range of 0.01 - 0.02. From previous apps, you know that this also implies that you increased R0. Thus, we expect larger outbreaks. Use 50 samples and rngseed=100. Run the simulation with the new ranges of b, see how results change. Now go the opposite way and lower the initial lower/upper bounds by a factor of 2, such that the range is 0.0025 - 0.005. Run the simulation, see how results change. As expected, a lower transmission rate and thus a lower R0 leads to smaller outbreaks."

nrec = 2 # number of items to record
out_records = c("Mean number of peak infected, increased transmission",
            "Mean number of peak infected, decreased transmission")
out_types = rep("Rounded_Integer",nrec)
out_notes = rep("Report the rounded integer",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 


#########################
# Task 3
#########################
tid = tid + 1
tasktext = "This model includes births, deaths and waning immunity. So far, they are turned off since both lower and upper bounds are set to 0. Now let's turn on births and deaths. The presence of births and deaths means the underlying time-series (which we can't see, but could look at in one of the earlier apps), produces an initial large outbreak, followed by a few smaller outbreaks and then settles down to a steady state. Reset everything to the initial values. Then set the range for the birth rate _n_ to 50 - 100, the range for death rate _m_ to 0.05 - 0.1. (If you are paying attention, you might have noticed that the range for the death rate means a natural life-span of 10 - 20 years. That's of course too short for humans. It's chosen here to make the model settle down in a reasonable time-frame. You can think of maybe modeling an animal disease in a host that doesn't live as long.)

Let's explore how the steady state is impacted if we change birth and death rates. Run the model with the specified settings for 50 simulations with random seed at 100. Look at the values at final steady state (and record the one requested).

Now, if we increase births or deaths, which of the outcomes do you expect to change, and in which direction? Test your assumption by increasing the bounds for the birth rate such that the range is now 100 - 200. Re-run the simulation. Then set the birth rate back to the range 50-100 and now increase the bounds for the death rate such that the range is now 0.1 - 0.2. Repeat the simulations. Finally, have both birth and death rates at the increased ranges of 50-100 and 0.1-0.2 respectively, run simulations again. 

You should see that a change in birth rate has essentially no impact on the final steady state of S, while an increase in the death rate leads to an increase in _S~final~_. If you are not sure why you are getting these results, remember what you did in the _ID Patterns_ app (or revisit that app). There you found an equation for the steady state for _S_ for this type of model, given by S=(g + m)/b. This shows that changes in birth rate do not affect the steady state, while changes in death rate do."

nrec = 4 # number of items to record
out_records = c("Mean number of susceptible at end of simulation, baseline",
                "Mean number of susceptible at end of simulation, increased birth",
            "Mean number of susceptible at end of simulation, increased death",
            "Mean number of susceptible at end of simulation, increased birth and death")
out_types = rep("Rounded_Integer",nrec)
out_notes = rep("Report the rounded integer",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 



#########################
# Task 4
#########################
tid = tid + 1
tasktext = "Let's continue with the exploration of the last task, to illustrate an important point. Reset all values, then set ranges for birth rate to 10 - 20 and death rate range 0.1 - 0.2. Run 50 simulations with random seed at 100. Look at the values at final steady state (and record the one requested).

Now increase the range for the birth rate to 20 - 40. Based on the last task, the final number of susceptible should not change. This is not what you will see, instead the value goes up. Can you figure out why this is the case? As a hint, it has to do with the reproductive number. 

Here is a detailed explanation: Remember (from earlier apps) that the reproductive number for this model is R~0~ = b * S~0~/(g+m). You also learned previously that for this kind of model, in the presence of births and deaths, the steady value for S~0~ is S~0~ = n/m. Combining those 2 equations, you get R~0~ = b * n/((g+m) * m). We can explore the range of possible R~0~ values by using the minimum and maximum of the ranges (we'll just use g=0.5 since the variation is fairly small). Plugging in numbers for the baseline gives a range for R~0~ from 0.005 * 10/((0.5+0.2) * 0.2) to 0.01 * 20/((0.5+0.1) * 0.1), which is rougly 0.35 to 3.3. Thus a good number of samples will have R~0~ < 1 and thus the equation for a steady state in the presence of the disease (i.e. an endemic situation) does not apply. As you increase the birth rate, more values move into the R~0~ > 1 range. The computed average mixes both the R~0~ > 1 samples, for which the steady state equation applies, and those with R0<1, for which that equation does not apply. Therefore, the overall results don't quite make sense. (You can check that in the previous task, parameter ranges were such that all R~0~ values were larger than 1.)

This is an important point. When you sample your parameters, you want to make sure that it doesn't put the model into completely different regimes, otherwise you are combining 2 very different situations and the combined results are hard to interpret. To guard against this, it is often good to sample ranges of parameters in pieces, instead of one very wide range. And it's also always good to explore the outcomes in a plot."

nrec = 2 # number of items to record
out_records = c("Mean number of susceptible at end of simulation, baseline",
                "Mean number of susceptible at end of simulation, increased birth")
out_types = rep("Rounded_Integer",nrec)
out_notes = rep("Report the rounded integer",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 



#########################
# Task 5
#########################
tid = tid + 1
tasktext = "Continue exploring by changing ranges for different parameters or initial conditions. It is likely that for some settings you will get warning or error messages. That generally means that the parameters for a given simulation are such that the differential equation solver can't properly run the model. That usually corresponds to biologically unrealistic parameter settings. Obvious silly choices would be ranges that include negative parameter values, but even some non-negative values of individual parameters or combinations of parameters can produce biologically unrealistic outcomes. Thus, when doing any kind of sampling, pay close attention to any warning or error messages from your code, and give your input ranges and results careful attention to make sure nothing unreasonable os happening."

nrec = 1 # number of items to record
out_records = c("Nothing")
out_types = rep("None",nrec)
out_notes = c("")
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 


#########################
# Task 6
#########################
tid = tid + 1
tasktext = "So far, we have changed parameter ranges and explored how that impacts the result. It is often useful to look at the impact of specific parameters on the outcomes of interest in more detail. This is where sensitivity analysis comes in. We run the same simulations, but now instead of plotting outcomes as a boxplot, we produce scatterplots for outcomes as function of each varied parameter. 

Let's explore that. Set values back as in task 1. Switch the plot type from boxplot to scatterplot, choose _b_ as the parameter for the scatterplot, run 50 samples with seed 100. Take a close look at the scatterplots to investigate the relation between the chosen parameter and the various outcomes. To investigate specific parameters, chose them as the output for the scatterplot. Look at the text below the plots. For each parameter-output pair, the code computes a rank correlation coefficient. Numbers close to 0 mean there is essentially no correlation, close to 1 or -1 means a large positive or negative correlation. (One could compute p-values for these correlations, but they are somewhat meaningless since the values will get smaller the more samples you use, so you can basically produce any p-value you want). With more samples, the patterns of correlation are clearer in the plots. Try running the simulation with different sample sizes (50, 100 or even more) to see the impact.
Compare the scatterplots and correlation coefficients with the results you found above, when you explored changes in _b_ and how they affected infected peak. You should see the same finding as before, namely larger values of _b_ lead to a higher peak."

nrec = 1 # number of items to record
out_records = c("The correlation between _b_ and infection peak, 50 samples, rngseed = 100")
out_types = rep("Numeric",nrec)
out_notes = rep("Report rounded to 2 digits (hundredths)",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 



#########################
# Task 7
#########################
tid = tid + 1
tasktext = "Now we'll revisit the scenario with births and deaths turned on. Set the range for the birth rate _n_ to 50 - 100, the range for death rate _m_ to 0.1 - 0.2. Choose the plot to be a scatterplot, set _n_ as the parameter to be shown on the scatterplots, run 100 samples, seed 100. From the tasks above, we expect that the final number of susceptible is not affected by different values of _n_. Look at the scatter plot and the correlation coefficient to confirm this. 

Now re-run and look at the scatterplots for parameter _m_. From the equation above, we know that larger _m_ means larger _S~final~_. The correlation we find is positive but small.   

There is a lot of scatter in the data, too much to see clear patterns. One could always increase sample size which should help detect patterns. If your computer is fast enough, run 200 or 500 samples. Another option is to restrict the variability to a subset of parameters, which we'll do next."

nrec = 2 # number of items to record
out_records = c("The correlation between _n_ and the final number of susceptibles, 100 samples, rngseed = 100",
                "The correlation between _m_ and the final number of susceptibles, 100 samples, rngseed = 100")
out_types = rep("Numeric",nrec)
out_notes = rep("Report rounded to 2 digits (hundredths)",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 





#########################
# Task 8
#########################
tid = tid + 1
tasktext = "We can get a clearer picture of the impact of a specific parameter on the outcomes by restricting variability in other parameters. To do so, keep everything as you just had, but now set bmin and bmax both to 0.01, thus removing variability in the _b_ parameter. Run the simulation with 100 samples, rngseed=100. You will see less scatter/variation along the y-axis in the scatterplots. Next, reduce gvar to 0.001, re-run. You should see that the scatter keeps reducing and the correlation between _m_ and _S~final~_ is getting stronger. 

Also confirm that there is still no correlation between _n_ and _S~final~_."

nrec = 1 # number of items to record
out_records = c("Correlation between _m_ and final number of susceptibles, bmin=bmax=0.01 and gvar=0.001")
out_types = rep("Numeric",nrec)
out_notes = rep("Report rounded to 2 digits (hundredths)",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 
#save the fully filled task table to a tsv file
alloutcomes$QuizID = paste0(packagename,"_",appsettings$appid)
alloutcomes$AppTitle = appsettings$apptitle
alloutcomes$AppID = appsettings$appid
#remove a few variables from the data frame
savedoutcomes <- dplyr::select(alloutcomes,QuizID,AppID,AppTitle,TaskID,TaskText,RecordID,Record,Type,Note)     
write.table(savedoutcomes, paste0(appsettings$appid,"_tasktable.tsv"), append = FALSE, sep = "\t", row.names = F, col.names = TRUE)
# Take all the text stored in the table and print the tasks and items to record
write_tasktext(alloutcomes)

Further Information {#shinytab4}

References



Try the DSAIDE package in your browser

Any scripts or data that you put into this service are public.

DSAIDE documentation built on Aug. 24, 2023, 1:07 a.m.