Flu Fit Documentation

#load various variable definitions that are the same for each app
source('startup_script.R')
currentrmdfile = knitr::current_input() 
appsettings = get_settings(currentrmdfile,appdocdir,packagename)

Overview {#shinytab1}

This app illustrates how to fit an SIR-type model to data and how to use simulated data to evaluate if it is possible to fit a specific model.

Learning Objectives

The Model {#shinytab2}

Data

For this app, weekly mortality data from the 1918 influenza pandemic in New York City is used. The data comes from [@mills04]. You can read a bit more about the data by looking at its help file with `help('flu1918data')`.

The data is reported as new deaths per week per 100,000 individuals. Our model (see next section) tracks cumulative, not new, deaths. The easiest way to match the two is to add up the weekly reported deaths in the data to compute cumulative deaths for each week. We can then fit that quantity directly to the model variable D. An adjustment for population size is also needed, which is done by dividing the reported death rate by 100,000 and multiplying by the population size. This is further discussed in the tasks.
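To make this conversion concrete, here is a minimal sketch in R. It assumes the dataset is a data frame with a column of weekly deaths per 100,000; the column name used here is hypothetical, so check `help('flu1918data')` for the actual structure.

```r
# Sketch: convert weekly deaths per 100,000 into cumulative absolute deaths.
# The column name 'Deaths' is hypothetical; see help('flu1918data') for the real one.
library(DSAIDE)

popsize <- 5e6                      # approximate 1918 population of New York City
dat <- flu1918data                  # weekly new deaths per 100,000 individuals
dat$cumdeaths <- cumsum(dat$Deaths / 1e5 * popsize)   # cumulative deaths, to match model variable D
```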

Alternatively, the model itself can be used to generate artificial data. We can then fit the model to this model-generated data. This is useful for diagnostic purposes, as you will learn by going through the tasks for this app.

Simulation Model

The underlying model that is being fit is a version of the basic SIR model. Since the available data is mortality, we need to keep track of dead individuals in the model, too. This can be achieved by including an additional compartment and letting a fraction of infected individuals move into the dead instead of the recovered compartment.

knitr::include_graphics(here::here('inst/media',appsettings$modelfigname))

The equations for the model are given by

$$
\begin{aligned}
\dot S & = -bSI \\
\dot I & = bSI - gI \\
\dot R & = (1-f)gI \\
\dot D & = fgI
\end{aligned}
$$

Since the individuals in the R compartment are not tracked in the data and do not further influence the model dynamics, we can ignore that compartment and implement the model without it, i.e., the simulation runs these equations:

$$
\begin{aligned}
\dot S & = -bSI \\
\dot I & = bSI - gI \\
\dot D & = fgI
\end{aligned}
$$
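As an illustration of how these reduced equations could be simulated, here is a minimal sketch using the `deSolve` package. The function and object names are chosen for this example and are not the internal DSAIDE implementation.

```r
# Sketch: the reduced S-I-D model as a deSolve-compatible ODE function.
library(deSolve)

sird_ode <- function(t, y, parms) {
  with(as.list(c(y, parms)), {
    dS <- -b * S * I          # susceptibles become infected
    dI <-  b * S * I - g * I  # infected leave the compartment at rate g
    dD <-  f * g * I          # a fraction f of those leaving I die
    list(c(dS, dI, dD))
  })
}

y0    <- c(S = 5e6, I = 1, D = 0)       # initial conditions used in the tasks
parms <- c(b = 1e-6, g = 1, f = 0.01)   # starting parameter values used in the tasks
times <- seq(0, 20, by = 1)             # time in weeks (length chosen for illustration)
out   <- ode(y = y0, times = times, func = sird_ode, parms = parms)
```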

Model Fitting

The app fits the model by minimizing the sum of squared residuals (SSR) between the model predictions for cumulative deaths and the cumulative number of reported deaths over all data points, i.e.

$$ SSR= \sum_t (D_t - D^{data}_t)^2 $$ where the sum runs over the times at which data was reported.

It is also possible to set the app to fit the difference between the logarithm of data and model, i.e. $$ SSR= \sum_t (\log(D_t) - \log(D^{data}_t))^2 $$
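As a concrete illustration of these two objective functions, here is a small helper in R. It is a sketch of the quantity being minimized, not the internal DSAIDE code.

```r
# Sketch: SSR between model-predicted and observed cumulative deaths,
# on either the linear or the log scale.
compute_ssr <- function(model_D, data_D, logfit = FALSE) {
  if (logfit) {
    sum((log(model_D) - log(data_D))^2)
  } else {
    sum((model_D - data_D)^2)
  }
}
```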

The choice to fit the data or the log of the data depends on the setting. Sometimes one approach is more suitable than the other. In this case, both approaches might be considered reasonable. The choice is a scientific one.

The app reports the final SSR for the fit. This is the smallest discrepancy between the data and the model predictions that the fitting routine was able to achieve.

While minimizing the sum of squared differences between data and model prediction is a very common approach, it is not the only one. A more flexible formulation of the problem is to define a likelihood function, which is a mathematical object that compares the difference between model and data based on assumptions about the processes that might have led to the observed data. The likelihood has its maximum for the model settings that most closely describe the data. Under certain assumptions, maximizing the likelihood and minimizing the sum of squares are the same problem. Many modern approaches, both frequentist and Bayesian, use the likelihood. So if you want to learn more about fitting SIR-type models to data, learning more about the likelihood and approaches based on it is a good idea. However, this goes beyond the goals of this app (and all current data-fitting-related apps included in DSAIDE). Interested readers are encouraged to look further into this topic; I provide a few pointers in the resources section.
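To illustrate the connection between least squares and the likelihood mentioned above: if one assumes the observed data are the model predictions plus independent, normally distributed noise with fixed standard deviation, the negative log-likelihood differs from the SSR only by a scaling factor and an additive constant, so minimizing one minimizes the other. A minimal sketch under that assumption:

```r
# Sketch: negative log-likelihood assuming normally distributed measurement error
# with fixed standard deviation sigma. For fixed sigma, minimizing this is
# equivalent to minimizing the SSR.
neg_log_lik <- function(model_D, data_D, sigma = 1) {
  -sum(dnorm(data_D, mean = model_D, sd = sigma, log = TRUE))
}
```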

Computer routines for fitting

The minimization of the sum of squares is done by a computer routine. Many such routines, generally referred to as optimizers, exist. For simple problems, e.g., fitting a linear regression model to data, any of the standard routines work fine. For the kind of minimization problem we face here, which involves a differential equation, it often makes a difference which numerical optimizer routine one uses. R has several packages for that purpose. In this app, we make use of the optimizer algorithms COBYLA, Nelder-Mead and Subplex from the `nloptr` package. This package provides access to a large number of optimizers and is a good choice for many optimization/fitting tasks. For more information, see the help files for the `nloptr` package and especially the nlopt website.

For any problem that involves fitting ODE models to data, it is often important to try different numerical routines and different starting points to ensure results are consistent. This will be discussed a bit in the tasks.

Another feature that is often good to have is the ability to specify lower and upper bounds for parameters. In theory, if the model is a decent approximation of the underlying real system, the best fit should occur for biologically reasonable parameter values, and providing bounds should not be necessary. In practice, however, having bounds is very useful. First, even if the fitting routine would eventually end up with good parameter estimates, it might try unreasonable values along the way. For instance, most of our parameters need to be positive, sometimes they need to be between 0 and 1, and sometimes combinations of parameters can't be unreasonable (e.g. we don't want a combination of transmission rate and recovery rate that would lead to an unrealistically large reproductive number). With any of those unreasonable choices, the code running the differential equation model might 'blow up' and the fitting might fail. Providing bounds for parameters solves that problem. Another issue that might arise is that at the best fit, one of the parameters takes on the value of its lower or upper bound. This means that if you had chosen wider bounds, you might get an even better fit. You can try adjusting the bounds, but at some point the bounds might become biologically unreasonable. If you run into this situation, where the best fit happens for parameter values that are at the bounds or biologically unreasonable, it either means you are trying to estimate more parameters than your data allows, or your model is not properly capturing the real system and is thus mis-specified and needs changing.
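To show how bounds and starting values come together in practice, here is a minimal sketch of a bounded fit using `nloptr`, reusing the `sird_ode` and `compute_ssr` sketches from above. The objective function, bound values, and data object are illustrative choices, not the actual DSAIDE fitting code.

```r
# Sketch: minimize the SSR over (b, g, f) with lower/upper bounds, using the
# Subplex algorithm from nloptr. Assumes sird_ode(), compute_ssr(), y0, times,
# and dat$cumdeaths from the earlier sketches.
library(nloptr)

objective <- function(params, y0, times, data_D, logfit = FALSE) {
  parms <- c(b = params[1], g = params[2], f = params[3])
  out <- deSolve::ode(y = y0, times = times, func = sird_ode, parms = parms)
  compute_ssr(out[, "D"], data_D, logfit = logfit)
}

fit <- nloptr(
  x0     = c(1e-6, 1, 0.01),          # starting values for b, g, f
  eval_f = objective,
  lb     = c(1e-8, 0.01, 1e-4),       # lower bounds (illustrative)
  ub     = c(1e-4, 100, 1),           # upper bounds (illustrative)
  opts   = list(algorithm = "NLOPT_LN_SBPLX", maxeval = 200),
  y0 = y0, times = times, data_D = dat$cumdeaths, logfit = FALSE
)
fit$solution   # best-fit values for b, g, f found within the iteration budget
```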

A side note: Using a Bayesian approach is an alternative to providing constraints for parameters, and it is in some sense a more disciplined way of doing so. The problem is that fitting differential equation models (or stochastic equivalents) in a Bayesian framework is still computationally very expensive and often takes too long to be a reasonable option, though fitting software is getting more powerful, so this will become more and more feasible.

What to do {#shinytab3}

Since the data is in weeks, we also run the model in units of weeks.

#this is the running counter for the records which starts at 1 
rc=1

#empty object, will hold all outcomes
alloutcomes = NULL

#########################
# Task 1
#########################
tid = 1
tasktext = "The population of New York City in 1918 was roughly 5*10^6^ individuals. Use that number as the initial value for the susceptibles. Set initial infected individuals to 1 and initial dead to 0. The model parameters, _b_, _g_, and _f_, are being fit. Even though they are being fit/estimated, we need to provide starting values for the fitting routine. Set the starting value for the infection rate to _b=1e-6_, set the initial recovery rate such that it corresponds to an infectious period of one week, and start with the assumption that one percent of infected individuals died.

For each fitted parameter, choose lower and upper bounds that are 100 times lower/higher than the starting value. In general, the lower/upper bound has to be lower/higher than or equal to the starting value; otherwise you will get an error message when you try to run the model. For now, we ignore the option to generate simulated data. Therefore, set usesimdata to 0; the parameter values for simulated data can then be arbitrary, since they will be ignored. The noise parameter will also be ignored. We'll fit the data 'as is' (not the log-transformed data), so set logfit to 0. Start with a maximum of 1 fitting step/iteration and solver type 1. Set the random number seed to 100. We need to set the seed for reproducibility since we'll create some random numbers below and some of the fitting routines involve random numbers. Run the simulation. Since you only do a single iteration, nothing is optimized. We are just doing this so you can see the time-series produced with these starting conditions. The 'best fit' parameter values are the same ones you started with. With that choice, your SSR should be 7.3*10^12^.

To see the (mis)match between data and model, you will probably want to look at plots with the y-axis both on the linear and log scale."
nrec = 1 # number of items to record
out_records = c("Nothing")
out_types = rep("None",nrec)
out_notes = c("")
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 

#########################
# Task 2
#########################
tid = tid + 1
tasktext = "Set the maximum number of iterations to 20, and re-run the simulation. All other settings should remain the same as in task 1. Look at the results. The plot shows the final fit. The model-predicted curve for deaths will be closer to the data, and the SSR value should have gone down, to 1.2E12, indicating a better fit. The values of the fitted parameters at the end of the fitting process are printed below the figure. Set the iterations to 200, and re-run the simulation. You should see further improvement in the SSR, to around 1.6E11. That indicates the previous fit was not the best fit. (The best fit is the one with the lowest possible SSR.)"
nrec = 2 # number of items to record
out_records = c("Estimate for parameter f (fraction dying) after 20 iterations, solver 1",
            "Estimate for parameter f (fraction dying) after 200 iterations, solver 1")
out_types = rep("Numeric",nrec)
out_notes = rep("Round to two significant digits, report in non-scientific notation (as X.YZ)",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 


#########################
# Task 3
#########################
tid = tid + 1
tasktext = "Repeat the fits for 1, 20 and 200 iterations using solver/optimizer type 2 (solvertype = 2), then do it again for type/number 3. Keep all other settings as before. You will notice that the different solvers do not give the same results. Depending on the speed of your computer, you can try to increase the number of iterations and see what best fit (smallest SSR) you can achieve. 

Theoretically, there is a single best fit that all solvers/optimizers should find if you run them for enough iterations. Unfortunately, that is often not the case in practice. For some fitting problems, e.g. fitting a linear model, the type of numerical optimizer routine that is used doesn't matter; as long as you run it for enough iterations, you'll find the best fit. Unfortunately, for fitting ODE-type models, that is rarely the case. It is therefore important to try multiple optimizers (and multiple starting values). By trying different numerical routines and different starting values, and running each for many iterations, you might be able to find the best overall fit (but there is unfortunately never a guarantee). 

Generally, the fits get better with increasing iterations. A fitting step or iteration is essentially an attempt by the underlying code to find a better set of parameter values; more attempts usually improve the fit. In practice, one should not specify a fixed number of iterations (we do it here so things run reasonably fast). Instead, one should let the solver run until it can't find a way to improve the fit any further (i.e., can't further reduce the SSR). The technical expression for this is that the solver has converged to the solution. This can be done with the solver used here (the `nloptr` R package), but it would take too long, so we implement a hard stop after the specified number of iterations."

nrec = 2 # number of items to record
out_records = c("Estimate for parameter f (fraction dying) after 200 iterations, solver 2",
            "Estimate for parameter f (fraction dying) after 200 iterations, solver 3")
out_types = rep("Numeric",nrec)
out_notes = rep("Round to two significant digits, report in non-scientific notation (as X.YZ)",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 



#########################
# Task 4
#########################
tid = tid + 1
tasktext = "As mentioned, when fitting difficult models like ODE models to data, it can be hard to find the best fit. Trying different solvers is useful. Another important consideration is the choice of starting values for the fitted parameters. Let's explore how starting values matter. For a single iteration (no fitting), with the starting conditions we have chosen so far (S=5E6, I=1, D=0, b=1E-6, g=1, f=0.01), we get an SSR=7.3E12. In the previous task, you found that with solver 3 and 200 iterations you get SSR=2.3E10 (also note the values for the best-fit parameters).

Now change the starting values to half a week of infectiousness duration and a 5 percent mortality fraction. Keep the bounds unchanged. Re-run the fit with solver 3 for both 1 and 200 iterations. You should find before/after SSR values of 4.7E12 and 1.8E11. This means the starting values were somewhat better (7.3E12 versus 4.7E12), but it didn't help for the fitting, at least not for 200 steps (2.3E10 versus 1.8E11). 

By trying different starting values, solvers, and numbers of iterations, you can get an idea of the influence starting conditions can have on fitting performance and results. In general, picking good starting values is important. One can get them by trying an initial visual fit, or by doing several short fits and using the best-fit values at the end as starting values for a new fit.

Especially if you want to fit multiple parameters, optimizers can 'get stuck'. If they get stuck, even running them for a long time might not find the best fit. One way an optimizer can get stuck is when a solver finds a local optimum. The local optimum is a good fit, and now as the solver varies parameters, each new fit is worse, so the solver 'thinks' it found the best fit, even though there are better ones further away in parameter space. Many solvers - even so-called 'global' solvers - can get stuck. Unfortunately, we never know if the solution is real or if the solver is stuck in a local optimum. One way to figure this out is to try different solvers and different starting conditions, and let each one run for a long time. If all return the same answer, no matter what type of solver you use and where you start, it's quite likely (though not guaranteed) that we found the overall best fit (lowest SSR).

The problem of 'getting stuck' is something that frequently happens when trying to fit ODE models, in contrast to fitting more standard models (e.g., a linear regression model), where it is not a problem. The technical reason for this is that a simple regression optimization is _convex_ while fitting an ODE model is usually not. That's why you don't have to worry about whether you found the right solution when you use the `lm` or `glm` functions for fitting in `R`. When fitting more complicated models such as ODEs or similar models, you do have to carefully check that the 'best fit' is not the result of a local optimum."

nrec = 2 # number of items to record
out_records = c("Estimate for parameter f (fraction dying) after 200 iterations, solver 3 for initial starting values",
            "Estimate for parameter f (fraction dying) after 200 iterations, solver 3 for second set of starting values")
out_types = rep("Numeric",nrec)
out_notes = rep("Round to two significant digits, report in non-scientific notation (as X.YZ)",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 

#########################
# Task 5
#########################
tid = tid + 1
tasktext = "When fitting models to data, there needs to be a balance between the amount of available data and the number of model parameters you want to estimate. If you 'ask too much' from the data, it leads to the problem of overfitting. Overfitting can be thought of as trying to estimate more parameters than can be robustly estimated for a given dataset. One way to check if overfitting might be a problem is to run the model with known values of the parameters you want to fit, use the model result to generate artificial data, and then use that generated data for fitting. If everything works, we expect that - ideally independent of the starting values for our solver - we end up with estimated best-fit parameter values that agree with the ones we used to simulate the artificial data. If we don't, it's an indication that something went wrong (e.g. bad starting conditions) or that the data are just not sufficient to reliably estimate all the parameters we want to determine. 

We'll try this now with the app. Set everything as in task 1. Set the parameter values _bsim_, _gsim_, and _fsim_ to the same values as those used for starting the fitting routine (_S_ = 5E6, _I_ = 1, _D_ = 0, _b_ = _bsim_ = 1E-6, _g_ = _gsim_ = 1, _f_ = _fsim_ = 0.01). Set _usesimdata_ to 1 and keep _noise_ at 0. Run for 1 fitting step with solver 3. You should now see that the data has changed: instead of the real data, we now use simulated data. Since the parameter values for the simulated data and the starting values for the fitting routine are the same, the time-series lies on top of the data, and the SSR is (up to rounding errors) 0. 

Now let's see if the fitting routine can recover the parameters from a simulation if we start with different initial/starting values. Choose as values for the simulated-data parameters _bsim_ = 5E-7, _gsim_ = 0.5 and _fsim_ = 0.02. Keep everything else unchanged. Importantly, the starting values for the fitted parameters are now different from the values used for the simulation. Fit to the simulated data and run for 1 iteration. You'll see the (simulated) data change again. The SSR should be 1.3E10. If you now run the fitting for many iterations/steps, what do you expect the final fit values for the parameters and the SSR to be? Test your expectation by running solver 2 for 200 steps. You can also try different numbers of iterations and different solvers.

If things work ok, we should expect that with enough iterations, the estimated values for the parameters are the same as those used to generate the artificial data. That seems to be the case here. That indicates that you can potentially estimate these 3 model parameters with the available data, at least if there is no noise. This is the most basic test. If you can't get the best fit values to be the same as the ones you used to make the data, it means you are trying to fit more parameters than your data can support, i.e., you are overfitting. At that point, you will have to either get more data or reduce your fitted parameters. Reducing fitted parameters can be done by either fixing some parameters based on biological a priori knowledge or by reducing the number of parameters through model simplification."

nrec = 1 # number of items to record
out_records = c("Estimate for parameter f (fraction dying) after 200 iterations, solver 2")
out_types = rep("Numeric",nrec)
out_notes = rep("Round to two significant digits, report in non-scientific notation (as X.YZ)",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 


#########################
# Task 6
#########################
tid = tid + 1
tasktext = "To make things a bit more realistic and harder, one can also add noise on top of the simulated data. Try that by playing with the 'noise added' parameter and see how well you can recover the parameter values used for the simulation. Keep values as before, with a mismatch between the parameter values for the simulated data and the starting values for the fitting routine (_S_ = 5E6, _I_ = 1, _D_ = 0, _b_ = 1E-6, _g_ = 1, _f_ = 0.01, _bsim_ = 5E-7, _gsim_ = 0.5, _fsim_ = 0.02). Now we'll also add noise of strength 0.1 (uniformly distributed random numbers scaled by the parameter values; see the underlying code for details, which are not that important here). Run solver 2 for 200 steps. Now the SSR is not close to zero (it should be approximately 1E8) and the estimated parameters are not exactly those that were used to run the simulation, since we perturbed the simulated data with some noise. But they should be close. If you are able to get back values that are close to the ones you used to simulate the data, it's a good sign that you might be able to estimate all your parameters using the real data.

Play around with different values for the parameters used to generate artificial data, and with different starting conditions, and see if you find scenarios where the solver does not reach best-fit values that agree with the ones used to create the data. You will likely find that for certain combinations of simulated-data parameters, added noise, and specific starting conditions, you might not get estimates that are close to those you used to create the data. This suggests that even for this simple model with 3 parameters, estimating those 3 parameters based on the available data is not straightforward."
nrec = 1 # number of items to record
out_records = c("Estimate for parameter g after 200 iterations, solver 2")
out_types = rep("Numeric",nrec)
out_notes = rep("Round to two significant digits, report in non-scientific notation (as X.YZ)",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 




#########################
# Task 7
#########################
tid = tid + 1
tasktext = "Set everything back as in task 1 (so we'll use the real data again). Run solver 3 for 200 iterations and plot the results with the y-axis on a log scale. You will see that the initial data points are not fit well, while later ones are fit better. We are currently fitting the data on a linear scale. That means that in the SSR equation, the differences between the model and the larger data values often dominate, since those differences themselves tend to be larger in magnitude. As an extreme example, suppose you fit 2 data points: one is at 1E10 and the model predicts 1.1E10, a difference of 1E9. The second data point is 1E7, and the model predicts 1E6. This is in some sense a more significant discrepancy, but the difference is only 9E6, much smaller than the 1E9 for the first data point. What this means is that sometimes it makes more sense to fit data on a log scale. Note that a switch of scales means a different scientific question/problem, and the choice of scale should be driven by scientific considerations.

Let's take a look at this. Set everything as in task 1, run a single iteration with solver 3. The result should be familiar. Now set logfit to 1, repeat. The plot should look the same, but the SSR is now computed on the log of the data and the model and should be 40.72. What this means is that if we now run the optimizer, it tries to minimize a different equation (the SSR of the log of the data and model instead of the SSR of the data and model on the original scale). That means we should expect different results. Run solver 3 for 200 steps for logfit=0, then repeat for logfit=1. The final SSR obviously changes, but so does the best fit curve and the estimates for the parameter values.

You can now explore all the tasks we went through above while fitting the log of the data. While the general take-home messages are the same, the details will change. Again, fitting data on either a linear or log scale (or after some other transformation) can, in the right circumstances, all be reasonable approaches; the choice should be made based on the underlying biology/science. A good rule of thumb is that if the data span several orders of magnitude, fitting on the log scale is probably the better option." 
nrec = 2 # number of items to record
out_records = c("Estimate for parameter f after 200 iterations, solver 3, linear scale",
                "Estimate for parameter f after 200 iterations, solver 3, log scale")
out_types = rep("Numeric",nrec)
out_notes = rep("Round to two significant digits, report in non-scientific notation (as X.YZ)",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 



#########################
# Task 8
#########################
tid = tid + 1
tasktext = "You might have noticed that we fixed the starting values for the model. The chosen values seem reasonable for the number susceptible and dead, but it's probably not the best idea to set the initial number of infected persons to 1. In reality, there were probably already more infected individuals at the time the recording of these death data started. I could have written the model so you could fit the initial number of infected. I didn't, since it would overcomplicate the example and likely lead to more overfitting. In general, you have different options for setting any initial value or parameter of a model. You can fix it to some value based on outside scientific knowledge. You can fit it and thus let the data tell you what the value is. Finally, you can manually choose different values and see how well the results agree with the data (or general knowledge of the system). The important part is to be very clear in your description about what you do/did and why.

Let's briefly explore the manual adjustment of the number infected. Set values back to those of task 1. Do a quick run with a single iteration to make sure you get the SSR from task 1. Look at the plot with the y-axis on a logarithmic scale. You'll again see that the initial values for deaths between model and data don't match well. Now set the initial number of infected to 1000 and re-run a single iteration. You'll notice in the plot that the model-predicted deaths move closer to the data in the first few weeks. Maybe surprisingly, the SSR does not shrink. Can you figure out why? Hint: It has to do with what you learned in the last task. Take another look at the SSR equation and think about the impact of the later (higher value) data points, compared to the earlier ones, on the total SSR.

Now run solver 3 for 500 iterations with 1 initial infected. Depending on the speed of your computer, you might need to be patient. You should get an SSR of 1.1E10. (What does that mean for all the 'best fits' we found above with fewer iterations?) Then run again for the same number of iterations, now with 1000 initial infected. Take a look at the SSR. Based on that value, did this manual adjustment of the initial number of infected improve the model fit or not?

You can also explore if/how things change if you do this with the data fitted on a log scale."

nrec = 1 # number of items to record
out_records = c("The fit with 1000 initial infected is better (TRUE/FALSE)")
out_types = rep("Logical",nrec)
out_notes = rep("Report TRUE or FALSE",nrec)
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 




#########################
# Task 9
#########################
tid = tid + 1
tasktext = "For the previous model fit, and most others we have looked at, take a more careful look at the predicted/best-fit fraction of infected that are dying. If you are not familiar with the 1918 flu, look up the estimated mortality and compare the actual estimates to what our model produces. What do you conclude? 

A plus of the kind of models we use here is that most parameter values have a direct biological meaning. That means we can - and need to - compare the best-fit values with reality. In some cases, such as here, those best-fit values are unreasonable. A similar problem can occur if you choose lower and upper bounds for your parameters based on known biology and the best fit occurs at a bound. Either case indicates that 'something isn't quite right'. If you have ruled out overfitting, which can produce such results, it means that your model doesn't properly capture the underlying processes for realistic parameter values, and thus needs to be modified further. This is good, because if you built your initial model based on what you thought was going on in the system, you have now learned that there is something missing, i.e. you've gained a useful insight into your system. I'm a big proponent of reporting such 'failures' in the literature; it is useful information. Unfortunately, quite often such 'dead ends' are not reported.

You might have noticed by going through these tasks that fitting models can be tricky. It is, especially if you try to fit the kinds of simulation models we are covering here. It's a lot of trial and error and practice. If you want to practice some more, keep exploring this app. You might find strange behavior in this app that you don't expect. Try to get to the bottom of what might be going on. This is an open-ended exploration, try different things and try to understand as much as possible of what you observe.

Going deeper will require some more formal training on this topic, see e.g. the materials listed in the Further Resources section."
nrec = 1 # number of items to record
out_records = c("Nothing")
out_types = rep("None",nrec)
out_notes = c("")
outcomes = data.frame( TaskID = rep(tid,nrec),
                       TaskText = rep(tasktext,nrec),
                      RecordID = paste0('T',tid,'R',(1:nrec)),
                      Record = out_records, 
                      Type = out_types, 
                      Note = out_notes)
alloutcomes = rbind(alloutcomes,outcomes)
rc = rc + nrec #increment record counter by number of outcomes to record for this task 
#save the fully filled task table to a tsv file
alloutcomes$QuizID = paste0(packagename,"_",appsettings$appid)
alloutcomes$AppTitle = appsettings$apptitle
alloutcomes$AppID = appsettings$appid
#remove a few variables from the data frame
savedoutcomes <- dplyr::select(alloutcomes,QuizID,AppID,AppTitle,TaskID,TaskText,RecordID,Record,Type,Note)     
write.table(savedoutcomes, paste0(appsettings$appid,"_tasktable.tsv"), append = FALSE, sep = "\t", row.names = F, col.names = TRUE)
# Take all the text stored in the table and print the tasks and items to record
write_tasktext(alloutcomes)

Further Information {#shinytab4}

References


