Basic Model Fit

Overview

This app illustrates how to fit a mechanistic dynamical model to data and how to use simulated data to evaluate if it is possible to fit a specific model.

The Model

Data

For this app, viral load data from patients infected with influenza is being fit. The data is average log viral titer on days 1-8 post infection. The data comes from (Hayden et al. 1996), specifically the ‘no treatment’ group shown in Figure 2 of this paper.

Another source of ‘data’ is our simulation itself, which can be used to produce artificial data.

Simulation Model

The underlying model that is being fit to the data is the basic virus model used in the app of this name. See that app for a description of the model.

Fitting Model

This app fits the virus kinetics produced by the model simulation to the log viral titer of the data. The fit is evaluated by computing the sum of squared residuals (SSR) between data and model over all data points, i.e. \[ SSR= \sum_t (Vm_t - Vd_t)^2 \] where \(Vm_t\) is the virus load (in log10 units) predicted by the model simulation on days \(t=1,\ldots,8\) and \(Vd_t\) is the data, reported in the same units (log10) and at the same time points. The underlying code varies model parameters to bring the predicted viral load as close as possible to the data, by minimizing the SSR. The app reports the final SSR for the fit.
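The SSR equation above can be sketched in a few lines. This is written in Python purely for illustration (the app itself is implemented in R), and the function name `ssr` and the numbers are hypothetical placeholders, not the Hayden et al. data.

```python
def ssr(model_log10, data_log10):
    """Sum of squared residuals between model predictions and data (both in log10 units)."""
    if len(model_log10) != len(data_log10):
        raise ValueError("model and data need the same number of time points")
    return sum((vm - vd) ** 2 for vm, vd in zip(model_log10, data_log10))


# Made-up log10 virus loads for days 1-8 (illustration only)
data = [2.0, 4.0, 5.0, 4.5, 3.5, 2.5, 1.5, 1.0]
model = [2.5, 4.0, 4.5, 4.5, 3.0, 2.5, 1.5, 1.0]

print(ssr(model, data))  # three residuals of 0.5 each: 3 * 0.25 = 0.75
```

A perfect fit gives an SSR of 0; any mismatch at any time point increases it.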

For this dataset, there is a lower limit of detection (LOD) for the virus load. To account for this, if the data is at the LOD, we set any model prediction which is below the LOD to the LOD. This means we do not penalize the model if it predicts virus load to be at the LOD or any lower value. This is done before computing the SSR using the equation above.
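A minimal sketch of this LOD rule, in Python for illustration only. The function name `ssr_with_lod`, the LOD value of 0.5, and all numbers are assumptions made for the example; the app's actual R code may handle the details differently.

```python
def ssr_with_lod(model_log10, data_log10, lod):
    """SSR where model predictions below the LOD are raised to the LOD
    whenever the corresponding data point sits at the LOD."""
    total = 0.0
    for vm, vd in zip(model_log10, data_log10):
        if vd <= lod and vm < lod:
            vm = lod  # do not penalize predictions below the detection limit
        total += (vm - vd) ** 2
    return total


lod = 0.5                       # hypothetical limit of detection (log10 units)
data = [3.0, 1.5, 0.5, 0.5]     # last two measurements are at the LOD
model = [3.0, 1.0, 0.0, -1.0]   # last two predictions fall below the LOD

print(ssr_with_lod(model, data, lod))            # 0.25: only the second point contributes
print(ssr_with_lod(model, data, float("-inf")))  # 2.75: the same fit without the LOD rule
```

Note that a prediction above the LOD is still penalized when the data point is at the LOD; only predictions at or below the limit are treated as matching.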

In general, with enough data, one could fit/estimate every parameter in the model and the initial conditions. However, with just the virus load data available, the data are not rich enough to allow estimation of all model parameters (even for a model as simple as this). The app is therefore implemented by assuming that most model parameters are known and fixed, and that only 3 can be estimated: the rate of virus production, p, the rate of infection of cells, b, and the rate of virus death/removal, dV. The app also allows you to keep some of those parameters fixed. We’ll explore this in the tasks.

While minimizing the sum of squared differences between data and model prediction is a very common approach, it is not the only one. A more flexible formulation of the problem is to define a likelihood function, which is a mathematical object that quantifies the agreement between model and data and has its maximum for the model settings that most closely describe the data. Under certain assumptions, maximizing the likelihood and minimizing the sum of squares are the same problem. Further details on this are beyond the basic introduction we want to provide here. Interested readers are encouraged to look further into this topic, e.g. by reading about (maximum) likelihood on Wikipedia.
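As a sketch of one such set of assumptions (stated here for illustration, not taken from the app's code): suppose each data point equals the model prediction plus independent, normally distributed noise with constant variance \(\sigma^2\) on the log scale, and let \(N\) be the number of data points. The log-likelihood is then \[ \log L = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_t (Vm_t - Vd_t)^2 = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{SSR}{2\sigma^2} \] For fixed \(\sigma\), the first term does not depend on the model parameters, so the parameter values that maximize \(\log L\) are exactly those that minimize the SSR.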

Computer routines for fitting

A computer routine does the minimization of the sum of squares. Many such routines, generally referred to as optimizers, exist. For simple problems, e.g., fitting a linear regression model to data, any of the standard routines work fine. For the kind of minimization problem we face here, which involves a differential equation, it often makes a difference which numerical optimizer routine one uses. R has several packages for that purpose. In this app, we make use of the optimizer algorithms called COBYLA, Nelder-Mead and Subplex from the nloptr package. This package provides access to a large number of optimizers and is a good choice for many optimization/fitting tasks. For more information, see the help files for the nloptr package and especially the nlopt website.

For any problem that involves fitting ODE models to data, it is often important to try different numerical routines and different starting points to ensure results are consistent. This will be discussed a bit in the tasks.

What to do

The model is assumed to run in units of days.

Task 1

Task 2

Generally, with increasing iterations, the fits get better. A fitting step or iteration is essentially a ‘try’ of the underlying code to find the best possible model. Increasing the number of tries usually improves the fit. In practice, one should not specify a fixed number of iterations. That is done here only so things run reasonably fast. Instead, one should ask the solver to run for as long as it takes until it can’t find a way to further improve the fit (i.e., it can’t further reduce the SSR). The technical expression for this is that the solver has converged to the solution. This can be done with the solver used here (nloptr R package), but it would take too long, so we implement a “hard stop” after the specified number of iterations.
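The difference between a hard stop and running to convergence can be sketched with a toy optimizer. The following Python snippet is an illustration only: `pattern_search` is a simple hand-written routine, not one of the nloptr algorithms, and `toy_ssr` is a made-up one-parameter stand-in for the SSR.

```python
def pattern_search(f, x0, step=1.0, tol=1e-8, max_iter=None):
    """Minimize f by trying x + step and x - step; halve the step if neither helps.

    Returns (x, f(x), iterations, converged). The search has converged once the
    step size drops below tol; max_iter imposes a hard stop before that.
    """
    x, fx = x0, f(x0)
    iterations = 0
    while step >= tol:
        if max_iter is not None and iterations >= max_iter:
            return x, fx, iterations, False  # hard stop, not converged
        iterations += 1
        for candidate in (x + step, x - step):
            f_candidate = f(candidate)
            if f_candidate < fx:
                x, fx = candidate, f_candidate
                break
        else:
            step /= 2  # no improvement in either direction: refine the step
    return x, fx, iterations, True


def toy_ssr(p):
    """Made-up objective with its minimum (value 0) at p = 3."""
    return (p - 3.0) ** 2


print(pattern_search(toy_ssr, 0.0, max_iter=2))  # (2.0, 1.0, 2, False)
print(pattern_search(toy_ssr, 0.0))              # runs until converged
```

With the cap of 2 iterations the search stops at a worse objective value than the converged run, which mirrors what happens in the app when the iteration number is set too low.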

Task 3

Ideally, with enough iterations, all solvers should reach the best fit with the lowest possible SSR. In practice, that does not always happen, and the outcome often depends on the starting conditions. Let’s explore this idea that starting values matter.

Optimizers can ‘get stuck’: even when run for a long time, they might not find the best fit. This happens when a solver finds a local optimum. It has found a good fit, and as it varies parameters, each new fit is worse, so the solver “thinks” it has found the best fit, even though there are better ones further away in parameter space. Many solvers - even so-called ‘global’ solvers - can get stuck. Unfortunately, we can never be sure whether a solution is the true best fit or whether the solver is stuck in a local optimum. One way to check is to try different solvers and different starting conditions, and let each one run for a long time. If all return the same answer, no matter which solver is used and where it starts, it’s quite likely (though not guaranteed) that we found the overall best fit (lowest SSR).
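A minimal sketch of the multiple-starting-points idea, in Python for illustration. The objective `f` is a made-up one-parameter function with one local and one global minimum (not an SSR from the virus model), and plain gradient descent is used as a stand-in local optimizer; the app's nloptr routines work differently.

```python
def f(x):
    """Made-up objective with a global minimum near x = -1.04 and a local one near x = 0.96."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def f_prime(x):
    """Analytic derivative of f."""
    return 4.0 * x * (x * x - 1.0) + 0.3


def gradient_descent(x0, rate=0.01, steps=5000):
    """Simple local optimizer: repeatedly step downhill from the starting value x0."""
    x = x0
    for _ in range(steps):
        x -= rate * f_prime(x)
    return x


starts = [-2.0, -0.5, 0.5, 2.0]
results = []
for start in starts:
    x_hat = gradient_descent(start)
    results.append((f(x_hat), x_hat, start))
    print(f"start {start:5.1f} -> x = {x_hat:.3f}, objective = {f(x_hat):.3f}")

# Keep the fit with the lowest objective value across all starting points
best_value, best_x, best_start = min(results)
print("best fit found from start", best_start)
```

Runs started at 0.5 and 2.0 settle in the local minimum and report a worse objective than runs started at -2.0 and -0.5. Looking at a single run in isolation would give no hint that a better fit exists.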

Task 4

While the unit conversion factor g shows up in most apps, it is arguably not that important if we explore our model without trying to fit it to data. But here, for fitting purposes, it is important. The experimental units are TCID50/mL, so in our model, virus load needs to have the same units. To make all units work, g needs to convert from infectious virions at the site of infection to experimental units. Unfortunately, how one relates to the other is not quite clear. See e.g. (Handel, Longini, and Antia 2007) for a discussion of that. If you plan to fit models to data you collected, you need to pay attention to units and make sure that what you simulate and the data you have are in agreement.

Task 5

One major consideration when fitting these kinds of mechanistic models to data is the balance between data availability and model complexity. The more and “richer” data one has available, the more parameters one can estimate and therefore the more detailed a model can be. If one tries to ‘ask too much’ from the data, it leads to the problem of overfitting - trying to estimate more parameters than can be robustly estimated for a given dataset. One way to safeguard against overfitting is to probe whether the model can in principle recover estimates in a scenario where parameter values are known. To do so, we can use our model with specific parameter values and simulate data. We can then fit the model to this simulated data. If everything works, we expect that - ideally independent of the starting values for our solver - we end up with estimated best-fit parameter values that agree with the ones we used to simulate the artificial data. We’ll try this now with the app.
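The simulate-then-refit check can be sketched as follows, in Python for illustration. To keep the example short, a simple exponential decline of virus load is used as a stand-in for the full virus model; the names `simulate`, `golden_section`, the starting load, and the ‘true’ value of 1.5 are all assumptions made for this sketch.

```python
import math

DAYS = range(1, 9)   # days 1-8, as in the data
LOG10_V0 = 6.0       # hypothetical starting virus load (log10 units), treated as known


def simulate(dV):
    """Stand-in model: log10 virus load declining exponentially at rate dV per day."""
    return [LOG10_V0 - dV * t / math.log(10) for t in DAYS]


def ssr(model, data):
    return sum((m - d) ** 2 for m, d in zip(model, data))


def golden_section(f, lo, hi, tol=1e-8):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    while hi - lo > tol:
        a = hi - inv_phi * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        if f(a) < f(b):
            hi = b  # minimum lies in [lo, b]
        else:
            lo = a  # minimum lies in [a, hi]
    return (lo + hi) / 2


true_dV = 1.5
simulated_data = simulate(true_dV)  # artificial data with a known parameter value

estimate = golden_section(lambda dV: ssr(simulate(dV), simulated_data), 0.01, 10.0)
print(f"true dV = {true_dV}, estimated dV = {estimate:.6f}")
print(f"SSR at the estimate = {ssr(simulate(estimate), simulated_data):.2e}")
```

Here the estimate agrees with the value used to generate the data and the SSR is essentially 0, which is the outcome one hopes for. If the fit had returned a clearly different value, that would be a warning sign that the parameter cannot be estimated from this kind of data.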

Task 6

Let’s see if the fitting routine can recover parameters from a simulation if we start with different initial guesses.

Task 7

If you ran things long enough in the previous task you should have obtained best fit values that were the same as the ones you used to produce the simulated data, and the SSR should have been close to 0. That indicates that you can estimate these parameters with that kind of data. Once you’ve done this test, you can be somewhat confident that fitting your model to the real data will allow you to get robust parameter estimates.

Task 8

Note that since you now change your data after you simulated it, you don’t expect the parameter values for the simulation and those you obtain from your best fit to be the same. However, if the noise is not too large, you expect them to be similar.
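The effect of noise can be sketched with a toy example, in Python for illustration. As a stand-in for the virus model, this uses a simple exponential decline whose best-fit rate has a closed-form least-squares solution; the names, the noise level of 0.2 log10 units, and the ‘true’ value of 1.5 are assumptions made for this sketch.

```python
import math
import random

DAYS = list(range(1, 9))  # days 1-8
LOG10_V0 = 6.0            # hypothetical starting virus load (log10 units), treated as known
TRUE_DV = 1.5             # parameter value used to simulate the data


def simulate(dV):
    """Stand-in model: log10 virus load declining exponentially at rate dV per day."""
    return [LOG10_V0 - dV * t / math.log(10) for t in DAYS]


def fit_dV(data):
    """Least-squares estimate of dV for the stand-in model (intercept fixed at LOG10_V0)."""
    slope = sum(t * (LOG10_V0 - y) for t, y in zip(DAYS, data)) / sum(t * t for t in DAYS)
    return slope * math.log(10)


rng = random.Random(1)  # fixed seed so the example is reproducible
clean = simulate(TRUE_DV)
noisy = [y + rng.gauss(0.0, 0.2) for y in clean]

print(f"estimate from noise-free data: {fit_dV(clean):.4f}")
print(f"estimate from noisy data:      {fit_dV(noisy):.4f}")
```

The noise-free fit returns the simulation value, while the fit to noisy data is shifted away from it. With modest noise the shift is small; increasing the noise standard deviation makes the estimates scatter more widely around the true value.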

Task 9

Further Information

References

Bolker, Benjamin M. 2008. Ecological Models and Data in R. Princeton University Press.
Handel, Andreas, Ira M Longini Jr, and Rustom Antia. 2007. “Neuraminidase Inhibitor Resistance in Influenza: Assessing the Danger of Its Generation and Spread.” PLoS Comput Biol 3 (12): e240. https://doi.org/10.1371/journal.pcbi.0030240.
Hayden, F G, J J Treanor, R F Betts, M Lobo, J D Esinhart, and E K Hussey. 1996. “Safety and Efficacy of the Neuraminidase Inhibitor GG167 in Experimental Human Influenza.” JAMA 275: 295–99.
Hilborn, Ray, and Marc Mangel. 1997. The Ecological Detective: Confronting Models with Data. Monographs in Population Biology 28. Princeton, N.J.: Princeton University Press.
Miao, Hongyu, Xiaohua Xia, Alan S. Perelson, and Hulin Wu. 2011. “On Identifiability of Nonlinear ODE Models and Applications in Viral Dynamics.” SIAM Review 53 (1): 3. https://doi.org/10.1137/090757009.