Overview

The Stochastic Process Model (SPM) was developed several decades ago [1,2] and has been applied to the analysis of clinical, demographic, and epidemiologic longitudinal data, as well as in many other studies that relate the stochastic dynamics of repeated measures to the probability of end-points (outcomes). SPM links the dynamics of stochastic variables to a hazard rate that is a quadratic function of the state variables [3]. The R package “stpm” is a set of utilities for estimating the parameters of stochastic processes and for modeling survival trajectories and time-to-event outcomes observed in longitudinal studies. It is a general framework for studying and modeling survival (censored) traits that depend on random trajectories (stochastic paths) of variables.

Installation

require(devtools)
devtools::install_github("izhbannikov/stpm")

If you experience errors during installation, please download a binary file from the following url:

Then execute this command (from the R environment):

install.packages("<path to the downloaded r-package stpm>", repos=NULL, type="binary")

Data description

The data are typical longitudinal data in the form of two datasets: a longitudinal dataset (follow-up studies), in which one record represents a single observation, and vital (survival) statistics, in which one record holds all information about a subject. The longitudinal dataset can contain a subject ID (identification number), status (event (1) / no event (0)), time, and measurements of the variables. The package can handle an arbitrary number of variables, but in practice 5-7 variables are enough.

Below is an example of clinical data that can be used with the package; the fields are described in the next section.

Longitudinal studies:

##   ID IndicatorDeath Age      DBP      BMI
## 1  1              0  30 80.00000 25.00000
## 2  1              0  32 80.51659 26.61245
## 3  1              0  34 77.78412 29.16790
## 4  1              0  36 77.86665 32.40359
## 5  1              0  38 96.55673 31.92014
## 6  1              0  40 94.48616 32.89139

Vital statistics:

##   ID IsDead   LSmort
## 1  1      1 85.34578
## 2  2      1 80.55053
## 3  3      1 98.07315
## 4  4      1 81.29779
## 5  5      1 89.89829
## 6  6      1 72.47687

Data fields description

Longitudinal studies
  • ID - unique subject identification number.
  • IndicatorDeath - 0/1, indicates the death of a subject.
  • Age - current age of the subject.
  • AgeNext - age of the subject at the next survey/exam.
  • DBP, BMI - covariates; here “DBP” is diastolic blood pressure and “BMI” is body-mass index.
Vital (survival) statistics
  • ID - unique subject ID.
  • IsDead - death indicator: 0 - alive, 1 - dead.
  • LSmort - age at death or at the end of observation.
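The longitudinal example above lists Age but not AgeNext. A minimal base-R sketch of how AgeNext could be derived from Age within each subject is given here. The data frame name longdat and the values for the second subject are made up for illustration, and whether the package expects the column in exactly this form is an assumption.

```r
# Toy longitudinal data: subject 1 uses ages from the example above,
# subject 2 is hypothetical.
longdat <- data.frame(
  ID  = c(1, 1, 1, 2, 2),
  Age = c(30, 32, 34, 40, 43),
  DBP = c(80.00000, 80.51659, 77.78412, 85, 88)
)

# For every subject, shift Age up by one record; the last record of a
# subject has no next visit, so it gets NA.
longdat$AgeNext <- ave(longdat$Age, longdat$ID,
                       FUN = function(x) c(x[-1], NA))
print(longdat)
```

Records with a missing AgeNext mark the last observation of a subject.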

Discrete- and Continuous-time models

There are two main SPM types in the package: a discrete-time model [4] and a continuous-time model [3]. The discrete model assumes equal intervals between follow-up observations. An example of a discrete dataset is given below.

library(stpm)
data <- simdata_discr(N=10, ystart=c(80), k=1)
head(data)
##      id xi t1 t2       y1  y1.next
## [1,]  1  0 30 31 80.00000 83.11505
## [2,]  1  0 31 32 83.11505 82.44254
## [3,]  1  0 32 33 82.44254 84.26288
## [4,]  1  0 33 34 84.26288 74.32822
## [5,]  1  0 34 35 74.32822 66.50182
## [6,]  1  0 35 36 66.50182 68.22564

In this case the intervals between t1 and t2 (Age and Age.next) are equal.

In the continuous case, by contrast, the intervals between observations are not equal. An example of a continuous-case dataset is shown below:

library(stpm)
data <- simdata_cont(N=5,ystart = c(50))
head(data)
##   id xi       t1       t2       y1  y1.next
## 1  1  0 34.64537 35.88450 48.93623 50.04893
## 2  1  0 35.88450 37.52373 50.04893 43.54601
## 3  1  0 37.52373 40.50541 43.54601 43.71541
## 4  1  0 40.50541 43.31533 43.71541 41.48778
## 5  1  0 43.31533 44.97644 41.48778 44.79841
## 6  1  0 44.97644 46.32351 44.79841 37.87642
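The difference between the two layouts can be checked directly from the t1 and t2 columns. The sketch below uses a few rows copied from the two outputs above; the helper name equal_intervals and the tolerance are illustrative assumptions, not part of the package.

```r
# A few (t1, t2) pairs copied from the discrete and continuous examples
discr <- data.frame(t1 = 30:35, t2 = 31:36)
cont  <- data.frame(t1 = c(34.64537, 35.88450, 37.52373),
                    t2 = c(35.88450, 37.52373, 40.50541))

# TRUE when all observation intervals have the same length
equal_intervals <- function(d, tol = 1e-8) {
  dt <- d$t2 - d$t1
  all(abs(dt - dt[1]) < tol)
}

equal_intervals(discr)
equal_intervals(cont)
```

The first call returns TRUE and the second FALSE, so only the first dataset fits the equal-interval assumption of the discrete model.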

Discrete model

In the discrete model, we use the following assumptions:

\[ \bar{y}(t+1) = \bar{u} + \bar{R} \times \bar{y}(t) + \bar{\epsilon} \] (1)

\[ \mu(t) = \mu_0(t) + \bar{b}(t) \times \bar{y}(t) + \bar{Q}(t) \times \bar{y}(t)^2 \] (2)

where:

\[ \mu_0(t) = \mu_0 e^{\theta t} \]
\[ \bar{b}(t) = \bar{b} e^{\theta t} \]
\[ \bar{Q}(t) = \bar{Q} e^{\theta t} \]
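As a rough illustration of equation (2) in one dimension, the sketch below evaluates the hazard on a grid of covariate values. The coefficient values are copied from the pars1 output of the example in this section; the function name hazard_discr, the age of 50, and the grid are illustrative assumptions.

```r
# Hazard from equation (2), one-dimensional case:
# mu(t) = exp(theta * t) * (mu0 + b * y + Q * y^2)
hazard_discr <- function(t, y, mu0, b, Q, theta) {
  exp(theta * t) * (mu0 + b * y + Q * y^2)
}

y_grid <- seq(40, 120, by = 0.01)
mu <- hazard_discr(t = 50, y = y_grid,
                   mu0 = 0.0001255953378, b = -2.933038626e-06,
                   Q = 1.826654965e-08, theta = 0.082)

# The quadratic form makes the hazard U-shaped in y;
# locate the covariate value with the lowest hazard.
y_min <- y_grid[which.min(mu)]
y_min
```

With these coefficients the hazard stays positive over the grid and is smallest near y = 80.28, which is -b / (2Q).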

Example

library(stpm)
data <- simdata_discr(N=2000)
#Parameters estimation
pars <- spm_discrete(data)
pars
## $pars1
## $pars1$theta
## [1] 0.082
## 
## $pars1$mu0
## [1] 0.0001255953378
## 
## $pars1$b
## [1] -2.933038626e-06
## 
## $pars1$Q
##                 [,1]
## [1,] 1.826654965e-08
## 
## $pars1$u
## [1] 3.991961635
## 
## $pars1$R
## [1] 0.9501822089
## 
## $pars1$Sigma
## [1] 4.997780501
## 
## 
## $pars2
## $pars2$a
##                [,1]
## [1,] -0.04981779111
## 
## $pars2$f1
##             [,1]
## [1,] 80.13124522
## 
## $pars2$Q
##                 [,1]
## [1,] 1.826654965e-08
## 
## $pars2$f
##             [,1]
## [1,] 80.28441829
## 
## $pars2$b
##             [,1]
## [1,] 4.997780501
## 
## $pars2$mu0
##                 [,1]
## [1,] 7.856687818e-06
## 
## $pars2$theta
## [1] 0.082

Continuous model

\[ \mu(u) = \mu_0(u) + (\bar{m}(u) - \bar{f}(u))^* \times \bar{Q}(u) \times (\bar{m}(u) - \bar{f}(u)) + Tr(\bar{Q}(u) \times \bar{\gamma}(u)) \] (3)

\[ d\bar{m}(t)/dt = \bar{a}(t) \times (\bar{m}(t) - \bar{f_1}(t)) - 2 \bar{\gamma}(t) \times \bar{Q}(t) \times (\bar{m}(t) - \bar{f}(t)) \] (4)

\[ d\bar{\gamma}(t)/dt = \bar{a}(t) \times \bar{\gamma}(t) + \bar{\gamma}(t) \times \bar{a}(t)^* + \bar{b}(t) \times \bar{b}(t)^* - 2 \bar{\gamma}(t) \times \bar{Q}(t) \times \bar{\gamma}(t) \] (5)
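To make equations (3)-(5) concrete, here is a minimal one-dimensional sketch that integrates (4) and (5) with a plain Euler scheme and accumulates the hazard (3) into a survival probability. It assumes that m and gamma play the role of the mean and variance of the covariate, that Q and mu0 are scaled by exp(theta * t) as in the discrete model, and it borrows the parameter values from the spm_continuous call in the example of this section. The step size h and the starting values are arbitrary choices, and this is not the package's internal implementation.

```r
a <- -0.05; f1 <- 80; Q <- 2e-8; f <- 80; b <- 5; mu0 <- 2e-5; theta <- 0.08

t0 <- 30; t_end <- 32      # integrate over a two-year interval
h  <- 0.01                 # Euler step (assumed)
n_steps <- round((t_end - t0) / h)

m <- 80; gamma <- 0        # assumed starting mean and variance
cumhaz <- 0                # running integral of mu over [t0, t_end]

for (i in seq_len(n_steps)) {
  t  <- t0 + (i - 1) * h
  Qt <- Q * exp(theta * t)
  # Equation (3) in one dimension; Tr(Q * gamma) reduces to Qt * gamma
  mu <- mu0 * exp(theta * t) + Qt * (m - f)^2 + Qt * gamma
  cumhaz <- cumhaz + mu * h
  # Equations (4) and (5) in one dimension
  dm     <- a * (m - f1) - 2 * gamma * Qt * (m - f)
  dgamma <- 2 * a * gamma + b^2 - 2 * gamma^2 * Qt
  m      <- m + h * dm
  gamma  <- gamma + h * dgamma
}

S <- exp(-cumhaz)          # probability of surviving from t0 to t_end
c(m = m, gamma = gamma, S = S)
```

Because the starting mean equals both f1 and f here, m does not move, while gamma grows from zero as the diffusion term b^2 accumulates.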

Example

library(stpm)
#Reading the data:
data <- simdata_cont(N=100)
head(data)
##   id xi          t1          t2          y1     y1.next
## 1  1  0 48.71323423 51.61380979 81.56853722 81.38800629
## 2  1  0 51.61380979 54.48693801 81.38800629 82.69331064
## 3  1  0 54.48693801 56.67844415 82.69331064 78.73080120
## 4  1  0 56.67844415 58.56543745 78.73080120 79.64379327
## 5  1  0 58.56543745 61.16421601 79.64379327 74.06096143
## 6  1  0 61.16421601 62.93750423 74.06096143 81.39526528
#Parameters estimation:
pars <- spm_continuous(dat=data[,2:6],a=-0.05, f1=80, 
                       Q=2e-8, f=80, b=5, mu0=2e-5, theta=0.08, k = 1)
pars
## $a
##       [,1]
## [1,] -0.05
## 
## $f1
##      [,1]
## [1,]   80
## 
## $Q
##                [,1]
## [1,] 2.83213213e-08
## 
## $f
##      [,1]
## [1,]   80
## 
## $b
##      [,1]
## [1,]    5
## 
## $mu0
## [1] 2.000000614e-05
## 
## $theta
## [1] 0.08000000001
## 
## $limit
## [1] FALSE

Coefficient conversion between continuous- and discrete-time models

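The equations below express the continuous-time coefficients (left-hand sides) through the discrete-time coefficients (right-hand sides); diag(k) is the k by k identity matrix and t() denotes the transpose.

\[ Q = Q \]
\[ \bar{a} = \bar{R} - diag(k) \]
\[ \bar{b} = \bar{\epsilon} \]
\[ \bar{f_1} = -1 \times \bar{u} \times \bar{a}^{-1} \]
\[ \bar{f} = -0.5 \times \bar{b} \times \bar{Q}^{-1} \]
\[ \mu_0 = \mu_0 - \bar{f} \times \bar{Q} \times t(\bar{f}) \]
\[ \theta = \theta \]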
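The conversion can be checked by hand against the spm_discrete output shown earlier. The sketch below applies the formulas to the pars1 values in the one-dimensional case, where matrices reduce to scalars; the variable names with _d and _c suffixes (discrete and continuous) are introduced here only for illustration.

```r
# Discrete-time estimates copied from the pars1 output (k = 1)
u     <- 3.991961635
R     <- 0.9501822089
Sigma <- 4.997780501
b_d   <- -2.933038626e-06
Q     <- 1.826654965e-08
mu0_d <- 0.0001255953378

# Continuous-time coefficients, following the conversion formulas
a     <- R - 1                 # R - diag(k) with k = 1
f1    <- -1 * u / a            # -u * a^{-1}
f     <- -0.5 * b_d / Q        # -0.5 * b * Q^{-1}
b_c   <- Sigma                 # diffusion taken from the noise term
mu0_c <- mu0_d - f * Q * f     # mu0 - f * Q * t(f)

c(a = a, f1 = f1, f = f, b = b_c, mu0 = mu0_c)
```

The results agree with the pars2 block of the same output (a about -0.0498, f1 about 80.13, f about 80.28, mu0 about 7.86e-06) up to rounding of the printed inputs.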

Model with time-dependent coefficients

In the previous models, we assumed a specific form of time dependence for the coefficients: they were multiplied by \[e^{\theta t}\]. In general, this may not be the case [5]. We therefore extend the model to the general case; for example, in the one-dimensional case:

\[ \bar{a}(t) = par_1 t + par_2 \] - a linear function.

The corresponding equations are equivalent to the one-dimensional continuous case described above.

Example

library(stpm)
#Data preparation:
n <- 500
data <- simdata_time_dep(N=n)
# Estimation:
opt.par <- spm_time_dep(data[,2:6], 
                        start = list(a = -0.05, f1 = 80, Q = 2e-08, f = 80, b = 5, mu0 = 0.001), 
                        f = list(at = "a", f1t = "f1", Qt = "Q", ft = "f", bt = "b", mu0t= "mu0"))
opt.par
## [[1]]
## [[1]]$a
## [1] -0.04653639696
## 
## [[1]]$f1
## [1] 79.0433917
## 
## [[1]]$Q
## [1] 1.729348342e-08
## 
## [[1]]$f
## [1] 99.51038054
## 
## [[1]]$b
## [1] 3.75
## 
## [[1]]$mu0
## [1] 0.001249983971

Simulation (trajectory projection)

We added one- and multi-dimensional simulation to generate test data for hypothesis testing. The simulated data can be discrete (equal intervals between observations) or continuous (arbitrary intervals).

Discrete-time

The corresponding function is:

simdata_discr(N=100, a=-0.05, f1=80, Q=2e-8, f=80, b=5, mu0=1e-5, theta=0.08, ystart=80, tstart=30, tend=105, dt=1, k=1)

Here:

N - Number of individuals.

a - A k by k matrix that characterizes the rate of the adaptive response.

f1 - A particular state, which is a deviation from the normal (or optimal) state. This is a vector of length k.

Q - A k by k non-negative-definite symmetric matrix.

f - A vector function (of length k) of the normal (or optimal) state.

b - A diffusion coefficient, a k by k matrix.

mu0 - Mortality at the start of the time period (baseline hazard).

theta - A displacement coefficient of the Gompertz function.

ystart - A vector of length equal to the number of dimensions, defining the starting values of the covariates.

tstart - A number that defines the start time (30 by default).

tend - A number that defines the final time (105 by default).

dt - The time interval between observations.

k - Number of dimensions (1 by default).

This function returns a table with simulated data, as shown in the example below:

library(stpm)
data <- simdata_discr(N=10, ystart=c(75, 94), k=2)
head(data)
##      id xi t1 t2          y1     y1.next          y2     y2.next
## [1,]  1  0 30 31 75.00000000 75.57777849 94.00000000 92.93744262
## [2,]  1  0 31 32 75.57777849 72.37334305 92.93744262 81.42440622
## [3,]  1  0 32 33 72.37334305 66.24409359 81.42440622 75.78225644
## [4,]  1  0 33 34 66.24409359 64.73618361 75.78225644 81.40334435
## [5,]  1  0 34 35 64.73618361 57.31958608 81.40334435 80.47101445
## [6,]  1  0 35 36 57.31958608 49.85973915 80.47101445 73.23008213

Continuous-time

The corresponding function is:

simdata_cont(N=100, a=-0.05, f1=80, Q=2e-07, f=80, b=5, mu0=2e-05, theta=0.08, ystart=80, tstart=30, tend=105, k=1)

Here:

N - Number of individuals.

a - A k by k matrix that characterizes the rate of the adaptive response.

f1 - A particular state, which is a deviation from the normal (or optimal) state. This is a vector of length k.

Q - A k by k non-negative-definite symmetric matrix.

f - A vector function (of length k) of the normal (or optimal) state.

b - A diffusion coefficient, a k by k matrix.

mu0 - Mortality at the start of the time period (baseline hazard).

theta - A displacement coefficient of the Gompertz function.

ystart - A vector of length equal to the number of dimensions, defining the starting values of the covariates.

tstart - A number that defines the start time (30 by default).

tend - A number that defines the final time (105 by default).

k - Number of dimensions (1 by default).

This function returns a table with simulated data, as shown in the example below:

library(stpm)
data <- simdata_cont(N=10)
head(data)
##   id xi          t1          t2          y1     y1.next
## 1  1  0 72.84077415 73.88064023 78.66128757 70.16129802
## 2  1  0 73.88064023 75.38970150 70.16129802 69.19208254
## 3  1  0 75.38970150 78.06740699 69.19208254 69.46322879
## 4  1  0 78.06740699 80.41219847 69.46322879 66.96488916
## 5  1  0 80.41219847 82.39449764 66.96488916 70.84761650
## 6  1  0 82.39449764 84.72271349 70.84761650 63.54851804

Simulation strategies

The R package stpm currently offers continuous- and discrete-time simulations. Below we describe the simulations in detail. The corresponding functions are simdata_cont_MD(...) for continuous-time and simdata_discr_MD(...) for discrete-time simulations.

Continuous-time simulation strategies

Step 1

We model observations from a subject (which can be any system in general). At first, we assume that the subject is alive and compute the starting observation time t1 and the next observation time t2:

t1 <- runif(1, tstart, tend)
t2 <- t1 + 2*runif(1, 0, 1)

Here runif() is a random number generator that returns uniformly distributed values. We assume that t1 is a random value, uniformly distributed between the start time (tstart) and the end time (tend).

Step 2

Computing y1 (an observed variable) from the previous observation:

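if (new_subject) {
  # first record of a subject: draw the starting value
  y1 <- rnorm(1, ystart, sd0)
} else {
  # otherwise continue from the previous observation
  y1 <- y2
}

Here rnorm(...) is a random number generator that returns normally distributed values.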

Step 3

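In order to compute y2, we need to compute the survival function S based on equations 3, 4 and 5. We then compare S to a uniformly distributed random number. Since S is the probability of surviving the interval, if S is smaller than that number we assume that the event has happened (death of the subject or system failure) and start a new subject. Otherwise we proceed to the next record of the same subject. In both cases y2 is drawn from a normal distribution with mean m and variance gamma:

if (S < runif(1, 0, 1)) {
  # event: this is the last record of the subject
  y2 <- rnorm(1, m, sqrt(gamma))
  event <- TRUE
  new_subject <- TRUE
} else {
  # no event: continue with the next record
  y2 <- rnorm(1, m, sqrt(gamma))
  event <- FALSE
  new_subject <- FALSE
  new_record <- TRUE
}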
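Steps 1-3 can be combined into a single loop for one subject. The sketch below is a simplified one-dimensional illustration, not the package's implementation: the function name simulate_subject, the Euler step h, the starting spread sd0, the scaling of Q and mu0 by exp(theta * t), and the rule that an event occurs with probability 1 - S are all assumptions made for this example.

```r
simulate_subject <- function(a = -0.05, f1 = 80, Q = 2e-8, f = 80, b = 5,
                             mu0 = 2e-5, theta = 0.08, ystart = 80, sd0 = 1,
                             tstart = 30, tend = 105, h = 0.05) {
  rows <- list()
  t1 <- runif(1, tstart, tend)             # Step 1: first observation time
  y1 <- rnorm(1, ystart, sd0)              # Step 2: first observed value
  repeat {
    t2 <- t1 + 2 * runif(1, 0, 1)          # Step 1: next observation time
    # Step 3: Euler integration of equations (4) and (5) on [t1, t2],
    # accumulating the hazard from equation (3)
    m <- y1; gamma <- 0; cumhaz <- 0
    n <- max(1, ceiling((t2 - t1) / h))
    step <- (t2 - t1) / n
    for (i in seq_len(n)) {
      tt <- t1 + (i - 1) * step
      Qt <- Q * exp(theta * tt)
      cumhaz <- cumhaz +
        step * (mu0 * exp(theta * tt) + Qt * (m - f)^2 + Qt * gamma)
      dm <- a * (m - f1) - 2 * gamma * Qt * (m - f)
      dg <- 2 * a * gamma + b^2 - 2 * gamma^2 * Qt
      m <- m + step * dm
      gamma <- gamma + step * dg
    }
    S <- exp(-cumhaz)                      # survival probability on [t1, t2]
    event <- runif(1, 0, 1) > S            # event with probability 1 - S
    y2 <- rnorm(1, m, sqrt(gamma))
    rows[[length(rows) + 1]] <- data.frame(xi = as.integer(event), t1 = t1,
                                           t2 = t2, y1 = y1, y1.next = y2)
    if (event || t2 >= tend) break         # stop at the event or at tend
    t1 <- t2                               # Step 2 for the next record
    y1 <- y2
  }
  do.call(rbind, rows)
}

set.seed(42)
subject <- simulate_subject()
head(subject)
```

Calling the function N times and adding an id column would give a table with the same columns as the simdata_cont output shown earlier.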

Discrete-time simulation strategies

In this case we use equal intervals dt between observations, and the survival function S is computed directly from \(\mu\) (2):

\(S = e^{-\mu(t_1)}\)

The rest of the discrete simulation routine is the same as in continuous-time simulation case.
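A minimal one-dimensional sketch of the discrete-time routine for a single subject is shown here. The coefficients are rounded from the pars1 output earlier in the document; the variable names, the fixed starting value, and the rule that an event occurs with probability 1 - S are assumptions of this example rather than the package's code.

```r
set.seed(2)

# Illustrative one-dimensional coefficients (rounded from the pars1 output)
u <- 3.99; R <- 0.95; sigma <- 5
mu0 <- 1.256e-4; b <- -2.933e-6; Q <- 1.8267e-8; theta <- 0.082

t1 <- 30; tend <- 105; dt <- 1
y1 <- 80
records <- NULL

while (t1 < tend) {
  mu <- exp(theta * t1) * (mu0 + b * y1 + Q * y1^2)  # equation (2)
  S  <- exp(-mu)                                     # survival over one interval
  event <- runif(1, 0, 1) > S                        # event with probability 1 - S
  y2 <- u + R * y1 + rnorm(1, 0, sigma)              # equation (1)
  records <- rbind(records,
                   c(xi = as.integer(event), t1 = t1, t2 = t1 + dt,
                     y1 = y1, y1.next = y2))
  if (event) break                                   # last record of the subject
  t1 <- t1 + dt
  y1 <- y2
}
head(records)
```

Every record spans exactly dt, and only the last record of the subject can carry xi = 1.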

References

[1] Woodbury M.A., Manton K.G., Random-Walk of Human Mortality and Aging. Theoretical Population Biology, 1977 11:37-48.

[2] Yashin A.I., Manton K.G., Vaupel J.W. Mortality and aging in a heterogeneous population: a stochastic process model with observed and unobserved variables. Theoretical Population Biology, 1985 27.

[3] Yashin, A.I. et al. Stochastic model for analysis of longitudinal data on aging and mortality. Mathematical Biosciences, 2007 208(2) 538-551.

[4] Akushevich I., Kulminski A., Manton K. Life tables with covariates: Dynamic model for Nonlinear Analysis of Longitudinal Data. Mathematical Population Studies, 2005 12(2):51-80.

[5] Yashin, A. et al. Health decline, aging and mortality: how are they related? Biogerontology, 2007 8(3), 291-302.