Overview

The Stochastic Process Model (SPM) was developed several decades ago [1,2] and has been applied to the analysis of clinical, demographic, and epidemiologic longitudinal data, as well as in many other studies that relate the stochastic dynamics of repeated measures to the probability of end-points (outcomes). SPM links the dynamics of stochastic variables to a hazard rate that is a quadratic function of the state variables [3]. The R package “stpm” is a set of utilities for estimating the parameters of stochastic processes and for modeling survival trajectories and time-to-event outcomes observed in longitudinal studies. It is a general framework for studying and modeling survival (censored) traits that depend on random trajectories (stochastic paths) of variables.

Installation

require(devtools)
devtools::install_github("izhbannikov/stpm")

If you experience errors during installation, please download a binary file from the following url:

Then execute this command (from the R environment):

install.packages("<path to the downloaded r-package stpm>", repos=NULL, type="binary")

Data description

The data are typical longitudinal data in the form of two datasets: a longitudinal dataset (follow-up studies), in which one record represents a single observation, and vital (survival) statistics, in which one record holds all information about a subject. The longitudinal dataset can contain a subject ID (identification number), status (event(1)/no event(0)), time, and measurements of the variables. The package can handle any number of variables, but in practice 5-7 variables are enough.

Below is an example of clinical data that can be used with the package; the fields are described later.

Longitudinal studies:

##   ID IndicatorDeath Age      DBP      BMI
## 1  1              0  30 80.00000 25.00000
## 2  1              0  32 80.51659 26.61245
## 3  1              0  34 77.78412 29.16790
## 4  1              0  36 77.86665 32.40359
## 5  1              0  38 96.55673 31.92014
## 6  1              0  40 94.48616 32.89139

Vital statistics:

##   ID IsDead   LSmort
## 1  1      1 85.34578
## 2  2      1 80.55053
## 3  3      1 98.07315
## 4  4      1 81.29779
## 5  5      1 89.89829
## 6  6      1 72.47687

Data fields description

Longitudinal studies
  • ID - unique identification number of the subject.
  • IndicatorDeath - 0/1, indicates the death of a subject.
  • Age - current age of the subject.
  • AgeNext - age of the subject at the next survey/exam.
  • DBP, BMI - covariates; here “DBP” is diastolic blood pressure and “BMI” is body-mass index.
Survival statistics
  • ID - unique ID of the subject.
  • IsDead - death indicator, 0 - alive, 1 - dead.
  • LSmort - age at death or at the end of observation.
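
As a minimal sketch in base R (not part of stpm), the two tables above could be held as data frames, with AgeNext derived from consecutive ages of the same subject. The object names `longdat`, `vitstat` and `merged` are hypothetical, and leaving the last AgeNext of a subject as NA is an assumption made for illustration.

```r
# Longitudinal records copied from the example above (subject 1 only)
longdat <- data.frame(
  ID = rep(1, 6),
  IndicatorDeath = rep(0, 6),
  Age = c(30, 32, 34, 36, 38, 40),
  DBP = c(80.00000, 80.51659, 77.78412, 77.86665, 96.55673, 94.48616),
  BMI = c(25.00000, 26.61245, 29.16790, 32.40359, 31.92014, 32.89139)
)

# Vital statistics for the same subject (first row of the table above)
vitstat <- data.frame(ID = 1, IsDead = 1, LSmort = 85.34578)

# AgeNext: age at the following exam of the same subject
# (assumption: NA for the last record of each subject)
longdat$AgeNext <- ave(longdat$Age, longdat$ID,
                       FUN = function(x) c(x[-1], NA))

# Attach the subject-level fields to every observation
merged <- merge(longdat, vitstat, by = "ID")
head(merged)
```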

Discrete- and Continuous-time models

There are two main SPM types in the package: a discrete-time model [4] and a continuous-time model [3]. The discrete model assumes equal intervals between follow-up observations. An example of a discrete-time dataset is given below.

library(stpm)
data <- simdata_discr(N=10, ystart=80)
head(data)
##      id xi t1 t2       y1  y1.next
## [1,]  1  0 30 31 80.00000 77.97124
## [2,]  1  0 31 32 77.97124 72.75948
## [3,]  1  0 32 33 72.75948 78.30266
## [4,]  1  0 33 34 78.30266 83.51840
## [5,]  1  0 34 35 83.51840 83.69369
## [6,]  1  0 35 36 83.69369 79.87638

In this case the intervals between t1 and t2 (Age and Age.next) are equal.

In the continuous case, by contrast, the intervals between observations are not equal. An example of a continuous-time dataset is shown below:

library(stpm)
data <- simdata_cont2(N=5,ystart = 50)
head(data)
##      id xi       t1       t2       y1  y1.next
## [1,]  0  0 37.91828 39.04887 50.99913 51.12754
## [2,]  0  0 39.04887 40.66078 51.12754 55.65004
## [3,]  0  0 40.66078 41.75316 55.65004 63.54484
## [4,]  0  0 41.75316 43.71354 63.54484 67.56044
## [5,]  0  0 43.71354 45.69289 67.56044 60.13357
## [6,]  0  0 45.69289 46.74288 60.13357 67.28961
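
The difference between the two layouts can be checked directly from the t1 and t2 columns. The helper below is a hypothetical base R sketch (it is not an stpm function); the times are copied from the two printed examples.

```r
# Hypothetical helper: TRUE if all observation intervals have the same length
equal_intervals <- function(t1, t2, tol = 1e-8) {
  dt <- t2 - t1
  all(abs(dt - dt[1]) < tol)
}

# Times copied from the discrete-time example
equal_intervals(t1 = 30:35, t2 = 31:36)

# Times copied from the continuous-time example
t1 <- c(37.91828, 39.04887, 40.66078, 41.75316, 43.71354, 45.69289)
t2 <- c(39.04887, 40.66078, 41.75316, 43.71354, 45.69289, 46.74288)
equal_intervals(t1, t2)
```

The first call returns TRUE and the second FALSE, so a check of this kind can help decide which of the two models fits a given dataset.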

Discrete model

In the discrete model, we use the following assumptions:

\[ \bar{y}(t+1) = \bar{u} + \bar{R} \times \bar{y}(t) + \bar{\epsilon} \] (1)

\[ \mu(t) = \mu_0(t) + \bar{b}(t) \times \bar{y}(t) + \bar{Q}(t) \times \bar{y}(t)^2 \] (2)

Where:

\[ \mu_0(t) = \mu_0 e^{\theta t} \]

\[ \bar{b}(t) = \bar{b} e^{\theta t} \]

\[ \bar{Q}(t) = \bar{Q} e^{\theta t} \]
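
To make equations (1) and (2) concrete, here is a one-dimensional base R sketch. It is only an illustration, not the package's implementation: the parameter values are rounded from the $Ak2005 estimates printed in the example that follows, the function names `next_y` and `hazard` are hypothetical, and the noise term is assumed to be Gaussian with standard deviation `sigma`.

```r
set.seed(1)

# Rounded from the $Ak2005 estimates in this document
u <- 4.07; R <- 0.95; sigma <- 5.03
mu0 <- 8.37e-5; b <- -2.07e-6; Q <- 1.39e-8; theta <- 0.083

# Equation (1): next value of the covariate (Gaussian noise is an assumption)
next_y <- function(y) u + R * y + rnorm(1, mean = 0, sd = sigma)

# Equation (2): hazard at age t for covariate value y,
# with mu0, b and Q all scaled by exp(theta * t)
hazard <- function(t, y) exp(theta * t) * (mu0 + b * y + Q * y^2)

# Follow one trajectory for a few steps and print the hazard along it
y <- 80
for (t in 30:34) {
  cat(sprintf("t = %d, y = %.2f, mu = %.3e\n", t, y, hazard(t, y)))
  y <- next_y(y)
}
```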

Example

library(stpm)
data <- simdata_discr(N=200)
#Parameters estimation
pars <- spm_discrete(data)
pars
## $Ak2005
## $Ak2005$theta
## [1] 0.083
## 
## $Ak2005$mu0
## [1] 8.369318058e-05
## 
## $Ak2005$b
## [1] -2.069622929e-06
## 
## $Ak2005$Q
##                [,1]
## [1,] 1.39043211e-08
## 
## $Ak2005$u
## [1] 4.0695147
## 
## $Ak2005$R
## [1] 0.9497498373
## 
## $Ak2005$Sigma
## [1] 5.033107967
## 
## 
## $Ya2007
## $Ya2007$a
##                [,1]
## [1,] -0.05025016273
## 
## $Ya2007$f1
##            [,1]
## [1,] 80.9851049
## 
## $Ya2007$Q
##                [,1]
## [1,] 1.39043211e-08
## 
## $Ya2007$f
##             [,1]
## [1,] 74.42373178
## 
## $Ya2007$b
##             [,1]
## [1,] 5.033107967
## 
## $Ya2007$mu0
##                 [,1]
## [1,] 6.678649707e-06
## 
## $Ya2007$theta
## [1] 0.083
## 
## 
## attr(,"class")
## [1] "spm.discrete"

Continuous model

\[ \mu(u) = \mu_0(u) + (\bar{m}(u) - \bar{f}(u))^* \times \bar{Q}(u) \times (\bar{m}(u) - \bar{f}(u)) + Tr(\bar{Q}(u) \times \bar{\gamma}(u)) \] (3)

\[ d\bar{m}(t)/dt = \bar{a}(t) \times (\bar{m}(t) - \bar{f_1}(t)) - 2 \bar{\gamma}(t) \times \bar{Q}(t) \times (\bar{m}(t) - \bar{f}(t)) \] (4)

\[ d\bar{\gamma}(t)/dt = \bar{a}(t) \times \bar{\gamma}(t) + \bar{\gamma}(t) \times \bar{a}(t)^* + \bar{b}(t) \times \bar{b}(t)^* - 2 \bar{\gamma}(t) \times \bar{Q}(t) \times \bar{\gamma}(t) \] (5)
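
In one dimension, equations (3)-(5) reduce to scalar ordinary differential equations that can be integrated numerically. The base R sketch below uses a plain Euler scheme and is not the solver used by stpm. The function name `surv_interval`, the starting values (m equal to the observed value, gamma equal to zero) and the scaling of mu0 and Q by exp(theta * t) are assumptions of this sketch; the default parameter values are the starting values passed to spm_continuous in the example that follows.

```r
# Hypothetical one-dimensional Euler integration of equations (3)-(5)
surv_interval <- function(t1, t2, y1, a = -0.05, f1 = 80, Q = 2e-8, f = 80,
                          b = 5, mu0 = 2e-5, theta = 0.08, nsteps = 1000) {
  h <- (t2 - t1) / nsteps
  m <- y1        # assumption: m starts at the observed value
  gamma <- 0     # assumption: gamma starts at zero
  cumhaz <- 0    # integral of mu over [t1, t2]
  t <- t1
  for (i in seq_len(nsteps)) {
    Qt <- Q * exp(theta * t)
    mu <- mu0 * exp(theta * t) + Qt * (m - f)^2 + Qt * gamma  # eq. (3)
    dm <- a * (m - f1) - 2 * gamma * Qt * (m - f)             # eq. (4)
    dgamma <- 2 * a * gamma + b^2 - 2 * gamma^2 * Qt          # eq. (5)
    cumhaz <- cumhaz + mu * h
    m <- m + dm * h
    gamma <- gamma + dgamma * h
    t <- t + h
  }
  list(m = m, gamma = gamma, S = exp(-cumhaz))
}

res <- surv_interval(t1 = 35.69, t2 = 37.08, y1 = 80.39)
res
```

The returned S is the probability of surviving from t1 to t2 under these assumptions; m and gamma are the values of the two state equations at t2.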

Example

library(stpm)
#Reading the data:
data <- simdata_cont2(N=100)
head(data)
##      id xi          t1          t2          y1      y1.next
## [1,]  0  0 35.69370874 37.07987729 80.39139407  83.11449873
## [2,]  0  0 37.07987729 38.25836416 83.11449873  83.27321760
## [3,]  0  0 38.25836416 39.40633274 83.27321760  89.73567427
## [4,]  0  0 39.40633274 40.97531492 89.73567427  94.70807684
## [5,]  0  0 40.97531492 42.72154746 94.70807684  96.47615846
## [6,]  0  0 42.72154746 44.63677688 96.47615846 100.89649668
#Parameters estimation:
pars <- spm_continuous(dat=data,a=-0.05, f1=80, 
                       Q=2e-8, f=80, b=5, mu0=2e-5, theta=0.08)
## Parameter theta achieved lower/upper bound.
## 0.072
pars
## $a
##                [,1]
## [1,] -0.05453693777
## 
## $f1
##             [,1]
## [1,] 79.39475939
## 
## $Q
##                 [,1]
## [1,] 2.160177469e-08
## 
## $f
##             [,1]
## [1,] 83.73782074
## 
## $b
##             [,1]
## [1,] 5.025732121
## 
## $mu0
## [1] 1.839933512e-05
## 
## $theta
## [1] 0.072
## 
## $limit
## [1] TRUE
## 
## attr(,"class")
## [1] "spm.continuous"

Coefficient conversion between continuous- and discrete-time models

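The coefficients of the continuous-time model (left-hand sides) are obtained from those of the discrete-time model (right-hand sides) as follows:

\[ \bar{Q} = \bar{Q} \]

\[ \bar{a} = \bar{R} - diag(k) \]

\[ \bar{b} = \bar{\epsilon} \]

\[ \bar{f_1} = -1 \times \bar{u} \times \bar{a}^{-1} \]

\[ \bar{f} = -0.5 \times \bar{b} \times \bar{Q}^{-1} \]

\[ \mu_0 = \mu_0 - \bar{f} \times \bar{Q} \times t(\bar{f}) \]

\[ \theta = \theta \]

Here \[k\] is the number of variables (covariates), which is equal to the model’s dimension, and diag(k) is the k by k identity matrix. In the equation for \[\bar{f}\], the \[\bar{b}\] on the right-hand side is the coefficient of the discrete-time hazard (2).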
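
As a worked check of these formulas, the base R sketch below applies them to the $Ak2005 estimates printed in the discrete-model example above. The list name `d` and the other object names are hypothetical; only base R functions (`diag`, `solve`, `%*%`, `t`) are used.

```r
# Discrete-time estimates copied from the $Ak2005 output above
d <- list(theta = 0.083, mu0 = 8.369318058e-05, b = -2.069622929e-06,
          Q = 1.39043211e-08, u = 4.0695147, R = 0.9497498373,
          Sigma = 5.033107967)

k <- 1                                  # model dimension
Q <- matrix(d$Q, k, k)                  # Q is unchanged
a <- d$R - diag(k)                      # a = R - I
f1 <- -1 * d$u %*% solve(a)             # f1 = -u a^{-1}
f <- -0.5 * d$b %*% solve(Q)            # f = -0.5 b Q^{-1} (discrete-time b)
mu0 <- d$mu0 - f %*% Q %*% t(f)         # mu0 = mu0 - f Q t(f)
b <- d$Sigma                            # diffusion coefficient
theta <- d$theta                        # theta is unchanged

c(a = a, f1 = f1, f = f, mu0 = mu0, b = b, theta = theta)
```

The resulting values can be compared with the $Ya2007 block of the same output, which they reproduce up to the rounding of the printed inputs.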

Model with time-dependent coefficients

In the previous models, we assumed that the coefficients depend on time only through the factor \[e^{\theta t}\] by which they are multiplied. In general, this may not be the case [5]. We extend the model to the general case in which a coefficient can be an arbitrary function of time; for example (considering the one-dimensional case):

\[ \bar{a}(t) = par_1 t + par_2 \] - a linear function.

The corresponding equations will be equivalent to one-dimensional continuous case described above.

Example

library(stpm)
#Data preparation:
n <- 500
data <- simdata_time_dep(N=n)
# Estimation:
opt.par <- spm_time_dep(data, 
                        start = list(a = -0.05, f1 = 80, Q = 2e-08, f = 80, b = 5, mu0 = 0.001), 
                        f = list(at = "a", f1t = "f1", Qt = "Q", ft = "f", bt = "b", mu0t= "mu0"))
opt.par
## [[1]]
## [[1]]$a
## [1] -0.03904617932
## 
## [[1]]$f1
## [1] 79.36413946
## 
## [[1]]$Q
## [1] 2.22664768e-08
## 
## [[1]]$f
## [1] 100
## 
## [[1]]$b
## [1] 3.750239397
## 
## [[1]]$mu0
## [1] 0.001249916021

Simulation (trajectory projection)

We added one- and multi-dimensional simulation to be able to generate test data for hypothesis testing. The simulated data can be discrete (equal intervals between observations) or continuous (arbitrary intervals).

Discrete-time

The corresponding function is (\[k\] is the number of variables (covariates), equal to the model’s dimension):

simdata_discr(N=100, a=-0.05, f1=80, Q=2e-8, f=80, b=5, mu0=1e-5, theta=0.08, ystart=80, tstart=30, tend=105, dt=1)

Here:

N - Number of individuals

a - A k by k matrix that characterizes the rate of the adaptive response

f1 - A particular state, which is a deviation from the normal (or optimal) state. This is a vector of length k

Q - A k by k non-negative-definite symmetric matrix

f - A vector function (of length k) of the normal (or optimal) state

b - A diffusion coefficient, a k by k matrix

mu0 - Mortality at the start of the time period (baseline hazard)

theta - A displacement coefficient of the Gompertz function

ystart - A vector of length equal to the number of dimensions used; defines the starting values of the covariates

tstart - A number that defines the start time (30 by default)

tend - A number that defines the final time (105 by default)

dt - The time interval between observations

This function returns a table with simulated data, as shown in the example below:

library(stpm)
data <- simdata_discr(N=10, ystart=75)
head(data)
##      id xi t1 t2          y1     y1.next
## [1,]  1  0 30 31 75.00000000 67.86573623
## [2,]  1  0 31 32 67.86573623 73.07974659
## [3,]  1  0 32 33 73.07974659 73.75880023
## [4,]  1  0 33 34 73.75880023 75.95539487
## [5,]  1  0 34 35 75.95539487 77.53082484
## [6,]  1  0 35 36 77.53082484 75.35931813

Continuous-time

The corresponding function is (\[k\] is the number of variables (covariates), equal to the model’s dimension):

simdata_cont2(N=100, a=-0.05, f1=80, Q=2e-07, f=80, b=5, mu0=2e-05, theta=0.08, ystart=80, tstart=30, tend=105)

Here:

N - Number of individuals

a - A k by k matrix that characterizes the rate of the adaptive response

f1 - A particular state, which is a deviation from the normal (or optimal) state. This is a vector of length k

Q - A k by k non-negative-definite symmetric matrix

f - A vector function (of length k) of the normal (or optimal) state

b - A diffusion coefficient, a k by k matrix

mu0 - Mortality at the start of the time period (baseline hazard)

theta - A displacement coefficient of the Gompertz function

ystart - A vector of length equal to the number of dimensions used; defines the starting values of the covariates

tstart - A number that defines the start time (30 by default)

tend - A number that defines the final time (105 by default)

This function returns a table with simulated data, as shown in the example below:

library(stpm)
data <- simdata_cont(N=10)
head(data)
##   id xi          t1          t2          y1     y1.next
## 1  1  0 84.19505282 85.19873357 79.91748242 75.59100104
## 2  1  0 85.19873357 86.98758681 75.59100104 72.91514416
## 3  1  0 86.98758681 88.94467380 72.91514416 83.63339466
## 4  1  0 88.94467380 89.98709535 83.63339466 85.38250169
## 5  1  0 89.98709535 91.61485582 85.38250169 87.87385700
## 6  1  1 91.61485582 92.78031357 87.87385700          NA

Simulation strategies

The R package stpm currently offers continuous- and discrete-time simulations. Below we describe the simulations in detail. The corresponding functions are simdata_cont_MD(...) for continuous-time and simdata_discr_MD(...) for discrete-time simulations.

Continuous-time simulation strategies

Step 1

We model observations from a subject (which can be any system in general). At first, we assume that the subject is alive and compute the starting observation time t1 and the next observation time t2:

t1 <- runif(1, tstart, tend)
t2 <- t1 + 2*runif(1, 0, 1)

Here runif() is a random number generator that returns uniformly distributed values. We assume that t1 is a random value, uniformly distributed between the start time (tstart) and the end time (tend).

Step 2

Computing y1 (an observed variable) from the previous observation:

if (!event) {
  y1 <- rnorm(1, ystart, sd0)
} else {
  y1 <- y2
}

Here rnorm(...) is a random number generator which returns normally distributed values.

Step 3

In order to compute y2, we need to compute a survival function S based on equations 3, 4 and 5. We then compare S to a uniformly distributed random number. Since S is the probability of surviving the interval, if S is smaller than that number, we assume that the event has happened (death of the subject or system failure). Otherwise we compute y2 and proceed to the next iteration:

if (S < runif(1, 0, 1)) {
  # Event: death of the subject or system failure
  y2 <- rnorm(1, m, sqrt(gamma))
  event <- TRUE
  new_subject <- TRUE
} else {
  # No event: keep the subject and add the next record
  y2 <- rnorm(1, m, sqrt(gamma))
  event <- FALSE
  new_record <- TRUE
}

Discrete-time simulation strategies

In this case we use equal intervals dt between observations, and the survival function S is computed directly from \(\mu\) (2):

\(S = e^{-\mu(t_1)}\)

The rest of the discrete simulation routine is the same as in continuous-time simulation case.
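
The discrete-time routine can be sketched in a few lines of base R. This is a hypothetical one-dimensional illustration, not the code of simdata_discr: the function name `sim_discrete` is made up, the parameters are rounded from the $Ak2005 estimates shown earlier, dt is fixed at 1, the noise is assumed Gaussian, and S is treated as the probability of surviving one step, so an event is drawn when a uniform random number exceeds S.

```r
set.seed(123)

# Hypothetical one-dimensional discrete-time simulator (dt = 1)
sim_discrete <- function(N, ystart = 80, tstart = 30, tend = 105,
                         u = 4.07, R = 0.95, sigma = 5.03,
                         mu0 = 8.37e-5, b = -2.07e-6, Q = 1.39e-8,
                         theta = 0.083) {
  rows <- list()
  for (id in seq_len(N)) {
    y1 <- ystart
    for (t1 in seq(tstart, tend - 1)) {
      mu <- exp(theta * t1) * (mu0 + b * y1 + Q * y1^2)  # equation (2)
      S <- exp(-mu)                                      # survival over one step
      if (S < runif(1, 0, 1)) {
        # Event: write the last record and move on to the next subject
        rows[[length(rows) + 1]] <- c(id, 1, t1, t1 + 1, y1, NA)
        break
      }
      y2 <- u + R * y1 + rnorm(1, 0, sigma)              # equation (1)
      rows[[length(rows) + 1]] <- c(id, 0, t1, t1 + 1, y1, y2)
      y1 <- y2
    }
  }
  out <- do.call(rbind, rows)
  colnames(out) <- c("id", "xi", "t1", "t2", "y1", "y1.next")
  out
}

sim <- sim_discrete(N = 5)
head(sim)
```

The output has the same columns as the tables returned by the simulation functions above, which makes it easy to compare a hand-written sketch with the package's own output.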

References

[1] Woodbury M.A., Manton K.G., Random-Walk of Human Mortality and Aging. Theoretical Population Biology, 1977 11:37-48.

[2] Yashin, A.I., Manton K.G., Vaupel J.W. Mortality and aging in a heterogeneous population: a stochastic process model with observed and unobserved variables. Theor Pop Biology, 1985 27.

[3] Yashin, A.I. et al. Stochastic model for analysis of longitudinal data on aging and mortality. Mathematical Biosciences, 2007 208(2) 538-551.

[4] Akushevich I., Kulminski A. and Manton K.: Life tables with covariates: Dynamic model for Nonlinear Analysis of Longitudinal Data. 2005. Mathematical Population Studies, 12(2), pp.: 51-80.

[5] Yashin, A. et al. Health decline, aging and mortality: how are they related? Biogerontology, 2007 8(3), 291-302.