The Stochastic Process Model (SPM) was developed several decades ago [1,2] and has been applied to the analysis of clinical, demographic, and epidemiologic longitudinal data, as well as in many other studies that relate the stochastic dynamics of repeated measures to the probability of end-points (outcomes). SPM links the dynamics of stochastic variables to a hazard rate that is a quadratic function of the state variables [3]. The R package “stpm” is a set of utilities for estimating the parameters of stochastic processes and for modeling survival trajectories and time-to-event outcomes observed in longitudinal studies. It is a general framework for studying and modeling survival (censored) traits that depend on random trajectories (stochastic paths) of variables.
require(devtools)
devtools::install_github("izhbannikov/stpm")
If you experience errors during installation, please download a binary file from the following url:
Then execute this command from the R environment:
install.packages("<path to the downloaded r-package stpm>", repos=NULL, type="binary")
The data are typical longitudinal data in the form of two datasets: a longitudinal dataset (follow-up studies), in which one record represents a single observation, and vital (survival) statistics, in which one record represents all information about a subject. The longitudinal dataset can contain a subject ID (identification number), status (event (1) / no event (0)), time, and measurements of the variables. The package can handle an arbitrary number of variables, but in practice 5-7 variables are enough.
Below is an example of clinical data that can be used in stpm; the fields are discussed later. Longitudinal studies:
## ID IndicatorDeath Age DBP BMI
## 1 1 0 30 80.00000 25.00000
## 2 1 0 32 80.51659 26.61245
## 3 1 0 34 77.78412 29.16790
## 4 1 0 36 77.86665 32.40359
## 5 1 0 38 96.55673 31.92014
## 6 1 0 40 94.48616 32.89139
Vital statistics:
## ID IsDead LSmort
## 1 1 1 85.34578
## 2 2 1 80.55053
## 3 3 1 98.07315
## 4 4 1 81.29779
## 5 5 1 89.89829
## 6 6 1 72.47687
There are two main SPM types in the package: a discrete-time model [4] and a continuous-time model [3]. The discrete model assumes equal intervals between follow-up observations. An example of a discrete dataset is given below.
library(stpm)
data <- simdata_discr(N=10, ystart=c(80), k=1)
head(data)
## id xi t1 t2 y1 y1.next
## [1,] 1 0 30 31 80.00000 83.11505
## [2,] 1 0 31 32 83.11505 82.44254
## [3,] 1 0 32 33 82.44254 84.26288
## [4,] 1 0 33 34 84.26288 74.32822
## [5,] 1 0 34 35 74.32822 66.50182
## [6,] 1 0 35 36 66.50182 68.22564
In this case the intervals between t1 and t2 (the current and the next observation time) are equal.
The opposite is the continuous case, in which the intervals between observations are not equal. An example of a continuous-case dataset is shown below:
library(stpm)
data <- simdata_cont(N=5,ystart = c(50))
head(data)
## id xi t1 t2 y1 y1.next
## 1 1 0 34.64537 35.88450 48.93623 50.04893
## 2 1 0 35.88450 37.52373 50.04893 43.54601
## 3 1 0 37.52373 40.50541 43.54601 43.71541
## 4 1 0 40.50541 43.31533 43.71541 41.48778
## 5 1 0 43.31533 44.97644 41.48778 44.79841
## 6 1 0 44.97644 46.32351 44.79841 37.87642
In the discrete model, we use the following assumptions:
\[ \bar{y}(t+1) = \bar{u} + \bar{R} \times \bar{y}(t) + \bar{\epsilon} \] (1)
\[ \mu(t) = \mu_0(t) + \bar{b}(t) \times \bar{y}(t) + \bar{Q}(t) \times \bar{y}(t)^2 \] (2)
where:
\[ \mu_0(t) = \mu_0 e^{\theta t} \] \[ \bar{b}(t) = \bar{b} e^{\theta t} \] \[ \bar{Q}(t) = \bar{Q} e^{\theta t} \]
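As a rough illustration of equations (1) and (2) in one dimension, the base-R sketch below evaluates the hazard and draws a single step of the state. The function names hazard_discr and step_discr and all parameter values are chosen here for illustration only and are not part of stpm. The noise term is assumed to be Gaussian with standard deviation sigma.

```r
# Hazard of equation (2) for a one-dimensional state y at time t.
# All three coefficients share the Gompertz factor exp(theta * t).
hazard_discr <- function(t, y, mu0, b, Q, theta) {
  (mu0 + b * y + Q * y^2) * exp(theta * t)
}

# One step of equation (1); Gaussian noise is an assumption of this sketch.
step_discr <- function(y, u, R, sigma) {
  u + R * y + rnorm(1, mean = 0, sd = sigma)
}

# Illustrative values only (hypothetical, not estimated from real data).
mu0 <- 1.3e-4; b <- -3e-6; Q <- 2e-8; theta <- 0.08
u <- 4; R <- 0.95; sigma <- 5

set.seed(1)
y_now  <- 80
y_next <- step_discr(y_now, u, R, sigma)                  # next value of the state
mu_now <- hazard_discr(t = 30, y = y_now, mu0, b, Q, theta)

# For a fixed t, the quadratic in y is smallest at y = -b / (2 * Q).
y_min <- -b / (2 * Q)
```

With a negative b and a positive Q the hazard is U-shaped in y, so values of the state far from y_min in either direction increase the risk.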
library(stpm)
data <- simdata_discr(N=2000)
#Parameters estimation
pars <- spm_discrete(data)
pars
## $pars1
## $pars1$theta
## [1] 0.082
##
## $pars1$mu0
## [1] 0.0001255953378
##
## $pars1$b
## [1] -2.933038626e-06
##
## $pars1$Q
## [,1]
## [1,] 1.826654965e-08
##
## $pars1$u
## [1] 3.991961635
##
## $pars1$R
## [1] 0.9501822089
##
## $pars1$Sigma
## [1] 4.997780501
##
##
## $pars2
## $pars2$a
## [,1]
## [1,] -0.04981779111
##
## $pars2$f1
## [,1]
## [1,] 80.13124522
##
## $pars2$Q
## [,1]
## [1,] 1.826654965e-08
##
## $pars2$f
## [,1]
## [1,] 80.28441829
##
## $pars2$b
## [,1]
## [1,] 4.997780501
##
## $pars2$mu0
## [,1]
## [1,] 7.856687818e-06
##
## $pars2$theta
## [1] 0.082
In the continuous-time model, we use the following equations:
\[ \mu(u) = \mu_0(u) + (\bar{m}(u) - \bar{f}(u))^* \times \bar{Q}(u) \times (\bar{m}(u) - \bar{f}(u)) + Tr(\bar{Q}(u) \times \bar{\gamma}(u)) \] (3)
\[ d\bar{m}(t)/dt = \bar{a}(t) \times (\bar{m}(t) - \bar{f_1}(t)) - 2 \bar{\gamma}(t) \times \bar{Q}(t) \times (\bar{m}(t) - \bar{f}(t)) \] (4)
\[ d\bar{\gamma}(t)/dt = \bar{a}(t) \times \bar{\gamma}(t) + \bar{\gamma}(t) \times \bar{a}(t)^* + \bar{b}(t) \times \bar{b}(t)^* - 2 \bar{\gamma}(t) \times \bar{Q}(t) \times \bar{\gamma}(t) \] (5)
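To make equations (3), (4) and (5) more concrete, here is a minimal one-dimensional sketch that integrates them with a plain Euler scheme in base R. It is not the package's implementation. The function name surv_interval is hypothetical, and the sketch assumes that mu0 and Q carry the Gompertz factor exp(theta * t) while a, f1, f and b are constant, that m starts at the observed value with gamma equal to zero, and that survival over the interval is exp of minus the integrated hazard.

```r
# Euler integration of equations (4) and (5) in one dimension, accumulating
# the hazard of equation (3) to approximate survival over [t1, t2].
surv_interval <- function(t1, t2, y1, a, f1, Q, f, b, mu0, theta,
                          nsteps = 1000) {
  h <- (t2 - t1) / nsteps
  m <- y1        # m starts at the observed value (assumption)
  gamma <- 0     # gamma starts at zero (assumption)
  cumhaz <- 0
  t <- t1
  for (i in seq_len(nsteps)) {
    Qt   <- Q * exp(theta * t)      # assumed Gompertz scaling of Q
    mu0t <- mu0 * exp(theta * t)    # assumed Gompertz scaling of mu0
    mu <- mu0t + Qt * (m - f)^2 + Qt * gamma        # equation (3)
    dm <- a * (m - f1) - 2 * gamma * Qt * (m - f)   # equation (4)
    dg <- 2 * a * gamma + b^2 - 2 * gamma^2 * Qt    # equation (5)
    cumhaz <- cumhaz + mu * h
    m <- m + dm * h
    gamma <- gamma + dg * h
    t <- t + h
  }
  list(m = m, gamma = gamma, S = exp(-cumhaz))
}

# Illustrative call with the parameter values used elsewhere in this text.
res <- surv_interval(t1 = 50, t2 = 52, y1 = 82,
                     a = -0.05, f1 = 80, Q = 2e-8, f = 80, b = 5,
                     mu0 = 2e-5, theta = 0.08)
```

With a negative a, m relaxes toward f1 while gamma grows from zero, which is the behaviour the two differential equations describe.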
library(stpm)
#Reading the data:
data <- simdata_cont(N=100)
head(data)
## id xi t1 t2 y1 y1.next
## 1 1 0 48.71323423 51.61380979 81.56853722 81.38800629
## 2 1 0 51.61380979 54.48693801 81.38800629 82.69331064
## 3 1 0 54.48693801 56.67844415 82.69331064 78.73080120
## 4 1 0 56.67844415 58.56543745 78.73080120 79.64379327
## 5 1 0 58.56543745 61.16421601 79.64379327 74.06096143
## 6 1 0 61.16421601 62.93750423 74.06096143 81.39526528
#Parameters estimation:
pars <- spm_continuous(dat=data[,2:6],a=-0.05, f1=80,
Q=2e-8, f=80, b=5, mu0=2e-5, theta=0.08, k = 1)
pars
## $a
## [,1]
## [1,] -0.05
##
## $f1
## [,1]
## [1,] 80
##
## $Q
## [,1]
## [1,] 2.83213213e-08
##
## $f
## [,1]
## [1,] 80
##
## $b
## [,1]
## [1,] 5
##
## $mu0
## [1] 2.000000614e-05
##
## $theta
## [1] 0.08000000001
##
## $limit
## [1] FALSE
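The parameters of the discrete-time model (right-hand sides) are converted to those of the continuous-time model (left-hand sides) as follows:
\[ \bar{Q} = \bar{Q} \] \[ \bar{a} = \bar{R} - diag(k) \] \[ \bar{b} = \bar{\epsilon} \] \[ \bar{f_1} = -1 \times \bar{u} \times \bar{a}^{-1} \] \[ \bar{f} = -0.5 \times \bar{b} \times \bar{Q}^{-1} \] \[ \mu_0 = \mu_0 - \bar{f} \times \bar{Q} \times \bar{f}^* \] \[ \theta = \theta \]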
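These conversion formulas can be checked against the spm_discrete output shown earlier. The sketch below copies the one-dimensional pars1 values by hand and recomputes the pars2 values in base R. The variable names are chosen here for readability. Using Sigma as the diffusion coefficient b is read off from that output.

```r
# Discrete-time estimates copied from the pars1 output (one dimension).
u <- 3.991961635; R <- 0.9501822089; Sigma <- 4.997780501
mu0_d <- 0.0001255953378; b_d <- -2.933038626e-06; Q <- 1.826654965e-08

# Conversion to the continuous-time parameterization (scalar case).
a   <- R - 1               # a = R - diag(k) with k = 1
f1  <- -u / a              # f1 = -u * a^{-1}
f   <- -0.5 * b_d / Q      # f = -0.5 * b * Q^{-1}
mu0 <- mu0_d - f * Q * f   # mu0 = mu0 - f * Q * f^*
b   <- Sigma               # diffusion coefficient, as in pars2$b

# Compare with the printed pars2 values.
print(c(a = a, f1 = f1, f = f, mu0 = mu0, b = b), digits = 10)
```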
In the previous models, we assumed that the coefficients depend on time in a specific way: they are multiplied by \[e^{\theta t}\]. In general, this may not be the case [5]. We extend this to a general case, e.g. (considering the one-dimensional case):
\[ \bar{a}(t) = par_1 t + par_2 \] (a linear function).
The corresponding equations are equivalent to those of the one-dimensional continuous case described above.
library(stpm)
#Data preparation:
n <- 500
data <- simdata_time_dep(N=n)
# Estimation:
opt.par <- spm_time_dep(data[,2:6],
start = list(a = -0.05, f1 = 80, Q = 2e-08, f = 80, b = 5, mu0 = 0.001),
f = list(at = "a", f1t = "f1", Qt = "Q", ft = "f", bt = "b", mu0t= "mu0"))
opt.par
## [[1]]
## [[1]]$a
## [1] -0.04653639696
##
## [[1]]$f1
## [1] 79.0433917
##
## [[1]]$Q
## [1] 1.729348342e-08
##
## [[1]]$f
## [1] 99.51038054
##
## [[1]]$b
## [1] 3.75
##
## [[1]]$mu0
## [1] 0.001249983971
We added one- and multi-dimensional simulation in order to generate test data for hypothesis testing. The simulated data can be discrete (equal intervals between observations) or continuous (arbitrary intervals).
The corresponding function is:
simdata_discr(N=100, a=-0.05, f1=80, Q=2e-8, f=80, b=5, mu0=1e-5, theta=0.08, ystart=80, tstart=30, tend=105, dt=1, k=1)
Here:
N - Number of individuals
a - A k by k matrix, which characterizes the rate of the adaptive response
f1 - A particular state, which is a deviation from the normal (or optimal) state. This is a vector of length k
Q - A k by k matrix, which is a non-negative-definite symmetric matrix
f - A vector-function (of length k) of the normal (or optimal) state
b - A diffusion coefficient, a k by k matrix
mu0 - Mortality at the start of the period (baseline hazard)
theta - A displacement coefficient of the Gompertz function
ystart - A vector of length equal to the number of dimensions used; defines the starting values of the covariates
tstart - A number that defines the start time (30 by default)
tend - A number that defines the final time (105 by default)
dt - A time interval between observations
k - Number of dimensions (1 by default)
This function returns a table with simulated data, as shown in the example below:
library(stpm)
data <- simdata_discr(N=10, ystart=c(75, 94), k=2)
head(data)
## id xi t1 t2 y1 y1.next y2 y2.next
## [1,] 1 0 30 31 75.00000000 75.57777849 94.00000000 92.93744262
## [2,] 1 0 31 32 75.57777849 72.37334305 92.93744262 81.42440622
## [3,] 1 0 32 33 72.37334305 66.24409359 81.42440622 75.78225644
## [4,] 1 0 33 34 66.24409359 64.73618361 75.78225644 81.40334435
## [5,] 1 0 34 35 64.73618361 57.31958608 81.40334435 80.47101445
## [6,] 1 0 35 36 57.31958608 49.85973915 80.47101445 73.23008213
The corresponding function is:
simdata_cont(N=100, a=-0.05, f1=80, Q=2e-07, f=80, b=5, mu0=2e-05, theta=0.08, ystart=80, tstart=30, tend=105, k=1)
Here:
N - Number of individuals
a - A k by k matrix, which characterizes the rate of the adaptive response
f1 - A particular state, which is a deviation from the normal (or optimal) state. This is a vector of length k
Q - A k by k matrix, which is a non-negative-definite symmetric matrix
f - A vector-function (of length k) of the normal (or optimal) state
b - A diffusion coefficient, a k by k matrix
mu0 - Mortality at the start of the period (baseline hazard)
theta - A displacement coefficient of the Gompertz function
ystart - A vector of length equal to the number of dimensions used; defines the starting values of the covariates
tstart - A number that defines the start time (30 by default)
tend - A number that defines the final time (105 by default)
k - Number of dimensions (1 by default)
This function returns a table with simulated data, as shown in the example below:
library(stpm)
data <- simdata_cont(N=10)
head(data)
## id xi t1 t2 y1 y1.next
## 1 1 0 72.84077415 73.88064023 78.66128757 70.16129802
## 2 1 0 73.88064023 75.38970150 70.16129802 69.19208254
## 3 1 0 75.38970150 78.06740699 69.19208254 69.46322879
## 4 1 0 78.06740699 80.41219847 69.46322879 66.96488916
## 5 1 0 80.41219847 82.39449764 66.96488916 70.84761650
## 6 1 0 82.39449764 84.72271349 70.84761650 63.54851804
The stpm R package currently offers continuous-time and discrete-time simulations. Below we describe the simulations in detail. The corresponding functions are simdata_cont_MD(...) for continuous-time and simdata_discr_MD(...) for discrete-time simulations.
We model observations from a subject (which can be any system in general). At first, we assume that the subject is alive and compute the starting observation time t1 and the next time t2:
t1 = runif(1, tstart, tend)
t2 = t1 + 2*runif(1, 0, 1)
Here runif() is a random number generator that returns uniformly distributed values. We assume that t1 is a random value, uniformly distributed between the start time (tstart) and the end time (tend).
Computing y1 (an observed variable) from the previous observation:
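if (event == FALSE) {
  # draw the starting value around ystart with standard deviation sd0
  y1 = rnorm(1, ystart, sd0)
} else {
  # otherwise carry over the value from the previous observation
  y1 = y2
}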
Here rnorm(...) is a random number generator that returns normally distributed values.
In order to compute y2, we need to compute a survival function S based on equations 3, 4 and 5. We then compare S to a uniformly distributed random number. If S is smaller than that number, we assume that the event has happened (death of the subject or system failure). Otherwise we compute y2 and proceed to the next iteration:
if (S < runif(1, 0, 1)) {
  y2 = rnorm(1, m, sqrt(gamma))
  event = TRUE
  new_subject = TRUE
} else if (event == FALSE) {
  y2 = rnorm(1, m, sqrt(gamma))
  event = FALSE
  new_record = TRUE
}
In the discrete-time case, we use equal intervals dt between observations, and the survival function S is computed directly from \(\mu\) (equation 2):
\(S = e^{-\mu(t_1)}\)
The rest of the discrete simulation routine is the same as in continuous-time simulation case.
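The discrete-time routine can be sketched for a single subject in base R as follows. This is a simplified illustration under stated assumptions, not the code of the package's simulation functions. The function name simulate_subject and the parameter values are hypothetical, the state follows equation (1) with Gaussian noise, S is treated as the probability of surviving one step (so the event occurs when a uniform draw exceeds S), and the next value is left missing once the event occurs.

```r
# Simulate one subject in discrete time (one dimension).
simulate_subject <- function(u, R, sigma, mu0, b, Q, theta,
                             ystart = 80, tstart = 30, tend = 105, dt = 1) {
  rows <- list()
  t1 <- tstart
  y1 <- ystart
  while (t1 < tend) {
    t2 <- t1 + dt
    mu <- (mu0 + b * y1 + Q * y1^2) * exp(theta * t1)   # equation (2)
    S  <- exp(-mu)                                       # survival over one step
    event <- runif(1, 0, 1) > S                          # event with probability 1 - S
    # equation (1); no next value is drawn after the event (sketch choice)
    y2 <- if (event) NA_real_ else u + R * y1 + rnorm(1, 0, sigma)
    rows[[length(rows) + 1]] <- data.frame(
      xi = as.integer(event), t1 = t1, t2 = t2, y1 = y1, y1.next = y2
    )
    if (event) break
    t1 <- t2
    y1 <- y2
  }
  do.call(rbind, rows)
}

set.seed(42)
traj <- simulate_subject(u = 4, R = 0.95, sigma = 5,
                         mu0 = 1.3e-4, b = -3e-6, Q = 2e-8, theta = 0.08)
head(traj)
```

The returned table mirrors the xi, t1, t2, y1 and y1.next columns of the simulated datasets shown earlier, with at most one row flagged as an event.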
[1] Woodbury M.A., Manton K.G., Random-Walk of Human Mortality and Aging. Theoretical Population Biology, 1977 11:37-48.
[2] Yashin, A.I., Manton K.G., Vaupel J.W. Mortality and aging in a heterogeneous population: a stochastic process model with observed and unobserved variables. Theor Pop Biology, 1985 27.
[3] Yashin, A.I. et al. Stochastic model for analysis of longitudinal data on aging and mortality. Mathematical Biosciences, 2007 208(2) 538-551.
[4] Akushevich I., Kulminski A., Manton K. Life tables with covariates: Dynamic model for Nonlinear Analysis of Longitudinal Data. Mathematical Population Studies, 2005 12(2) 51-80.
[5] Yashin, A. et al. Health decline, aging and mortality: how are they related? Biogerontology, 2007 8(3), 291-302.