The Stochastic Process Model (SPM) was developed several decades ago [1,2] and has been applied to analyses of clinical, demographic and epidemiologic longitudinal data, as well as in many other studies that relate the stochastic dynamics of repeated measures to the probability of end-points (outcomes). SPM links the dynamics of stochastic variables with a hazard rate that is a quadratic function of the state variables [3]. The R package “stpm” is a set of utilities for estimating the parameters of stochastic processes and for modeling survival trajectories and time-to-event outcomes observed in longitudinal studies. It is a general framework for studying and modeling survival (censored) traits that depend on random trajectories (stochastic paths) of variables.
require(devtools)
devtools::install_github("izhbannikov/stpm")
If you experience errors during installation, please download a binary file from the following url:
Then execute this command (from the R environment):
install.packages("<path to the downloaded r-package stpm>", repos=NULL, type="binary")
The data are typical longitudinal data in the form of two datasets: a longitudinal dataset (follow-up studies), in which one record represents a single observation, and vital (survival) statistics, in which one record contains all information about the subject. The longitudinal dataset can contain a subject ID (identification number), status (event (1) / no event (0)), time and measurements of the variables. The package can handle an arbitrary number of variables, but in practice 5-7 variables are enough.
Below is an example of clinical data that can be used in stpm; the fields are discussed later. Longitudinal studies:
## ID IndicatorDeath Age DBP BMI
## 1 1 0 30 80.00000 25.00000
## 2 1 0 32 80.51659 26.61245
## 3 1 0 34 77.78412 29.16790
## 4 1 0 36 77.86665 32.40359
## 5 1 0 38 96.55673 31.92014
## 6 1 0 40 94.48616 32.89139
Vital statistics:
## ID IsDead LSmort
## 1 1 1 85.34578
## 2 2 1 80.55053
## 3 3 1 98.07315
## 4 4 1 81.29779
## 5 5 1 89.89829
## 6 6 1 72.47687
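The interval layout used in the simulated datasets of this document (columns id, xi, t1, t2, y1, y1.next) can be derived from long-format follow-up records like the ones above. The following is a minimal base-R sketch, not the package's own data preparation: the helper `to_intervals` and its arguments are hypothetical, and taking `xi` from the status at the end of each interval is an assumption.

```r
# Hypothetical helper (not part of stpm): turn long-format follow-up records
# into one row per interval with start/end times and start/end values.
to_intervals <- function(long, id = "ID", status = "IndicatorDeath",
                         time = "Age", covariate = "DBP") {
  pieces <- lapply(split(long, long[[id]]), function(d) {
    d <- d[order(d[[time]]), ]          # sort each subject by time
    n <- nrow(d)
    if (n < 2) return(NULL)             # a single record gives no interval
    data.frame(
      id      = d[[id]][-n],
      xi      = d[[status]][-1],        # assumption: status at the interval end
      t1      = d[[time]][-n],
      t2      = d[[time]][-1],
      y1      = d[[covariate]][-n],
      y1.next = d[[covariate]][-1]
    )
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# The six records of subject 1 from the longitudinal table above
long <- data.frame(
  ID = 1,
  IndicatorDeath = 0,
  Age = c(30, 32, 34, 36, 38, 40),
  DBP = c(80.00000, 80.51659, 77.78412, 77.86665, 96.55673, 94.48616)
)
intervals <- to_intervals(long)
print(intervals)
```

Six observations of one subject give five intervals, each of which pairs the value at the start of the interval with the value at the next observation.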
There are two main SPM types in the package: a discrete-time model [4] and a continuous-time model [3]. The discrete model assumes equal intervals between follow-up observations. An example of a discrete dataset is given below.
library(stpm)
data <- simdata_discr(N=10, ystart=80)
head(data)
## id xi t1 t2 y1 y1.next
## [1,] 1 0 30 31 80.00000 77.97124
## [2,] 1 0 31 32 77.97124 72.75948
## [3,] 1 0 32 33 72.75948 78.30266
## [4,] 1 0 33 34 78.30266 83.51840
## [5,] 1 0 34 35 83.51840 83.69369
## [6,] 1 0 35 36 83.69369 79.87638
In this case the intervals between t1 and t2 (the age at the current and at the next observation) are equal.
The opposite is the continuous case, in which the intervals between observations are not equal. An example of a continuous-case dataset is shown below:
library(stpm)
data <- simdata_cont2(N=5,ystart = 50)
head(data)
## id xi t1 t2 y1 y1.next
## [1,] 0 0 37.91828 39.04887 50.99913 51.12754
## [2,] 0 0 39.04887 40.66078 51.12754 55.65004
## [3,] 0 0 40.66078 41.75316 55.65004 63.54484
## [4,] 0 0 41.75316 43.71354 63.54484 67.56044
## [5,] 0 0 43.71354 45.69289 67.56044 60.13357
## [6,] 0 0 45.69289 46.74288 60.13357 67.28961
In the discrete model, we use the following assumptions: \[ \bar{y}(t+1) = \bar{u} + \bar{R} \times \bar{y}(t) + \bar{\epsilon} \] (1) \[ \mu(t) = \mu_0(t) + \bar{b}(t) \times \bar{y}(t) + \bar{Q}(t) \times \bar{y}(t)^2 \] (2)
Where: \[ \mu_0(t) = \mu_0 e^{\theta t} \] \[ \bar{b}(t) = \bar{b} e^{\theta t} \] \[ \bar{Q}(t) = \bar{Q} e^{\theta t} \]
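Equations (1) and (2) can be illustrated for one variable with a few lines of base R. This is only a sketch under stated assumptions: the parameter values are rounded from the spm_discrete() output shown later in this document, the noise term is assumed to be Gaussian with standard deviation `sigma`, and the function names `next_y` and `hazard` are made up for the illustration.

```r
set.seed(1)

# Illustrative one-dimensional parameter values, rounded from the
# spm_discrete() output shown later in this document.
u <- 4.07; R <- 0.95; sigma <- 5.03
mu0 <- 8.37e-5; b <- -2.07e-6; Q <- 1.39e-8; theta <- 0.083

# Equation (1): next value of the covariate
# (assumption: epsilon ~ N(0, sigma^2)).
next_y <- function(y) u + R * y + rnorm(1, mean = 0, sd = sigma)

# Equation (2): hazard with mu0, b and Q scaled by exp(theta * t)
hazard <- function(t, y) {
  scale <- exp(theta * t)
  mu0 * scale + b * scale * y + Q * scale * y^2
}

y <- 80
for (t in 30:34) {
  cat(sprintf("t = %d  y = %6.2f  mu = %.3e\n", t, y, hazard(t, y)))
  y <- next_y(y)
}
```

Because all three hazard coefficients share the factor exp(theta * t), the hazard at a fixed covariate value grows by the factor exp(theta) per time unit.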
library(stpm)
data <- simdata_discr(N=200)
#Parameters estimation
pars <- spm_discrete(data)
pars
## $Ak2005
## $Ak2005$theta
## [1] 0.083
##
## $Ak2005$mu0
## [1] 8.369318058e-05
##
## $Ak2005$b
## [1] -2.069622929e-06
##
## $Ak2005$Q
## [,1]
## [1,] 1.39043211e-08
##
## $Ak2005$u
## [1] 4.0695147
##
## $Ak2005$R
## [1] 0.9497498373
##
## $Ak2005$Sigma
## [1] 5.033107967
##
##
## $Ya2007
## $Ya2007$a
## [,1]
## [1,] -0.05025016273
##
## $Ya2007$f1
## [,1]
## [1,] 80.9851049
##
## $Ya2007$Q
## [,1]
## [1,] 1.39043211e-08
##
## $Ya2007$f
## [,1]
## [1,] 74.42373178
##
## $Ya2007$b
## [,1]
## [1,] 5.033107967
##
## $Ya2007$mu0
## [,1]
## [1,] 6.678649707e-06
##
## $Ya2007$theta
## [1] 0.083
##
##
## attr(,"class")
## [1] "spm.discrete"
\[ \mu(u) = \mu_0(u) + (\bar{m}(u) - \bar{f}(u))^* \times \bar{Q}(u) \times (\bar{m}(u) - \bar{f}(u)) + Tr(\bar{Q}(u) \times \bar{\gamma}(u)) \] (3)
\[ d\bar{m}(t)/dt = \bar{a}(t) \times (\bar{m}(t) - \bar{f}_1(t)) - 2 \bar{\gamma}(t) \times \bar{Q}(t) \times (\bar{m}(t) - \bar{f}(t)) \] (4) \[ d\bar{\gamma}(t)/dt = \bar{a}(t) \times \bar{\gamma}(t) + \bar{\gamma}(t) \times \bar{a}(t)^* + \bar{b}(t) \times \bar{b}(t)^* - 2 \bar{\gamma}(t) \times \bar{Q}(t) \times \bar{\gamma}(t) \] (5)
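For one variable, equations (3), (4) and (5) reduce to scalar differential equations that can be integrated numerically. The sketch below uses simple Euler steps in base R and is not the package's estimation code. It assumes that mu0 and Q grow as exp(theta * t) while a, f1, f and b stay constant, that the interval starts at an observed value (m = y1, gamma = 0), and that the survival probability over the interval is the exponential of minus the integrated hazard. The function name `integrate_interval` is hypothetical.

```r
# Parameter values taken from the spm_continuous() call in this document
a <- -0.05; f1 <- 80; Q <- 2e-8; f <- 80; b <- 5; mu0 <- 2e-5; theta <- 0.08

# Assumption: mu0 and Q grow as exp(theta * t); a, f1, f and b are constant.
mu0_t <- function(t) mu0 * exp(theta * t)
Q_t   <- function(t) Q * exp(theta * t)

# Integrate equations (4) and (5) from t1 to t2 with Euler steps and
# accumulate the integral of the hazard (3) along the way.
integrate_interval <- function(t1, t2, y1, steps = 1000) {
  h <- (t2 - t1) / steps
  m <- y1       # assumption: conditional mean starts at the observed value
  gamma <- 0    # assumption: conditional variance starts at zero
  cum_hazard <- 0
  t <- t1
  for (i in seq_len(steps)) {
    mu <- mu0_t(t) + (m - f)^2 * Q_t(t) + Q_t(t) * gamma    # equation (3)
    dm <- a * (m - f1) - 2 * gamma * Q_t(t) * (m - f)       # equation (4)
    dg <- 2 * a * gamma + b^2 - 2 * gamma^2 * Q_t(t)        # equation (5)
    cum_hazard <- cum_hazard + mu * h
    m <- m + dm * h
    gamma <- gamma + dg * h
    t <- t + h
  }
  list(m = m, gamma = gamma, S = exp(-cum_hazard))
}

res <- integrate_interval(t1 = 40, t2 = 42, y1 = 90)
print(res)
```

With these values the mean drifts from 90 back towards f1 = 80, and the variance grows from zero at a rate close to b^2, because the terms containing Q are very small.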
library(stpm)
#Reading the data:
data <- simdata_cont2(N=100)
head(data)
## id xi t1 t2 y1 y1.next
## [1,] 0 0 35.69370874 37.07987729 80.39139407 83.11449873
## [2,] 0 0 37.07987729 38.25836416 83.11449873 83.27321760
## [3,] 0 0 38.25836416 39.40633274 83.27321760 89.73567427
## [4,] 0 0 39.40633274 40.97531492 89.73567427 94.70807684
## [5,] 0 0 40.97531492 42.72154746 94.70807684 96.47615846
## [6,] 0 0 42.72154746 44.63677688 96.47615846 100.89649668
#Parameters estimation:
pars <- spm_continuous(dat=data,a=-0.05, f1=80,
Q=2e-8, f=80, b=5, mu0=2e-5, theta=0.08)
## Parameter theta achieved lower/upper bound.
## 0.072
pars
## $a
## [,1]
## [1,] -0.05453693777
##
## $f1
## [,1]
## [1,] 79.39475939
##
## $Q
## [,1]
## [1,] 2.160177469e-08
##
## $f
## [,1]
## [1,] 83.73782074
##
## $b
## [,1]
## [1,] 5.025732121
##
## $mu0
## [1] 1.839933512e-05
##
## $theta
## [1] 0.072
##
## $limit
## [1] TRUE
##
## attr(,"class")
## [1] "spm.continuous"
\[ \bar{Q} = \bar{Q} \] \[ \bar{a} = \bar{R} - diag(k) \] \[ \bar{b} = \bar{\epsilon} \] \[ \bar{f}_1 = -1 \times \bar{u} \times \bar{a}^{-1} \] \[ \bar{f} = -0.5 \times \bar{b} \times \bar{Q}^{-1} \] \[ \mu_0 = \mu_0 - \bar{f} \times \bar{Q} \times \bar{f}^* \] \[ \theta = \theta \]
Here \(k\) is the number of variables (covariates), which is equal to the model’s dimension.
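As a check, the conversion can be reproduced in base R from the discrete-model estimates printed earlier ($Ak2005) and compared with the continuous-form values ($Ya2007). This is a sketch for the one-dimensional case. The variable names are chosen for the illustration, and taking the continuous b from the printed Sigma follows the output above rather than a documented rule.

```r
# Discrete-model estimates ($Ak2005 in the output above)
u <- 4.0695147; R <- 0.9497498373; Sigma <- 5.033107967
mu0_d <- 8.369318058e-05; b_d <- -2.069622929e-06; Q <- 1.39043211e-08
k <- 1  # one covariate

# Store the values as matrices so that the matrix formulas apply verbatim
R <- matrix(R, k, k); u <- matrix(u, 1, k)
b_d <- matrix(b_d, 1, k); Q <- matrix(Q, k, k)

a   <- R - diag(k)                     # a = R - diag(k)
f1  <- -1 * u %*% solve(a)             # f1 = -u a^{-1}
f   <- -0.5 * b_d %*% solve(Q)         # f = -0.5 b Q^{-1}
mu0 <- mu0_d - f %*% Q %*% t(f)        # mu0 = mu0 - f Q f*
b   <- Sigma                           # diffusion coefficient

print(c(a = a, f1 = f1, f = f, mu0 = mu0, b = b))
```

The results agree with the $Ya2007 block: a is about -0.0503, f1 about 80.985, f about 74.424 and mu0 about 6.68e-06.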
In the previous models, we assumed that the coefficients are time-dependent in a specific way: we multiplied them by \(e^{\theta t}\). In general, this may not be the case [5]. We extend this to a general case, i.e. (we consider the one-dimensional case):
\[ \bar{a}(t) = par_1 t + par_2 \] (a linear function).
The corresponding equations are equivalent to those of the one-dimensional continuous case described above.
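To make the substitution explicit, inserting the linear coefficient into the one-dimensional forms of equations (4) and (5) would give the following (a sketch that assumes only \(a(t)\) is replaced and the other coefficients keep their own forms):

\[ dm(t)/dt = (par_1 t + par_2) \times (m(t) - f_1(t)) - 2 \gamma(t) \times Q(t) \times (m(t) - f(t)) \]
\[ d\gamma(t)/dt = 2 (par_1 t + par_2) \times \gamma(t) + b(t)^2 - 2 Q(t) \times \gamma(t)^2 \]

Setting \(par_1 = 0\) recovers a constant coefficient \(a = par_2\).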
library(stpm)
#Data preparation:
n <- 500
data <- simdata_time_dep(N=n)
# Estimation:
opt.par <- spm_time_dep(data,
start = list(a = -0.05, f1 = 80, Q = 2e-08, f = 80, b = 5, mu0 = 0.001),
f = list(at = "a", f1t = "f1", Qt = "Q", ft = "f", bt = "b", mu0t= "mu0"))
opt.par
## [[1]]
## [[1]]$a
## [1] -0.03904617932
##
## [[1]]$f1
## [1] 79.36413946
##
## [[1]]$Q
## [1] 2.22664768e-08
##
## [[1]]$f
## [1] 100
##
## [[1]]$b
## [1] 3.750239397
##
## [[1]]$mu0
## [1] 0.001249916021
We added one- and multi-dimensional simulation to be able to generate test data for hypothesis testing. The simulated data can be discrete (equal intervals between observations) or continuous (arbitrary intervals).
The corresponding function is (\(k\) is the number of variables (covariates), equal to the model’s dimension):
simdata_discr(N=100, a=-0.05, f1=80, Q=2e-8, f=80, b=5, mu0=1e-5, theta=0.08, ystart=80, tstart=30, tend=105, dt=1)
Here:
N - Number of individuals
a - A k by k matrix, which characterizes the rate of the adaptive response
f1 - A particular state, which is a deviation from the normal (or optimal) state. This is a vector of length k
Q - A k by k matrix, which is a non-negative-definite symmetric matrix
f - A vector-function (of length k) of the normal (or optimal) state
b - A diffusion coefficient, a k by k matrix
mu0 - Mortality at the start of the period (baseline hazard)
theta - A displacement coefficient of the Gompertz function
ystart - A vector of length equal to the number of dimensions used; defines the starting values of the covariates
tstart - A number that defines the start time (30 by default)
tend - A number that defines the final time (105 by default)
dt - The time interval between observations
This function returns a table with simulated data, as shown in the example below:
library(stpm)
data <- simdata_discr(N=10, ystart=75)
head(data)
## id xi t1 t2 y1 y1.next
## [1,] 1 0 30 31 75.00000000 67.86573623
## [2,] 1 0 31 32 67.86573623 73.07974659
## [3,] 1 0 32 33 73.07974659 73.75880023
## [4,] 1 0 33 34 73.75880023 75.95539487
## [5,] 1 0 34 35 75.95539487 77.53082484
## [6,] 1 0 35 36 77.53082484 75.35931813
The corresponding function is (\(k\) is the number of variables (covariates), equal to the model’s dimension):
simdata_cont2(N=100, a=-0.05, f1=80, Q=2e-07, f=80, b=5, mu0=2e-05, theta=0.08, ystart=80, tstart=30, tend=105)
Here:
N - Number of individuals
a - A k by k matrix, which characterizes the rate of the adaptive response
f1 - A particular state, which is a deviation from the normal (or optimal) state. This is a vector of length k
Q - A k by k matrix, which is a non-negative-definite symmetric matrix
f - A vector-function (of length k) of the normal (or optimal) state
b - A diffusion coefficient, a k by k matrix
mu0 - Mortality at the start of the period (baseline hazard)
theta - A displacement coefficient of the Gompertz function
ystart - A vector of length equal to the number of dimensions used; defines the starting values of the covariates
tstart - A number that defines the start time (30 by default)
tend - A number that defines the final time (105 by default)
This function returns a table with simulated data, as shown in the example below:
library(stpm)
data <- simdata_cont(N=10)
head(data)
## id xi t1 t2 y1 y1.next
## 1 1 0 84.19505282 85.19873357 79.91748242 75.59100104
## 2 1 0 85.19873357 86.98758681 75.59100104 72.91514416
## 3 1 0 86.98758681 88.94467380 72.91514416 83.63339466
## 4 1 0 88.94467380 89.98709535 83.63339466 85.38250169
## 5 1 0 89.98709535 91.61485582 85.38250169 87.87385700
## 6 1 1 91.61485582 92.78031357 87.87385700 NA
The R package stpm currently offers continuous- and discrete-time simulations. Below we describe the simulations in detail. The corresponding functions are simdata_cont_MD(...) for continuous-time and simdata_discr_MD(...) for discrete-time simulations.
We model observations from a subject (which can be any system in general). At first, we assume that the subject is alive and compute the starting observation time t1 and the next time t2:
t1 = runif(1, tstart, tend)
t2 = t1 + 2*runif(1, 0, 1)
Here runif() is a random number generator that returns uniformly distributed values. We assume that t1 is a random value, uniformly distributed from the start time (tstart) to the end time (tend).
Computing y1 (an observed variable) from the previous observation:
if (new_subject) {
  # First record of a subject: draw the starting value around ystart.
  y1 <- rnorm(1, ystart, sd0)
} else {
  # Later records continue from the end of the previous interval.
  y1 <- y2
}
Here rnorm(...) is a random number generator that returns normally distributed values.
In order to compute y2, we need to compute the survival function S based on equations 3, 4 and 5. We then compare S to a uniformly distributed random number. Since S is the probability of surviving the interval, the subject survives if S is larger than that number: in that case we compute y2 and proceed to the next iteration. Otherwise we assume that the event has happened (death of the subject or system failure):
if (S > runif(1, 0, 1)) {
  # Survival: draw the next observation from N(m, gamma).
  y2 <- rnorm(1, m, sqrt(gamma))
  event <- FALSE
  new_record <- TRUE
} else {
  # Event: close this subject and start a new one.
  event <- TRUE
  new_subject <- TRUE
}
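The survival function used in this step is not written out in the text. Under the standard relation between a hazard and a survival probability, and assuming that equations (4) and (5) are started at \(m(t_1) = y_1\) and \(\gamma(t_1) = 0\), it would read:

\[ S(t_1, t_2) = \exp\left( -\int_{t_1}^{t_2} \mu(u) \, du \right) \]

where \(\mu(u)\) is given by equation (3) evaluated along the solutions \(m(u)\) and \(\gamma(u)\) of equations (4) and (5). The next observation is then drawn as \(y_2 \sim N(m(t_2), \gamma(t_2))\), which matches the call rnorm(1, m, sqrt(gamma)).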
In the discrete-time case we use equal intervals dt between observations, and the survival function S is computed directly from \(\mu\) (equation 2):
\(S = e^{-\mu(t_1)}\)
The rest of the discrete simulation routine is the same as in continuous-time simulation case.
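The discrete-time routine can be sketched for a single subject and a single variable in base R. This is an illustration and not the package's simdata_discr() implementation: the parameter values are illustrative, the function `simulate_subject` is hypothetical, the noise in equation (1) is assumed to be Gaussian, and S is treated as the probability of surviving one interval, so the event is drawn with probability 1 - S.

```r
set.seed(42)

# Illustrative one-dimensional values (not estimates from real data)
u <- 4; R <- 0.95; sigma <- 5
mu0 <- 1e-5; b <- 0; Q <- 2e-8; theta <- 0.08

# Equation (2) with all coefficients scaled by exp(theta * t)
hazard <- function(t, y) exp(theta * t) * (mu0 + b * y + Q * y^2)

simulate_subject <- function(id, ystart = 80, tstart = 30, tend = 105, dt = 1) {
  rows <- list()
  t1 <- tstart
  y1 <- ystart
  while (t1 < tend) {
    t2 <- t1 + dt
    S <- exp(-hazard(t1, y1))        # survival over one interval
    died <- runif(1) > S             # assumption: event with probability 1 - S
    # Equation (1) for survivors; no next value after the event
    y2 <- if (died) NA_real_ else u + R * y1 + rnorm(1, 0, sigma)
    rows[[length(rows) + 1]] <- data.frame(id = id, xi = as.integer(died),
                                           t1 = t1, t2 = t2,
                                           y1 = y1, y1.next = y2)
    if (died) break
    t1 <- t2
    y1 <- y2
  }
  do.call(rbind, rows)
}

sim <- simulate_subject(1)
print(head(sim))
```

Each row covers one interval of length dt. The indicator xi is 1 only in the last row of a subject who experiences the event, and y1.next is missing there.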
[1] Woodbury M.A., Manton K.G. Random-Walk of Human Mortality and Aging. Theoretical Population Biology, 1977 11: 37-48.
[2] Yashin A.I., Manton K.G., Vaupel J.W. Mortality and aging in a heterogeneous population: a stochastic process model with observed and unobserved variables. Theoretical Population Biology, 1985 27.
[3] Yashin A.I. et al. Stochastic model for analysis of longitudinal data on aging and mortality. Mathematical Biosciences, 2007 208(2): 538-551.
[4] Akushevich I., Kulminski A., Manton K. Life tables with covariates: Dynamic model for Nonlinear Analysis of Longitudinal Data. Mathematical Population Studies, 2005 12(2): 51-80.
[5] Yashin A. et al. Health decline, aging and mortality: how are they related? Biogerontology, 2007 8(3): 291-302.