This vignette aims to provide a more thorough introduction to the features in multistateutils than the brief overview in the README. It uses the same example dataset and models, but has more examples and accompanying discussion.
This guide assumes familiarity with multi-state modelling in R; this section in particular glosses over the details and just prepares the models and data needed to demonstrate the features of multistateutils. If you are unfamiliar with multi-state modelling, I would recommend reading de Wreede, Fiocco, and Putter (2011) or the mstate tutorial by Putter.
For these examples the ebmt3 data set from mstate will be used. This provides a simple illness-death model of patients following transplant. The initial state is the patient having received a transplant, pr refers to platelet recovery (the ‘illness’), and relapse-free survival (rfs) is the only sink state.
library(mstate)
#> Loading required package: survival
data(ebmt3)
head(ebmt3)
#> id prtime prstat rfstime rfsstat dissub age drmatch tcd
#> 1 1 23 1 744 0 CML >40 Gender mismatch No TCD
#> 2 2 35 1 360 1 CML >40 No gender mismatch No TCD
#> 3 3 26 1 135 1 CML >40 No gender mismatch No TCD
#> 4 4 22 1 995 0 AML 20-40 No gender mismatch No TCD
#> 5 5 29 1 422 1 AML 20-40 No gender mismatch No TCD
#> 6 6 38 1 119 1 ALL >40 No gender mismatch No TCD
mstate provides a host of utility functions for working with multi-state models. For example, the trans.illdeath() function provides the required transition matrix for this state structure (transMat should be used when greater flexibility is required).
tmat <- trans.illdeath(c('transplant', 'pr', 'rfs'))
tmat
#> to
#> from transplant pr rfs
#> transplant NA 1 2
#> pr NA NA 3
#> rfs NA NA NA
The final data preparation step is to convert the data from a wide format (each row corresponding to a patient) to a long format, where each row represents a potential patient-transition. The msprep function from mstate handles this for us. We’ll keep both the age and dissub covariates in this example.
long <- msprep(time=c(NA, 'prtime', 'rfstime'),
status=c(NA, 'prstat', 'rfsstat'),
data=ebmt3,
trans=tmat,
keep=c('age', 'dissub'))
head(long)
#> An object of class 'msdata'
#>
#> Data:
#> id from to trans Tstart Tstop time status age dissub
#> 1 1 1 2 1 0 23 23 1 >40 CML
#> 2 1 1 3 2 0 23 23 0 >40 CML
#> 3 1 2 3 3 23 744 721 0 >40 CML
#> 4 2 1 2 1 0 35 35 1 >40 CML
#> 5 2 1 3 2 0 35 35 0 >40 CML
#> 6 2 2 3 3 35 360 325 1 >40 CML
Clock-reset Weibull models will be fitted to each of these 3 transitions, giving a semi-Markov model. Simulation is therefore needed to obtain transition probabilities, as the Kolmogorov forward differential equation is no longer valid once the Markov assumption is violated. For simplicity’s sake, we assume that the baseline hazards are not proportional between transitions and that no covariate effects are shared across transitions.
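The calls later in this guide use two objects, models and newdata, whose definitions do not appear in this text. A minimal sketch of what they might look like is given here, assuming one Weibull flexsurvreg fit per transition (each fitted to the rows of long for that transition) and the single 20-40 year old AML patient described in the text. The formula and the per-transition subsetting are assumptions, so the numbers this produces need not match the printed outputs.

```r
library(mstate)    # msprep(), trans.illdeath(), ebmt3
library(flexsurv)  # flexsurvreg()

data(ebmt3)
tmat <- trans.illdeath(c('transplant', 'pr', 'rfs'))
long <- msprep(time=c(NA, 'prtime', 'rfstime'),
               status=c(NA, 'prstat', 'rfsstat'),
               data=ebmt3, trans=tmat, keep=c('age', 'dissub'))

# Assumption: one clock-reset Weibull model per transition, each fitted
# only to the rows of `long` that belong to that transition.
models <- lapply(1:3, function(i) {
    d <- long[long$trans == i, ]
    flexsurvreg(Surv(time, status) ~ age + dissub, data=d, dist='weibull')
})

# Assumption: the individual used in the examples (age 20-40, AML).
newdata <- data.frame(age='20-40', dissub='AML')
```

Fitting on the `time` column (time since state entry) rather than on `Tstart`/`Tstop` is what makes these clock-reset models.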
Transition probabilities are defined as the probability of being in a state \(j\) at a time \(t\), given being in state \(h\) at time \(s\), as shown below where \(X(t)\) gives the state an individual is in at \(t\). This is all conditional on the individual parameterised by their covariates and history, which for this semi-Markov model only influences transition probabilities through state arrival times.
\[P_{h,j}(s, t) = \Pr(X(t) = j\ |\ X(s) = h)\]
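To make the definition concrete, the sketch below estimates \(P_{h,j}(0, 365)\) for \(h=\text{transplant}\) by simulating many individuals through a clock-reset illness-death model in base R. The Weibull shape and scale values are hypothetical and chosen only for illustration; this is not the package's implementation.

```r
# Hypothetical Weibull parameters for each transition (not fitted values).
set.seed(1)
n     <- 10000   # number of simulated individuals
t_end <- 365     # prediction time t, with s = 0 and h = transplant

t12 <- rweibull(n, shape=0.8, scale=400)    # transplant -> pr
t13 <- rweibull(n, shape=0.9, scale=2000)   # transplant -> rfs
t23 <- rweibull(n, shape=0.7, scale=1500)   # pr -> rfs, clock reset on arrival in pr

to_pr     <- t12 < t13                       # the earlier competing event occurs
exit_time <- pmin(t12, t13)                  # time of leaving transplant
rfs_time  <- ifelse(to_pr, t12 + t23, t13)   # time of entering the sink state

# State occupied at t_end for each simulated individual
state  <- ifelse(exit_time > t_end, 'transplant',
          ifelse(rfs_time > t_end, 'pr', 'rfs'))
states <- c('transplant', 'pr', 'rfs')

# Proportion of individuals in each state estimates the first row of P(0, t_end)
probs <- sapply(states, function(st) mean(state == st))
probs
```

Because the sojourn time in pr is drawn afresh on arrival, the time already spent in transplant does not enter the pr to rfs hazard, which is the clock-reset assumption.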
We’ll estimate the transition probabilities of an individual with the covariates age=20-40 and dissub=AML at 1 year after transplant.
The function that estimates transition probabilities is called predict_transitions and has a very similar interface to flexsurv::pmatrix.simfs. The parameters in the above equation have the following argument names:
- \(t\): times (must be supplied)
- \(s\): start_times (defaults to 0)
The code example below shows how to calculate transition probabilities for \(t=365\) (1 year) with \(s=0\), that is, the probabilities of being in each state 1 year after transplant given each possible state at transplant time. As with pmatrix.simfs, the probabilities for every pairwise combination of states are calculated, even though some are redundant. For example, \(P_{h,j}(0, 365)\) where \(h=j=\text{rfs}\) is hardly a useful prediction.
predict_transitions(models, newdata, tmat, times=365)
#> age dissub start_time end_time start_state transplant pr rfs
#> 1 20-40 AML 0 365 transplant 0.4710346 0.1929229 0.3360425
#> 2 20-40 AML 0 365 pr 0.0000000 0.6851288 0.3148712
#> 3 20-40 AML 0 365 rfs 0.0000000 0.0000000 1.0000000
Note that this gives very similar results to pmatrix.simfs.
pmatrix.simfs(models, tmat, newdata=newdata, t=365)
#> [,1] [,2] [,3]
#> [1,] 0.47139 0.19170 0.33691
#> [2,] 0.00000 0.68672 0.31328
#> [3,] 0.00000 0.00000 1.00000
Confidence intervals can be constructed in the same fashion as pmatrix.simfs, using draws from the multivariate-normal distribution of the parameter estimates.
predict_transitions(models, newdata, tmat, times=365, ci=TRUE, M=10)
#> age dissub start_time end_time start_state transplant_est pr_est
#> 1 20-40 AML 0 365 transplant 0.4692728 0.1952858
#> 2 20-40 AML 0 365 pr 0.0000000 0.6831183
#> 3 20-40 AML 0 365 rfs 0.0000000 0.0000000
#> rfs_est transplant_2.5% pr_2.5% rfs_2.5% transplant_97.5% pr_97.5%
#> 1 0.3354414 0.4600884 0.1852284 0.3184682 0.4871475 0.2103504
#> 2 0.3168817 0.0000000 0.6683496 0.2920842 0.0000000 0.7079158
#> 3 1.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
#> rfs_97.5%
#> 1 0.3476259
#> 2 0.3316504
#> 3 1.0000000
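This gives rather different results from those of pmatrix.simfs, whose intervals seem too wide and whose point estimates differ markedly from those obtained without CIs. I’m unsure why this is the case, although one likely factor is that in pmatrix.simfs the M argument sets the number of simulated individuals (B sets the number of parameter draws), so M=9 simulates only nine patients; note that every estimate below is a multiple of 1/9.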
pmatrix.simfs(models, tmat, newdata=newdata, t=365, ci=TRUE, M=9)
#> [,1] [,2] [,3]
#> [1,] 0.4444444 0.3333333 0.2222222
#> [2,] 0.0000000 1.0000000 0.0000000
#> [3,] 0.0000000 0.0000000 1.0000000
#> attr(,"lower")
#> [,1] [,2] [,3]
#> [1,] 0.1111111 0.0000000 0.1083333
#> [2,] 0.0000000 0.3333333 0.0000000
#> [3,] 0.0000000 0.0000000 1.0000000
#> attr(,"upper")
#> [,1] [,2] [,3]
#> [1,] 0.7777778 0.4444444 0.6666667
#> [2,] 0.0000000 1.0000000 0.6666667
#> [3,] 0.0000000 0.0000000 1.0000000
#> attr(,"class")
#> [1] "fs.msm.est"
Note that for a single individual and time-point there is no speed-up: multistateutils takes roughly 5 times longer than flexsurv, although the difference between 1.2s and 0.2s isn’t that noticeable in interactive work. The main benefit comes when estimating more involved probabilities, as will be demonstrated next.
library(microbenchmark)
microbenchmark("multistateutils"=predict_transitions(models, newdata, tmat, times=365),
"flexsurv"=pmatrix.simfs(models, tmat, newdata=newdata, t=365), times=10)
#> Unit: milliseconds
#> expr min lq mean median uq max
#> multistateutils 1189.4420 1192.9079 1228.5475 1205.6445 1266.9068 1341.426
#> flexsurv 225.4583 228.6994 234.8431 233.8883 242.3195 245.129
#> neval cld
#> 10 b
#> 10 a
Frequently, it is desirable to estimate transition probabilities at multiple values of \(t\), in order to build up a picture of an individual’s disease progression. pmatrix.simfs only allows a scalar for \(t\), so estimating probabilities at multiple values requires manually iterating through the time-scale. In the example below we will calculate transition probabilities at yearly intervals for 9 years.
predict_transitions(models, newdata, tmat, times=seq(9)*365)
#> age dissub start_time end_time start_state transplant pr rfs
#> 1 20-40 AML 0 365 transplant 0.4710978 0.1925549 0.3363473
#> 2 20-40 AML 0 365 pr 0.0000000 0.6842221 0.3157779
#> 3 20-40 AML 0 365 rfs 0.0000000 0.0000000 1.0000000
#> 4 20-40 AML 0 730 transplant 0.3558483 0.2068862 0.4372655
#> 5 20-40 AML 0 730 pr 0.0000000 0.5939398 0.4060602
#> 6 20-40 AML 0 730 rfs 0.0000000 0.0000000 1.0000000
#> 7 20-40 AML 0 1095 transplant 0.2882834 0.2094212 0.5022954
#> 8 20-40 AML 0 1095 pr 0.0000000 0.5342993 0.4657007
#> 9 20-40 AML 0 1095 rfs 0.0000000 0.0000000 1.0000000
#> 10 20-40 AML 0 1460 transplant 0.2421158 0.2052295 0.5526547
#> 11 20-40 AML 0 1460 pr 0.0000000 0.4887473 0.5112527
#> 12 20-40 AML 0 1460 rfs 0.0000000 0.0000000 1.0000000
#> 13 20-40 AML 0 1825 transplant 0.2070858 0.2008383 0.5920758
#> 14 20-40 AML 0 1825 pr 0.0000000 0.4528948 0.5471052
#> 15 20-40 AML 0 1825 rfs 0.0000000 0.0000000 1.0000000
#> 16 20-40 AML 0 2190 transplant 0.1805788 0.1955489 0.6238723
#> 17 20-40 AML 0 2190 pr 0.0000000 0.4222129 0.5777871
#> 18 20-40 AML 0 2190 rfs 0.0000000 0.0000000 1.0000000
#> 19 20-40 AML 0 2555 transplant 0.1592415 0.1884830 0.6522754
#> 20 20-40 AML 0 2555 pr 0.0000000 0.3957594 0.6042406
#> 21 20-40 AML 0 2555 rfs 0.0000000 0.0000000 1.0000000
#> 22 20-40 AML 0 2920 transplant 0.1413573 0.1823952 0.6762475
#> 23 20-40 AML 0 2920 pr 0.0000000 0.3744364 0.6255636
#> 24 20-40 AML 0 2920 rfs 0.0000000 0.0000000 1.0000000
#> 25 20-40 AML 0 3285 transplant 0.1264072 0.1770060 0.6965868
#> 26 20-40 AML 0 3285 pr 0.0000000 0.3540151 0.6459849
#> 27 20-40 AML 0 3285 rfs 0.0000000 0.0000000 1.0000000
In pmatrix.simfs it is up to the user to manipulate the output to make it interpretable. Again, the probabilities agree with each other.
do.call('rbind', lapply(seq(9)*365, function(t) {
pmatrix.simfs(models, tmat, newdata=newdata, t=t)
}))
#> [,1] [,2] [,3]
#> [1,] 0.47190 0.19130 0.33680
#> [2,] 0.00000 0.68542 0.31458
#> [3,] 0.00000 0.00000 1.00000
#> [4,] 0.35124 0.20874 0.44002
#> [5,] 0.00000 0.59447 0.40553
#> [6,] 0.00000 0.00000 1.00000
#> [7,] 0.28242 0.20948 0.50810
#> [8,] 0.00000 0.53279 0.46721
#> [9,] 0.00000 0.00000 1.00000
#> [10,] 0.23923 0.20721 0.55356
#> [11,] 0.00000 0.48839 0.51161
#> [12,] 0.00000 0.00000 1.00000
#> [13,] 0.20380 0.19965 0.59655
#> [14,] 0.00000 0.45400 0.54600
#> [15,] 0.00000 0.00000 1.00000
#> [16,] 0.17818 0.19410 0.62772
#> [17,] 0.00000 0.41980 0.58020
#> [18,] 0.00000 0.00000 1.00000
#> [19,] 0.15851 0.18830 0.65319
#> [20,] 0.00000 0.39682 0.60318
#> [21,] 0.00000 0.00000 1.00000
#> [22,] 0.13948 0.17985 0.68067
#> [23,] 0.00000 0.37158 0.62842
#> [24,] 0.00000 0.00000 1.00000
#> [25,] 0.12301 0.17337 0.70362
#> [26,] 0.00000 0.35691 0.64309
#> [27,] 0.00000 0.00000 1.00000
Beyond removing this boilerplate code, the speed benefit starts to show: calculating 8 additional time-points increases the runtime of multistateutils by only around 35%, from 1.2s to 1.7s, while flexsurv shows a roughly ten-fold increase from 0.2s to 2.3s.
microbenchmark("multistateutils"=predict_transitions(models, newdata, tmat, times=seq(9)*365),
"flexsurv"={do.call('rbind', lapply(seq(9)*365, function(t) {
pmatrix.simfs(models, tmat, newdata=newdata, t=t)}))
}, times=10)
#> Unit: seconds
#> expr min lq mean median uq max neval
#> multistateutils 1.517768 1.639387 1.661579 1.655867 1.671269 1.821268 10
#> flexsurv 2.099182 2.235882 2.272040 2.268024 2.337593 2.354112 10
#> cld
#> a
#> b
pmatrix.simfs limits the user to \(s=0\). In predict_transitions this is fully customisable. For example, the call below estimates the 1-year transition probabilities conditioned on the individual being alive at 6 months (technically it also calculates the transition probabilities conditioned on being dead at 6 months in the third row, but these aren’t helpful). Notice how the probabilities of being dead at 1 year have decreased as a result.
predict_transitions(models, newdata, tmat, times=365, start_times = 365/2)
#> age dissub start_time end_time start_state transplant pr rfs
#> 1 20-40 AML 182.5 365 transplant 0.8168102 0.07502584 0.10816397
#> 2 20-40 AML 182.5 365 pr 0.0000000 0.90040957 0.09959043
#> 3 20-40 AML 182.5 365 rfs 0.0000000 0.00000000 1.00000000
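The effect of conditioning on a later start time can be reproduced with a small base-R simulation: individuals are simulated from time 0, and only those still in transplant at \(s\) are kept when tabulating states at \(t\). The Weibull parameters are hypothetical, and this sketch is only an illustration of the idea rather than the method used inside predict_transitions.

```r
# Hypothetical Weibull parameters (not fitted values).
set.seed(4)
n     <- 20000
s     <- 365 / 2   # conditioning time
t_end <- 365       # prediction time

t12 <- rweibull(n, shape=0.8, scale=400)    # transplant -> pr
t13 <- rweibull(n, shape=0.9, scale=2000)   # transplant -> rfs
t23 <- rweibull(n, shape=0.7, scale=1500)   # pr -> rfs (clock-reset)

exit_time <- pmin(t12, t13)
rfs_time  <- ifelse(t12 < t13, t12 + t23, t13)

# State occupied at an arbitrary time u
state_at <- function(u) {
    ifelse(exit_time > u, 'transplant', ifelse(rfs_time > u, 'pr', 'rfs'))
}
states    <- c('transplant', 'pr', 'rfs')
occupancy <- function(x) sapply(states, function(st) mean(x == st))

# Unconditional: P(X(t) = j | X(0) = transplant)
uncond <- occupancy(state_at(t_end))

# Conditional: P(X(t) = j | X(s) = transplant), keeping those in transplant at s
in_transplant_at_s <- state_at(s) == 'transplant'
cond <- occupancy(state_at(t_end)[in_transplant_at_s])

rbind(uncond, cond)
```

Having already survived in transplant until \(s\), an individual has less time left in which to move on before \(t\), so the conditional probability of remaining in transplant is higher than the unconditional one.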
Multiple values of \(s\) can be provided, such as the quarterly predictions below.
predict_transitions(models, newdata, tmat, times=365,
start_times = c(0.25, 0.5, 0.75) * 365)
#> age dissub start_time end_time start_state transplant pr rfs
#> 1 20-40 AML 91.25 365 transplant 0.7004280 0.11672802 0.18284397
#> 2 20-40 AML 91.25 365 pr 0.0000000 0.83214486 0.16785514
#> 3 20-40 AML 91.25 365 rfs 0.0000000 0.00000000 1.00000000
#> 4 20-40 AML 182.50 365 transplant 0.8140109 0.07582872 0.11016035
#> 5 20-40 AML 182.50 365 pr 0.0000000 0.89807465 0.10192535
#> 6 20-40 AML 182.50 365 rfs 0.0000000 0.00000000 1.00000000
#> 7 20-40 AML 273.75 365 transplant 0.9129281 0.03733323 0.04973863
#> 8 20-40 AML 273.75 365 pr 0.0000000 0.95161112 0.04838888
#> 9 20-40 AML 273.75 365 rfs 0.0000000 0.00000000 1.00000000
Finally, any number of \(s\) and \(t\) values can be combined, provided that every \(s\) is less than \(\min(t)\).
predict_transitions(models, newdata, tmat, times=seq(2)*365,
start_times = c(0.25, 0.5, 0.75) * 365)
#> age dissub start_time end_time start_state transplant pr
#> 1 20-40 AML 91.25 365 transplant 0.6988764 0.11940075
#> 2 20-40 AML 91.25 365 pr 0.0000000 0.83368648
#> 3 20-40 AML 91.25 365 rfs 0.0000000 0.00000000
#> 4 20-40 AML 91.25 730 transplant 0.5253034 0.16185768
#> 5 20-40 AML 91.25 730 pr 0.0000000 0.72325398
#> 6 20-40 AML 91.25 730 rfs 0.0000000 0.00000000
#> 7 20-40 AML 182.50 365 transplant 0.8137669 0.07776576
#> 8 20-40 AML 182.50 365 pr 0.0000000 0.90012034
#> 9 20-40 AML 182.50 365 rfs 0.0000000 0.00000000
#> 10 20-40 AML 182.50 730 transplant 0.6116596 0.13697101
#> 11 20-40 AML 182.50 730 pr 0.0000000 0.77986935
#> 12 20-40 AML 182.50 730 rfs 0.0000000 0.00000000
#> 13 20-40 AML 273.75 365 transplant 0.9103860 0.03969400
#> 14 20-40 AML 273.75 365 pr 0.0000000 0.95271779
#> 15 20-40 AML 273.75 365 rfs 0.0000000 0.00000000
#> 16 20-40 AML 273.75 730 transplant 0.6842824 0.11432028
#> 17 20-40 AML 273.75 730 pr 0.0000000 0.82426667
#> 18 20-40 AML 273.75 730 rfs 0.0000000 0.00000000
#> rfs
#> 1 0.18172285
#> 2 0.16631352
#> 3 1.00000000
#> 4 0.31283895
#> 5 0.27674602
#> 6 1.00000000
#> 7 0.10846736
#> 8 0.09987966
#> 9 1.00000000
#> 10 0.25136936
#> 11 0.22013065
#> 12 1.00000000
#> 13 0.04991999
#> 14 0.04728221
#> 15 1.00000000
#> 16 0.20139729
#> 17 0.17573333
#> 18 1.00000000
Note that obtaining these additional probabilities does not increase the runtime of the function.
It’s useful to be able to estimate transition probabilities for multiple individuals at once, for example to see how the outcomes differ for patients with different characteristics. predict_transitions handles this directly: each row supplied to newdata is treated as a separate individual.
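The newdata_multi object used in the next call is not defined in this text. Judging from the rows of the printed output, it could plausibly be built as follows; the exact construction is an assumption.

```r
# Assumed definition, inferred from the printed output:
# one row per individual, one column per model covariate.
newdata_multi <- data.frame(age=c('20-40', '>40'),
                            dissub=c('AML', 'CML'))
newdata_multi
```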
predict_transitions(models, newdata_multi, tmat, times=365)
#> age dissub start_time end_time start_state transplant pr rfs
#> 1 20-40 AML 0 365 transplant 0.4703591 0.1913274 0.3383134
#> 2 20-40 AML 0 365 pr 0.0000000 0.6857389 0.3142611
#> 3 20-40 AML 0 365 rfs 0.0000000 0.0000000 1.0000000
#> 4 >40 CML 0 365 transplant 0.4301921 0.1976391 0.3721687
#> 5 >40 CML 0 365 pr 0.0000000 0.6524943 0.3475057
#> 6 >40 CML 0 365 rfs 0.0000000 0.0000000 1.0000000
As with multiple times, pmatrix.simfs only handles a single individual at a time.
pmatrix.simfs(models, tmat, newdata=newdata_multi, t=365)
#> Error in pars.fmsm(x = x, trans = trans, newdata = newdata, tvar = tvar): `newdata` has 2 rows. It must either have one row, or one row for each of the 3 allowed transitions
And the user has to manually iterate through each new individual they would like to estimate transition probabilities for.
do.call('rbind', lapply(seq(nrow(newdata_multi)), function(i) {
pmatrix.simfs(models, tmat, newdata=newdata_multi[i, ], t=365)
}))
#> [,1] [,2] [,3]
#> [1,] 0.47075 0.19079 0.33846
#> [2,] 0.00000 0.68897 0.31103
#> [3,] 0.00000 0.00000 1.00000
#> [4,] 0.42750 0.19910 0.37340
#> [5,] 0.00000 0.65231 0.34769
#> [6,] 0.00000 0.00000 1.00000
The Markov assumption has already been violated by the use of a clock-reset time-scale, which is why we are using simulation in the first place. We can therefore add another violation without it affecting our methodology. Owing to the use of clock-reset, the model does not take time-since-transplant into account for patients who have platelet recovery. This could be an important prognostic factor in that individual’s survival. Similar scenarios are common in multi-state modelling, and such covariates are termed state-arrival times. We’ll make a new set of models, where the transition from pr to rfs (transition 3) takes time-since-transplant into account. This information is already held in the Tstart variable produced by msprep.
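# Only the pr -> rfs transition (3) includes the state-arrival time Tstart.
# Note: no subset is supplied, so each model is fitted to every row of long.
models_arrival <- lapply(1:3, function(i) {
    if (i == 3) {
        flexsurvreg(Surv(time, status) ~ age + dissub + Tstart, data=long, dist='weibull')
    } else {
        flexsurvreg(Surv(time, status) ~ age + dissub, data=long, dist='weibull')
    }
})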
Looking at the coefficient for this variable, it does seem to be prognostic for time-to-rfs.
models_arrival[[3]]
#> Call:
#> flexsurvreg(formula = Surv(time, status) ~ age + dissub + Tstart,
#> data = long, dist = "weibull")
#>
#> Estimates:
#> data mean est L95% U95% se exp(est)
#> shape NA 4.75e-01 4.58e-01 4.92e-01 8.64e-03 NA
#> scale NA 1.97e+03 1.53e+03 2.55e+03 2.59e+02 NA
#> age20-40 4.76e-01 5.95e-02 -2.01e-01 3.20e-01 1.33e-01 1.06e+00
#> age>40 3.28e-01 -4.25e-01 -7.03e-01 -1.47e-01 1.42e-01 6.54e-01
#> dissubALL 2.07e-01 -2.37e-01 -4.90e-01 1.71e-02 1.29e-01 7.89e-01
#> dissubCML 3.99e-01 3.34e-01 1.23e-01 5.45e-01 1.08e-01 1.40e+00
#> Tstart 7.95e+00 3.27e-02 2.64e-02 3.90e-02 3.22e-03 1.03e+00
#> L95% U95%
#> shape NA NA
#> scale NA NA
#> age20-40 8.18e-01 1.38e+00
#> age>40 4.95e-01 8.63e-01
#> dissubALL 6.13e-01 1.02e+00
#> dissubCML 1.13e+00 1.73e+00
#> Tstart 1.03e+00 1.04e+00
#>
#> N = 5577, Events: 2010, Censored: 3567
#> Total time at risk: 2940953
#> Log-likelihood = -15286.67, df = 7
#> AIC = 30587.34
To estimate transition probabilities for models with state-arrival times, the relevant variables need to be included in newdata with an initial value, i.e. the value each variable has when the global clock is 0.
Then in predict_transitions simply specify which variables in newdata are time-dependent via the tcovs argument; these are the variables that increment at each transition along with the current clock value. This is particularly useful for modelling patient age at each state entry, rather than at the starting state. Notice how this slightly changes the probability of being in rfs for a person starting in transplant, compared to the second call below, which omits the tcovs argument.
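The newdata_arrival object is not defined in this text; the printed output suggests it is the same individual as before with Tstart initialised to 0. The base-R sketch below builds it on that assumption and illustrates how a time-dependent covariate would be updated on arrival in pr. The Weibull parameters and the coefficient beta_tstart are hypothetical, and the update rule is a reading of the description above rather than the package's code.

```r
# Assumed definition, inferred from the printed output columns.
newdata_arrival <- data.frame(age='20-40', dissub='AML', Tstart=0)

set.seed(3)
n   <- 1000
t12 <- rweibull(n, shape=0.8, scale=400)   # hypothetical time of entering pr

# On arrival in pr, the covariate equals its initial value plus the clock time
tstart_at_pr <- newdata_arrival$Tstart + t12

# Hypothetical accelerated-failure-time effect of Tstart on the Weibull scale
beta_tstart <- 0.001
scale23 <- 1500 * exp(beta_tstart * tstart_at_pr)

# Sojourn time in pr now depends on when pr was entered
t23 <- rweibull(n, shape=0.7, scale=scale23)
summary(t23)
```

Without this update every individual would keep Tstart at its initial value of 0, so the arrival time would have no influence on the simulated pr to rfs times.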
predict_transitions(models_arrival, newdata_arrival, tmat, times=365, tcovs='Tstart')
#> age dissub Tstart start_time end_time start_state transplant pr
#> 1 20-40 AML 0 0 365 transplant 0.4702542 0.2184094
#> 2 20-40 AML 0 0 365 pr 0.0000000 0.6485438
#> 3 20-40 AML 0 0 365 rfs 0.0000000 0.0000000
#> rfs
#> 1 0.3113364
#> 2 0.3514562
#> 3 1.0000000
predict_transitions(models_arrival, newdata_arrival, tmat, times=365)
#> age dissub Tstart start_time end_time start_state transplant pr
#> 1 20-40 AML 0 0 365 transplant 0.4693722 0.1827241
#> 2 20-40 AML 0 0 365 pr 0.0000000 0.6512788
#> 3 20-40 AML 0 0 365 rfs 0.0000000 0.0000000
#> rfs
#> 1 0.3479037
#> 2 0.3487212
#> 3 1.0000000
This functionality is implemented in pmatrix.simfs, but the tcovs argument actually has no impact on the transition probabilities, as evidenced below.
Sometimes greater flexibility in the model structure is required, so that every transition isn’t obliged to use the same distribution. This could be useful if any transitions have few observations and would benefit from a simpler model such as an exponential, or if there is a requirement to use existing models from literature. Furthermore, if prediction is the goal, then it could be the case that allowing different distributions for each transition offers better overall fit.
An example is shown below, where each transition uses a different distribution family.
# One distribution family per transition: Weibull, exponential, log-normal
models_mix <- lapply(1:3, function(i) {
if (i == 1) {
flexsurvreg(Surv(time, status) ~ age + dissub, data=long, dist='weibull')
} else if (i == 2) {
flexsurvreg(Surv(time, status) ~ age + dissub, data=long, dist='exp')
} else {
flexsurvreg(Surv(time, status) ~ age + dissub, data=long, dist='lnorm')
}
})
predict_transitions handles these cases with no problems; currently the following distributions are supported:
predict_transitions(models_mix, newdata, tmat, times=365)
#> age dissub start_time end_time start_state transplant pr rfs
#> 1 20-40 AML 0 365 transplant 0.5396042 0.2042033 0.2561925
#> 2 20-40 AML 0 365 pr 0.0000000 0.6511827 0.3488173
#> 3 20-40 AML 0 365 rfs 0.0000000 0.0000000 1.0000000
pmatrix.simfs does not seem to function correctly under these situations.
Similarly, the length of stay functionality provided by totlos.simfs has also been extended to allow for estimates at multiple time-points, states, and individuals to be calculated at the same time. As shown below, the function parameters are very similar and the estimates are very close to those produced by totlos.simfs.
length_of_stay(models,
newdata=newdata,
tmat, times=365.25*3,
start_state='transplant')
#> age dissub t start_state transplant pr rfs
#> 1 20-40 AML 1095.75 transplant 484.7132 209.4309 401.6059
totlos.simfs(models, tmat, t=365.25*3, start=1, newdata=newdata)
#> 1 2 3
#> 484.7266 205.2006 405.8228
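Length of stay can be understood as the expected time spent in each state up to \(t\), which is why the three values in each output above add up to the time horizon. A base-R sketch with hypothetical Weibull parameters (not the fitted models, and not the package's implementation) shows the calculation for individuals starting in transplant:

```r
# Hypothetical Weibull parameters (not fitted values).
set.seed(2)
n     <- 10000
t_end <- 3 * 365.25   # time horizon

t12 <- rweibull(n, shape=0.8, scale=400)    # transplant -> pr
t13 <- rweibull(n, shape=0.9, scale=2000)   # transplant -> rfs
t23 <- rweibull(n, shape=0.7, scale=1500)   # pr -> rfs (clock-reset)

to_pr     <- t12 < t13
exit_time <- pmin(t12, t13)

# Time spent in each state within [0, t_end] for every simulated individual
los_transplant <- pmin(exit_time, t_end)
los_pr  <- ifelse(to_pr, pmin(t12 + t23, t_end) - pmin(t12, t_end), 0)
los_rfs <- t_end - los_transplant - los_pr

# Averaging over individuals gives the expected length of stay
los <- c(transplant=mean(los_transplant), pr=mean(los_pr), rfs=mean(los_rfs))
los
```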
Rather than provide an example for each argument like in the previous section, the code chunk below demonstrates that vectors can be provided to both times and start_state, and that newdata accepts a data frame with multiple rows.
length_of_stay(models,
newdata=data.frame(age=c(">40", ">40"),
dissub=c('CML', 'AML')),
tmat, times=c(1, 3, 5)*365.25,
start_state=c('transplant', 'pr'))
#> age dissub t start_state transplant pr rfs
#> 1 >40 CML 365.25 transplant 104.02196 30.25057 48.35247
#> 2 >40 AML 365.25 transplant 97.92466 31.55238 53.14797
#> 3 >40 CML 365.25 pr NA 136.84682 45.77818
#> 4 >40 AML 365.25 pr NA 131.94277 50.68223
#> 5 >40 CML 1095.75 transplant 220.88007 105.18169 221.81324
#> 6 >40 AML 1095.75 transplant 200.88641 106.21248 240.77611
#> 7 >40 CML 1095.75 pr NA 342.03048 205.84452
#> 8 >40 AML 1095.75 pr NA 322.30169 225.57331
#> 9 >40 CML 1826.25 transplant 295.58761 176.58564 440.95176
#> 10 >40 AML 1826.25 transplant 263.29604 174.98446 474.84450
#> 11 >40 CML 1826.25 pr NA 505.47031 407.65469
#> 12 >40 AML 1826.25 pr NA 469.13479 443.99021
Another feature in multistateutils is a visualization of a predicted pathway through the state transition model, calculated using dynamic prediction and provided in the function plot_predicted_pathway. It estimates state occupancy probabilities at discrete time-points and displays the flow between them in the manner of a Sankey diagram.
This visualization, an example of which is shown below for the 20-40 year old AML patient with biennial time-points, differs from traditional stacked line graph plots that only display estimates conditioned on a single time-point and starting state, i.e. a fixed \(s\) and \(h\) in the transition probability specification. plot_predicted_pathway instead displays dynamic predictions, where both \(s\) and \(h\) are allowed to vary and are updated at each time-point.
Note that the image below is actually an HTML widget and therefore interactive; try moving the states around. In the future I might try to implement a default optimal layout, along with explicitly displaying the time-scale.
\[P_{h,j}(s, t) = \Pr(X(t) = j\ |\ X(s) = h)\]
time_points <- seq(0, 10, by=2) * 365.25
plot_predicted_pathway(models, tmat, newdata, time_points, 1)
In addition to predicting an individual’s progression through the statespace, we can also simulate an entire cohort’s passage. This is useful for situations where we have a heterogeneous group and are interested in obtaining estimates of measures such as the amount of time spent in each state for individuals with certain covariates. A common use is in health economic modelling, where multi-state models are used to represent patient treatment pathways with costs associated with each treatment state. The application of multi-state modelling in these contexts is often referred to as discrete event simulation and can be used to estimate the total number of patients receiving a certain treatment in a given timeframe, or survival rates of individuals with certain characteristics.
The cohort_simulation function provides this functionality, and is specified very similarly to the other functions in this package, requiring:
- a list of flexsurv parametric models
- a data frame of the individuals to simulate
- a transition matrix
The output is a long data frame where each row corresponds to an individual entering a new state. The first rows below show that every individual enters the system in state 1 at time 0, which is the default behaviour.
head(sim)
#> id age dissub state time
#> 1 0 >40 CML transplant 0
#> 2 2 >40 CML transplant 0
#> 3 6 20-40 CML transplant 0
#> 4 14 >40 CML transplant 0
#> 5 30 <=20 ALL transplant 0
#> 6 62 <=20 ALL transplant 0
These initial conditions can be changed; for example, the start_state argument accepts either a single value representing the state that everyone enters in, or a vector of values with as many entries as there are observations in newdata. The simulation below randomly assigns each patient, with equal probability, to start in either the initial state (transplant) or platelet recovery (state 2).
sim2 <- cohort_simulation(models, ebmt3, tmat,
start_state = sample(c(1, 2), nrow(ebmt3), replace=T))
head(sim2)
#> id age dissub state time
#> 1 0 >40 CML transplant 0
#> 2 2 >40 CML transplant 0
#> 3 6 20-40 CML pr 0
#> 4 14 >40 CML pr 0
#> 5 30 <=20 ALL transplant 0
#> 6 62 <=20 ALL transplant 0
Likewise, the individuals don’t have to enter the simulation at \(t=0\). The start_time parameter is specified in the same manner as start_state, accepting either a single value or a vector containing a time for each individual. The example below shows the case where one individual enters the system every 10 days; in the ebmt3 dataset we’ve been using, entering the system corresponds to having a transplant.
sim3 <- cohort_simulation(models, ebmt3, tmat,
start_state = sample(c(1, 2), nrow(ebmt3), replace=T),
start_time = seq(0, 10*(nrow(ebmt3)-1), by=10))
head(sim3)
#> id age dissub state time
#> 1 0 >40 CML pr 0.0000000
#> 2 0 >40 CML rfs 0.4759826
#> 3 1 >40 CML transplant 10.0000000
#> 4 2 >40 CML transplant 20.0000000
#> 5 1 >40 CML rfs 27.6222367
#> 6 3 20-40 AML transplant 30.0000000
It is often useful to run a simulation over a set time-period; for example, if we are interested in looking at the cost to the health care provider from treating a particular disease over 10 years. The time_limit argument allows for this use-case by terminating the simulation at the given time.
The model below uses the same incidence model of a transplant every ten days but now terminates at 10 years, which is reflected in the simulation output.
sim4 <- cohort_simulation(models, ebmt3, tmat,
start_state = sample(c(1, 2), nrow(ebmt3), replace=T),
start_time = seq(0, 10*(nrow(ebmt3)-1), by=10),
time_limit = 10*365.25)
tail(sim4)
#> id age dissub state time
#> 652 190 20-40 CML rfs 3629.890
#> 653 363 20-40 ALL transplant 3630.000
#> 654 357 >40 ALL rfs 3639.293
#> 655 364 20-40 AML transplant 3640.000
#> 656 311 20-40 ALL rfs 3644.309
#> 657 365 20-40 AML transplant 3650.000
One challenge with using simulation to obtain estimates is that it is possible to generate unrealistic situations, such as a person living for several hundred years, due to the use of unbounded probability distributions to model event times.
For example, in the output of the first cohort simulation from the previous section, the longest-surviving patient dies 955 years after the transplant!
library(dplyr)  # provides %>% and arrange()
sim %>%
    arrange(desc(time)) %>%
    head()
#> id age dissub state time
#> 1 223 20-40 AML rfs 348827.3
#> 2 254 20-40 ALL rfs 204135.0
#> 3 360 20-40 CML rfs 147332.2
#> 4 51 <=20 AML rfs 136822.7
#> 5 1884 20-40 CML rfs 124720.5
#> 6 1927 20-40 CML rfs 113875.6
To combat this, each of predict_transitions, length_of_stay, and cohort_simulation allows a hard limit to be specified, after which a patient is considered dead, effectively placing bounds on the transition time distributions.
This is achieved through 3 arguments:
- agelimit: either FALSE, in which case no limit is applied, or a numeric value giving the requested limit
- agecol: the column in newdata that holds the patient age at entry to the simulation
- agescale: often, age is measured on a different time-scale to the rest of the study. For example, in many health studies age will be measured in years while study time will be recorded in months or days. This argument provides the scaling factor to be applied to the individual’s age to place it on the same time-scale as the study. Defaults to 1.
The example below shows how to use this in practice. Firstly, however, a dummy continuous age covariate needs to be added, as ebmt3 only provides age groups. Here we are saying that anyone is considered dead at the age of 100 (not 100 years after having the transplant).
# Add a dummy continuous age in years, drawn uniformly within each age group
n_lt20 <- sum(ebmt3$age == '<=20')
n_gt20 <- sum(ebmt3$age == '20-40')
n_gt40 <- sum(ebmt3$age == '>40')
ebmt3$age_cont <- 0
ebmt3$age_cont[ebmt3$age == '<=20'] <- runif(n_lt20, 1, 20)
ebmt3$age_cont[ebmt3$age == '20-40'] <- runif(n_gt20, 21, 40)
ebmt3$age_cont[ebmt3$age == '>40'] <- runif(n_gt40, 40, 80)
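# age_cont is in years while the simulation clock runs in days,
# so ages are scaled by 365.25; the limit of 36525 days is 100 years
sim5 <- cohort_simulation(models, ebmt3, tmat,
                          agelimit=36525, agecol='age_cont', agescale=365.25)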
The maximum state entry time is now 96 years, which means they died at the age of 100, showing that the hard limit is working.
NB: these arguments are also in the predict_transitions and length_of_stay functions, although they are less useful there.
sim5 %>%
arrange(desc(time)) %>%
head()
#> id age dissub age_cont state time
#> 1 2137 <=20 CML 4.036259 oldage 35050.76
#> 2 21 <=20 AML 6.600455 oldage 34114.18
#> 3 2020 <=20 CML 8.131360 oldage 33555.02
#> 4 514 <=20 CML 8.141476 oldage 33551.33
#> 5 1969 <=20 AML 9.125054 oldage 33192.07
#> 6 458 <=20 AML 11.920130 oldage 32171.17
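The oldage entry times printed above can be checked by hand. Assuming the time to the limit is the age limit minus the entry age converted to days with a factor of 365.25 (an assumption about the calculation, though it matches the printed values), the first three rows work out as follows:

```r
agelimit <- 36525    # 100 years expressed in days
agescale <- 365.25   # assumed conversion from years to days

# Entry ages (years) of the first three rows of the output above
age_cont <- c(4.036259, 6.600455, 8.131360)

# Days from entry until the age limit is reached
time_to_limit <- agelimit - age_cont * agescale
round(time_to_limit, 2)

# The largest value expressed in years since entry
round(max(time_to_limit) / 365.25, 1)
```

The youngest patient therefore reaches the limit roughly 96 years after entry, in line with the maximum state entry time quoted above.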
msprep2
A large part of working with multi-state models involves converting raw data into a format suitable for transition-specific analysis. The mstate package provides the msprep function to aid with this; it converts from a wide data frame, where each row corresponds to a given individual, to a long format where each row relates to a possible state transition (observed or not).
As an example using the same ebmt3 dataset, we have the initial dataset in a wide format with:
- a patient identifier (id)
- two possible states that can be entered, each specified by an entry time (prtime and rfstime) and an entry indicator (prstat, rfsstat)
- individual level covariates (dissub, age, drmatch, tcd).
head(ebmt3)
#> id prtime prstat rfstime rfsstat dissub age drmatch tcd
#> 1 1 23 1 744 0 CML >40 Gender mismatch No TCD
#> 2 2 35 1 360 1 CML >40 No gender mismatch No TCD
#> 3 3 26 1 135 1 CML >40 No gender mismatch No TCD
#> 4 4 22 1 995 0 AML 20-40 No gender mismatch No TCD
#> 5 5 29 1 422 1 AML 20-40 No gender mismatch No TCD
#> 6 6 38 1 119 1 ALL >40 No gender mismatch No TCD
#> age_cont
#> 1 75.70497
#> 2 68.21887
#> 3 55.11338
#> 4 37.66485
#> 5 33.54305
#> 6 74.45950
msprep then uses the transition matrix to form a data frame with one row per possible transition. For example, rows 1 and 2 reflect that individual 1 was in state 1 at time 0 and moved into state 2 at 23 days, thereby censoring the transition from 1->3 at the same time-point.
long <- msprep(time=c(NA, 'prtime', 'rfstime'),
status=c(NA, 'prstat', 'rfsstat'),
data=ebmt3,
trans=tmat,
keep=c('age', 'dissub'))
head(long)
#> An object of class 'msdata'
#>
#> Data:
#> id from to trans Tstart Tstop time status age dissub
#> 1 1 1 2 1 0 23 23 1 >40 CML
#> 2 1 1 3 2 0 23 23 0 >40 CML
#> 3 1 2 3 3 23 744 721 0 >40 CML
#> 4 2 1 2 1 0 35 35 1 >40 CML
#> 5 2 1 3 2 0 35 35 0 >40 CML
#> 6 2 2 3 3 35 360 325 1 >40 CML
This is a very useful function since it saves a lot of time munging the data and is used in every multi-state modelling related analysis I do.
However, it does have one slight limitation, in that the required wide format of the input data isn’t necessarily a natural way of organising state entry data. See Wickham (2014) for a discussion of what makes data ‘tidy’, but in this situation the unit of observation is a state entry and so this is what should be recorded on the rows, not necessarily an individual. Often I have to spend time converting from my raw data, where each row corresponds to a state entry, to this wide format before msprep can be used. Furthermore, having one column per state means that this function doesn’t allow for reversible Markov chains, where a person enters the same state more than once.
To address this, multistateutils provides an alternative version of msprep that accepts data in long format. This function, unimaginatively called msprep2, requires a data frame with 3 columns: id, state, and time, so that each individual has as many rows as they have state entries.
Let’s show an example for the first 2 patients in ebmt3:
- Patient 1 enters pr at time 23 and has last follow-up at \(t=744\).
- Patient 2 enters pr at time 35 before entering rfs at \(t=360\).
ebmt3 %>% filter(id %in% 1:2)
#> id prtime prstat rfstime rfsstat dissub age drmatch tcd
#> 1 1 23 1 744 0 CML >40 Gender mismatch No TCD
#> 2 2 35 1 360 1 CML >40 No gender mismatch No TCD
#> age_cont
#> 1 75.70497
#> 2 68.21887
In long format this is more straightforward and keeps the fields to a minimum, which helps to focus on the states that are actually visited.
entry <- data.frame(id=c(1, 2, 2),
                    state=c(2, 2, 3),
                    time=c(23, 35, 360))
entry
#> id state time
#> 1 1 2 23
#> 2 2 2 35
#> 3 2 3 360
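When the raw data are already in the wide ebmt3 layout, the long entry table can be derived rather than typed by hand. The following is a minimal base R sketch under stated assumptions: `wide_to_entry` is a hypothetical helper written for this illustration (it is not part of multistateutils or mstate), the toy `wide` data frame simply copies the first two patients shown above, and a status of 1 is taken to mean that the state was actually entered.

```r
# Toy wide-format data copying the first two patients of ebmt3 shown above
wide <- data.frame(id      = c(1, 2),
                   prtime  = c(23, 35),
                   prstat  = c(1, 1),
                   rfstime = c(744, 360),
                   rfsstat = c(0, 1))

# Hypothetical helper (illustration only): keep one row per observed
# state entry, i.e. wherever the status indicator equals 1
wide_to_entry <- function(wide, states, time_cols, status_cols) {
  pieces <- lapply(seq_along(states), function(i) {
    entered <- wide[[status_cols[i]]] == 1
    data.frame(id    = wide$id[entered],
               state = rep(states[i], sum(entered)),
               time  = wide[[time_cols[i]]][entered])
  })
  out <- do.call(rbind, pieces)
  # Order chronologically within each individual
  out <- out[order(out$id, out$time), ]
  rownames(out) <- NULL
  out
}

entry_from_wide <- wide_to_entry(wide,
                                 states      = c(2, 3),
                                 time_cols   = c('prtime', 'rfstime'),
                                 status_cols = c('prstat', 'rfsstat'))
entry_from_wide
```

For these two patients the result has the same three rows as the hand-written entry data frame. The sketch does not attempt to handle tied entry times within an individual.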
Passing this into msprep2 produces an output that looks similar, but not identical, to the one from msprep. The discrepancy is in patient 1, as their right-censored transition from state 2->3 is no longer included.
msprep2(entry, tmat)
#> An object of class 'msdata'
#>
#> Data:
#> id from to trans Tstart Tstop time status
#> 1 1 1 2 1 0 23 23 1
#> 2 1 1 3 2 0 23 23 0
#> 3 2 1 2 1 0 35 35 1
#> 4 2 1 3 2 0 35 35 0
#> 5 2 2 3 3 35 360 325 1
Censored observations are included by supplying a data frame to the censors argument with fields id and censor_time. Note that below we only add a value for patient 1, since we have complete follow-up on patient 2.
This is cleaner than msprep, where all states that aren’t visited need to have a censored observation time supplied, even if a patient has entered a sink state, while here only a single last follow-up time per patient is required.
cens <- data.frame(id=1, censor_time=744)
msprep2(entry, tmat, censors = cens)
#> An object of class 'msdata'
#>
#> Data:
#> id from to trans Tstart Tstop time status
#> 1 1 1 2 1 0 23 23 1
#> 2 1 1 3 2 0 23 23 0
#> 3 1 2 3 3 23 744 721 0
#> 4 2 1 2 1 0 35 35 1
#> 5 2 1 3 2 0 35 35 0
#> 6 2 2 3 3 35 360 325 1
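The censoring table can likewise be derived from wide data. This is a small base R sketch, assuming (as in the ebmt3 example above) that rfs is the only sink state and that rfstime holds the last follow-up time whenever rfsstat is 0. The toy `wide` data frame and the name `cens_from_wide` are introduced here purely for illustration.

```r
# Toy wide-format data copying the relevant columns for patients 1 and 2
wide <- data.frame(id      = c(1, 2),
                   rfstime = c(744, 360),
                   rfsstat = c(0, 1))

# Patients who never reached the sink state are censored at last follow-up;
# patients who did reach it need no censoring row at all
not_absorbed <- wide$rfsstat == 0
cens_from_wide <- data.frame(id          = wide$id[not_absorbed],
                             censor_time = wide$rfstime[not_absorbed])
cens_from_wide
```

Only patient 1 appears in the result, with a censor time of 744, which is the same content as the cens data frame built by hand above.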
The final difference from the msprep output is the lack of covariates. As with the censoring times, these are supplied as a data frame indexed by id, with the remaining columns being any covariates of interest.
This method of supplying three separate tidy data frames for the state entry times, censor times, and covariates is consistent with how data is stored in relational databases and so should be familiar to most people already.
covars <- ebmt3 %>% filter(id %in% 1:2) %>% select(id, age, dissub)
msprep2(entry, tmat, censors = cens, covars = covars)
#> An object of class 'msdata'
#>
#> Data:
#> id from to trans Tstart Tstop time status age dissub
#> 1 1 1 2 1 0 23 23 1 >40 CML
#> 2 1 1 3 2 0 23 23 0 >40 CML
#> 3 1 2 3 3 23 744 721 0 >40 CML
#> 4 2 1 2 1 0 35 35 1 >40 CML
#> 5 2 1 3 2 0 35 35 0 >40 CML
#> 6 2 2 3 3 35 360 325 1 >40 CML
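Conceptually, attaching covariates indexed by id is a left join of the transition rows onto the covariate table. The base R sketch below illustrates that idea on toy data copied from the output above; it is not necessarily how msprep2 does this internally, and the object names are introduced here for illustration only.

```r
# Toy transition rows (a subset of the columns shown above)
long_toy <- data.frame(id    = c(1, 1, 1, 2, 2, 2),
                       trans = c(1, 2, 3, 1, 2, 3))

# One row of covariates per individual
covars_toy <- data.frame(id     = c(1, 2),
                         age    = c('>40', '>40'),
                         dissub = c('CML', 'CML'))

# Left join: every transition row keeps its place and gains the
# covariates of the matching individual
merged <- merge(long_toy, covars_toy, by = 'id', all.x = TRUE)
merged
```

Because the join is keyed on id alone, each covariate value is repeated on every transition row belonging to that individual, as in the msprep2 output above.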
An additional benefit of using a long data frame for state entries is that it allows reversible transitions. As a quick demonstration, let us consider an extension of the illness-death model where a person can be cured, i.e. transition back from illness->healthy.
states <- c('healthy', 'illness', 'death')
tmat2 <- matrix(c(NA, 3, NA, 1, NA, NA, 2, 4, NA), nrow=3, ncol=3,
                dimnames=list(states, states))
tmat2
#> healthy illness death
#> healthy NA 1 2
#> illness 3 NA 4
#> death NA NA NA
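Since matrix() fills column by column, the vector above can be hard to read against the printed matrix. As a sketch, the same matrix can be written row by row with byrow=TRUE and then unpacked into a list of numbered transitions to check that it encodes what was intended. The `transitions` data frame is introduced here only for illustration.

```r
states <- c('healthy', 'illness', 'death')

# Same transition matrix, written row by row so it reads like the printout
tmat2 <- matrix(c(NA,  1,  2,
                   3, NA,  4,
                  NA, NA, NA),
                nrow=3, byrow=TRUE, dimnames=list(states, states))

# Unpack the non-NA cells into one row per permitted transition
idx <- which(!is.na(tmat2), arr.ind=TRUE)
transitions <- data.frame(trans = tmat2[idx],
                          from  = states[idx[, 'row']],
                          to    = states[idx[, 'col']])
transitions <- transitions[order(transitions$trans), ]
rownames(transitions) <- NULL
transitions
```

This lists transitions 1 to 4 as healthy to illness, healthy to death, illness to healthy, and illness to death, with no transitions out of death.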
I’ll generate two individuals:
multistate_entry <- data.frame(id=c(rep(1, 2),
                                    rep(2, 4)),
                               state=c('illness', 'death',
                                       'illness', 'healthy', 'illness', 'death'),
                               time=c(6, 11,
                                      7, 12, 17, 22))
multistate_entry
#> id state time
#> 1 1 illness 6
#> 2 1 death 11
#> 3 2 illness 7
#> 4 2 healthy 12
#> 5 2 illness 17
#> 6 2 death 22
And as can be seen below, this works with msprep2.
msprep2(multistate_entry, tmat2)
#> An object of class 'msdata'
#>
#> Data:
#> id from to trans Tstart Tstop time status
#> 1 1 1 2 1 0 6 6 1
#> 2 1 1 3 2 0 6 6 0
#> 3 1 2 1 3 6 11 5 0
#> 4 1 2 3 4 6 11 5 1
#> 5 2 1 2 1 0 7 7 1
#> 6 2 1 3 2 0 7 7 0
#> 7 2 2 1 3 7 12 5 1
#> 8 2 2 3 4 7 12 5 0
#> 9 2 1 2 1 12 17 5 1
#> 10 2 1 3 2 12 17 5 0
#> 11 2 2 1 3 17 22 5 0
#> 12 2 2 3 4 17 22 5 1
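To make the structure of this output concrete, here is a hand-rolled base R sketch of the expansion logic: each state entry closes one interval, and that interval contributes one row per transition permitted out of the state the person was in. `expand_entries` is a hypothetical function written for this illustration and is not the msprep2 implementation. It assumes that everyone starts in state 1 at time 0, that states are given by name, and that there is no censoring.

```r
states <- c('healthy', 'illness', 'death')
tmat2 <- matrix(c(NA, 1, 2, 3, NA, 4, NA, NA, NA), nrow=3, byrow=TRUE,
                dimnames=list(states, states))
multistate_entry <- data.frame(
  id    = c(1, 1, 2, 2, 2, 2),
  state = c('illness', 'death', 'illness', 'healthy', 'illness', 'death'),
  time  = c(6, 11, 7, 12, 17, 22),
  stringsAsFactors = FALSE)

# Hypothetical illustration of the expansion, not the package's code
expand_entries <- function(entry, tmat, start_state = 1) {
  rows <- list()
  for (pid in unique(entry$id)) {
    visits <- entry[entry$id == pid, ]
    visits <- visits[order(visits$time), ]
    from <- start_state
    tstart <- 0
    for (i in seq_len(nrow(visits))) {
      entered <- match(visits$state[i], colnames(tmat))
      # All states reachable from the current one
      targets <- unname(which(!is.na(tmat[from, ])))
      if (length(targets) == 0) break  # sink state: nothing left to expand
      rows[[length(rows) + 1]] <- data.frame(
        id     = pid,
        from   = from,
        to     = targets,
        trans  = unname(tmat[from, targets]),
        Tstart = tstart,
        Tstop  = visits$time[i],
        time   = visits$time[i] - tstart,
        # Only the transition actually taken is an event
        status = as.integer(targets == entered))
      from <- entered
      tstart <- visits$time[i]
    }
  }
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

long2 <- expand_entries(multistate_entry, tmat2)
long2
```

For these two individuals the sketch yields 12 rows with the same from, to, trans, Tstart, Tstop, time and status values as the msprep2 output above. Individual 2 contributes two separate spells at risk in the healthy state, which a one-column-per-state wide format cannot represent.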
de Wreede, Liesbeth C, Marta Fiocco, and Hein Putter. 2011. “Mstate: An R Package for the Analysis of Competing Risks and Multi-State Models.” Journal of Statistical Software 38.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59.