pencal is an R package created to make it easy and efficient to estimate and apply Penalized Regression Calibration.
Penalized Regression Calibration (PRC) is a statistical method for computing predictions of a survival outcome from a set of high-dimensional, longitudinal covariates.
The methodology behind PRC is described in detail in the following article:
Signorelli, M., Spitali, P., Al-Khalili Sgyziarto, C., The Mark-MD Consortium, Tsonaka, R. (in review). NB: an arXiv preprint will be available soon!
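In short, PRC comprises three modelling steps:
1. in the first step, the trajectories of the longitudinal biomarkers are modelled with mixed effects models (linear mixed models in the case of the PRC-LMM);
2. in the second step, subject-specific summaries of those trajectories are computed, namely the predicted random effects from the models fitted in step 1;
3. in the third step, a penalized Cox model is estimated in which the summaries computed in step 2 are used as predictors of the survival outcome.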
These steps make it possible to estimate PRC and to obtain a model that can be used to compute predicted survival probabilities.
Additionally, one may want to quantify the predictive performance of the fitted model. To this end, pencal implements a Cluster Bootstrap Optimism Correction Procedure (CBOCP) that yields optimism-corrected estimates of the C index and of the time-dependent AUC of the fitted model. Depending on the dimensionality of your dataset, computing the CBOCP can be time-consuming; for this reason, the CBOCP can be parallelized over multiple cores.
Below you can see a graphical representation of the steps involved in the estimation of PRC (see the elements in the lightblue box) and in the computation of the CBOCP (elements in the salmon box).
Estimation of the PRC-LMM model described in Signorelli et al. (in review) can be performed using the following three functions:
- fit_lmms, which implements the first step of the estimation of the PRC-LMM;
- summarize_lmms, which carries out the second step;
- fit_prclmm, which performs the third step.
These functions are run sequentially, with the output of fit_lmms used as input for summarize_lmms, and the output of summarize_lmms as input for fit_prclmm.
Lastly, the function survpred_prclmm computes survival probabilities based on the fitted PRC-LMM.
Most of the computations required by the CBOCP are performed by fit_lmms, summarize_lmms and fit_prclmm. Because such computations may be time-consuming, these functions support parallel computing through the argument n.cores. The last step of the CBOCP is performed by the function performance_prclmm, which returns the naive and optimism-corrected estimates of the C index and of the time-dependent AUC.
Important note: if you just want to estimate the PRC model and do not wish to compute the CBOCP, simply set n.boots = 0 as argument of fit_lmms. If, instead, you do want to compute the CBOCP, set n.boots to the desired number of bootstrap samples (e.g., 100).
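As a minimal sketch of the sequential workflow described above, the chunk below fits the PRC-LMM without the CBOCP (n.boots = 0) on a small simulated dataset. The arguments of simulate_prclmm_data, fit_lmms and fit_prclmm follow the calls used elsewhere in this vignette. The call summarize_lmms(object = ..., n.cores = ...) and the use of a single core are assumptions made for illustration, so it is worth checking ?summarize_lmms for your installed version.

```r
library(pencal)

# simulate a small example dataset (same settings as used in this vignette)
simdata = simulate_prclmm_data(n = 100, p = 10, p.relev = 5,
                               lambda = 0.2, nu = 1.5, seed = 1234,
                               t.values = c(0, 0.2, 0.5, 1, 1.5, 2))
y.names = paste('marker', 1:10, sep = '')

# step 1: one LMM per biomarker; n.boots = 0 skips the CBOCP
step1 = fit_lmms(y.names = y.names,
                 fixefs = ~ age, ranefs = ~ age | id,
                 long.data = simdata$long.data,
                 surv.data = simdata$surv.data,
                 t.from.base = t.from.base,
                 n.boots = 0, n.cores = 1)

# step 2: predicted random effects (call signature assumed, see above)
step2 = summarize_lmms(object = step1, n.cores = 1)

# step 3: penalized Cox model on the summaries computed in step 2
step3 = fit_prclmm(object = step2, surv.data = simdata$surv.data,
                   baseline.covs = ~ baseline.age,
                   penalty = 'ridge', n.cores = 1)

class(step3$pcox.orig)
```

With n.boots = 0 only the model on the original data is fitted, so this is the cheapest way to check that the three functions chain together as expected.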
In addition to the functions mentioned above, pencal also comprises two functions that can be used to simulate example datasets:
- simulate_t_weibull, to simulate survival data from a Weibull model;
- simulate_prclmm_data, to simulate an example dataset for PRC-LMM that comprises a number of longitudinal biomarkers, a survival outcome and a censoring indicator.
To illustrate how pencal works, let us simulate an example dataset that comprises \(n = 100\) subjects, \(p = 10\) longitudinal biomarkers that are measured at \(t = 0, 0.2, 0.5, 1, 1.5, 2\) years from baseline, and a survival outcome that is associated with 5 (p.relev) of the 10 biomarkers:
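library(pencal)
set.seed(1234)
p = 10  # number of longitudinal biomarkers
# simulate n = 100 subjects; p.relev = 5 biomarkers are associated with survival
simdata = simulate_prclmm_data(n = 100, p = p, p.relev = 5,
                               lambda = 0.2, nu = 1.5,
                               seed = 1234, t.values = c(0, 0.2, 0.5, 1, 1.5, 2))
ls(simdata)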
## [1] "censoring.prop" "long.data" "surv.data"
Note that in this example we are setting \(n > p\), but pencal can handle both low-dimensional (\(n > p\)) and high-dimensional (\(n \leq p\)) datasets.
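To give an idea of what simulating survival data from a Weibull model involves, here is a minimal base R sketch based on inverse-transform sampling. It assumes a proportional hazards parametrization with baseline hazard \(h_0(t) = \lambda \nu t^{\nu - 1}\). Whether this matches the parametrization used internally by simulate_t_weibull is an assumption; the helper sim_weibull_ph and the uniform censoring mechanism are hypothetical and chosen only for illustration.

```r
# Hypothetical helper (not part of pencal): draws survival times with
# survival function S(t | x) = exp(-lambda * t^nu * exp(x' beta))
sim_weibull_ph = function(n, lambda, nu, X = NULL, beta = NULL) {
  lin.pred = if (is.null(X)) rep(0, n) else as.numeric(X %*% beta)
  u = runif(n)
  # solve S(t | x) = u for t
  (-log(u) / (lambda * exp(lin.pred)))^(1 / nu)
}

set.seed(1234)
X = matrix(rnorm(200), nrow = 100, ncol = 2)
t.event = sim_weibull_ph(n = 100, lambda = 0.2, nu = 1.5,
                         X = X, beta = c(0.5, -0.5))

# uniform censoring times (assumed mechanism), then the observed data
t.cens = runif(100, min = 1, max = 10)
toy.surv = data.frame(id = 1:100,
                      time = pmin(t.event, t.cens),
                      event = as.numeric(t.event <= t.cens))
head(toy.surv)
```

The resulting data frame uses the same variable names (id, time, event) that pencal expects for the survival data.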
In order to estimate the PRC-LMM, you need to provide the following two inputs:
1. a dataset in long format with the longitudinal data (here simdata$long.data), which should contain the subject identifier id, the longitudinal biomarkers (here called marker1, …, marker10), and the relevant time variables (in this example we will use age as covariate in the LMMs estimated in step 1, and baseline.age as covariate in the penalized Cox model estimated in step 3):
## id base.age t.from.base age marker1 marker2 marker3 marker4
## 1 1 4.269437 0.0 4.269437 1.5417408 4.452282 15.13419 5.809207
## 2 1 4.269437 0.2 4.469437 1.2346437 5.252873 15.98066 5.896591
## 3 1 4.269437 0.5 4.769437 2.0773929 3.714174 18.44501 6.634897
## 4 1 4.269437 1.0 5.269437 0.2137868 4.092887 20.23194 6.179779
## 5 1 4.269437 1.5 5.769437 1.3354611 5.032044 18.99531 5.914160
## 6 1 4.269437 2.0 6.269437 0.7811953 4.946483 22.19621 5.981212
## marker5 marker6 marker7 marker8 marker9 marker10
## 1 13.62807 -7.041495 15.19982 10.12011 2.7166023 15.16749
## 2 15.03320 -5.763194 16.25356 10.20624 1.2764132 13.11855
## 3 14.90197 -6.478355 17.40369 11.74692 1.9369628 13.91899
## 4 16.48330 -8.994558 18.44549 11.91967 1.5949944 15.50285
## 5 16.38560 -9.034169 19.61104 12.59247 1.3730259 15.86172
## 6 17.23651 -10.220797 19.71013 12.79891 -0.2164965 15.89041
# visualize the trajectories for a randomly picked biomarker
library(ptmixed)
ptmixed::make.spaghetti(x = age, y = marker5,
                        id = id, group = id,
                        data = simdata$long.data,
                        margins = c(4, 4, 2, 2),
                        legend.inset = -1)
2. a dataset with the survival data (here simdata$surv.data), which should contain the subject identifier id, the time-to-event outcome called time, and the binary event indicator called event (NB: make sure that the names of these three variables are indeed id, time and event!):
## id baseline.age time event
## 1 1 4.269437 0.8368389 0
## 2 2 4.705434 1.4288656 1
## 3 3 3.220979 1.6382975 1
## 4 4 4.379400 0.5809532 1
## 5 5 4.800329 0.2441706 1
## 6 6 3.394879 0.4901404 1
## [1] 0.22
## Loading required package: ggplot2
## Loading required package: ggpubr
# Kaplan-Meier estimate of the survival function
surv.obj = survival::Surv(time = simdata$surv.data$time,
                          event = simdata$surv.data$event)
kaplan = survival::survfit(surv.obj ~ 1,
                           type = "kaplan-meier")
survminer::ggsurvplot(kaplan, data = simdata$surv.data)
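Because the variable names matter, it can be convenient to check the two inputs before fitting anything. The helper below is a hypothetical sketch in base R and is not part of pencal. It assumes that the event indicator is coded 1 for an event and 0 for censoring, which is consistent with the output shown above but should be checked against the package documentation.

```r
# Hypothetical helper (not part of pencal): basic checks on the two inputs
check_prc_inputs = function(long.data, surv.data, y.names) {
  stopifnot(is.data.frame(long.data), is.data.frame(surv.data))
  missing.surv = setdiff(c('id', 'time', 'event'), names(surv.data))
  if (length(missing.surv) > 0) {
    stop('surv.data lacks: ', paste(missing.surv, collapse = ', '))
  }
  missing.long = setdiff(c('id', y.names), names(long.data))
  if (length(missing.long) > 0) {
    stop('long.data lacks: ', paste(missing.long, collapse = ', '))
  }
  # assumption: the event indicator is coded 1 = event, 0 = censored
  if (!all(surv.data$event %in% c(0, 1))) stop('event must be coded 0 / 1')
  if (!all(long.data$id %in% surv.data$id)) {
    stop('some subjects in long.data are missing from surv.data')
  }
  invisible(TRUE)
}

# toy inputs with two subjects and one biomarker
toy.long = data.frame(id = c(1, 1, 2, 2), age = c(4.2, 4.7, 3.9, 4.4),
                      marker1 = c(1.5, 1.2, 0.8, 1.1))
toy.surv = data.frame(id = 1:2, time = c(0.8, 1.4), event = c(0, 1))
check_prc_inputs(toy.long, toy.surv, y.names = 'marker1')
```

The function returns TRUE invisibly when all checks pass and stops with an informative message otherwise.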
Hereafter we show how to implement the three steps involved in the estimation of the PRC-LMM, along with the computation of the CBOCP.
Before doing that, let’s determine the number of cores that will be used for the computation of the CBOCP. In general you can use as many cores as are available to you; to do this, you can set n.cores to the number of cores detected on your machine (for instance, the value returned by parallel::detectCores()).
Since the CRAN Repository Policy allows us to use at most 2 cores when building the vignettes, in this example we will limit the number of cores used to 2 by setting n.cores = 2.
Be aware, however, that using more than 2 cores will speed computations up, and it is thus recommended. Several functions in pencal will actually return a warning when you perform computations using fewer cores than are available: the goal of such warnings is to remind you that you could use more cores to speed computations up; however, if you are purposely using a smaller number of cores you can ignore the warning.
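A minimal sketch of the two settings discussed above, using parallel::detectCores() from the parallel package that ships with R. The fallback to a single core when detection fails is an assumption added here for robustness.

```r
# number of cores detected on this machine (can be NA on some platforms)
available.cores = parallel::detectCores()

# option 1: use every available core (fall back to 1 if detection fails)
n.cores = if (is.na(available.cores)) 1 else available.cores

# option 2: cap the number of cores at 2, as done in this vignette
n.cores = min(2, n.cores)
n.cores
```

On your own machine you would typically stop after option 1 and skip the cap.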
In the first step, for each biomarker we estimate a linear mixed model (LMM) where the longitudinal biomarker levels \(y_{ij}\) depend on two fixed effects (an intercept \(\beta_0\) and a slope for age \(\beta_1\)), on a subject-specific random intercept \(u_{0i}\) and on a subject-specific random slope for age \(u_{1i}\). Denoting by \(a_{ij}\) the age of subject \(i\) at measurement \(j\) and by \(\varepsilon_{ij}\) the error term, the model is:
\[y_{ij} = \beta_0 + u_{0i} + \beta_1 a_{ij} + u_{1i} a_{ij} + \varepsilon_{ij}.\]
To do this in R we use the fit_lmms function:
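# names of the longitudinal biomarkers: marker1, ..., marker10
y.names = paste('marker', 1:p, sep = '')
# step 1: fit one LMM per biomarker (with n.boots > 0, also on the bootstrap samples)
step1 = fit_lmms(y.names = y.names,
                 fixefs = ~ age, ranefs = ~ age | id,
                 long.data = simdata$long.data,
                 surv.data = simdata$surv.data,
                 t.from.base = t.from.base,
                 n.boots = 10, n.cores = n.cores)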
## Sorting long.data by subject id
## Sorting surv.data by subject id
## Preliminary step: remove measurements taken after event / censoring.
## Removed: 209 measurements. Retained: 391 measurements.
## Estimating the LMMs on the original dataset...
## ...done
## Bootstrap procedure started
## This computation will be run in parallel, using 2 cores
## Bootstrap procedure finished
## Computation of step 1: finished :)
Note that here I have set n.boots = 10 to reduce computing time for the CBOCP, given that CRAN only allows me to use two cores.
In general, it is recommended to set n.boots = 0 if you do not wish to compute the CBOCP, or to set n.boots equal to a larger number (e.g., 50, 100 or 200) if you want to accurately compute the CBOCP. In the latter case, consider using as many cores as are available to you to speed the computations up.
fit_lmms returns as output a list with several elements; among them is lmm.fits.orig, which contains the LMMs fitted to each biomarker:
## [1] "boot.ids" "call.info" "df.sanitized" "lmm.fits.boot"
## [5] "lmm.fits.orig" "n.boots"
## $marker1
## Linear mixed-effects model fit by REML
## Data: df.sub
## Log-restricted-likelihood: -678.8922
## Fixed: fixef.formula
## (Intercept) age
## 3.491217 -1.025154
##
## Random effects:
## Formula: ~age | id
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 1.4809966 (Intr)
## age 0.8241438 0.391
## Residual 0.7262114
##
## Number of Observations: 391
## Number of Groups: 100
For more details about the arguments of fit_lmms and its outputs, see the help page: ?fit_lmms.
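For a single biomarker, the LMM with a random intercept and a random slope for age can also be fitted directly with nlme::lme, which is the kind of fit shown in the output above. The sketch below uses toy data simulated in base R; all numerical values are chosen for illustration only, and the use of the 'optim' optimizer is an assumption made to keep the example robust. It is not a description of what fit_lmms does internally.

```r
library(nlme)

set.seed(1234)
n = 100
t.values = c(0, 0.2, 0.5, 1, 1.5, 2)
n.obs = length(t.values)

# toy data for one biomarker (values chosen for illustration only)
base.age = runif(n, min = 3, max = 5)
u0 = rnorm(n, sd = 1.5)   # random intercepts u_0i
u1 = rnorm(n, sd = 0.8)   # random slopes u_1i
toy = data.frame(id = rep(1:n, each = n.obs),
                 age = rep(base.age, each = n.obs) + rep(t.values, times = n))
toy$marker1 = 3.5 + u0[toy$id] + (-1 + u1[toy$id]) * toy$age +
  rnorm(nrow(toy), sd = 0.7)

# y_ij = beta_0 + u_0i + (beta_1 + u_1i) * a_ij + e_ij
fit = lme(fixed = marker1 ~ age, random = ~ age | id, data = toy,
          control = lmeControl(opt = 'optim'))

fixef(fit)        # estimates of beta_0 and beta_1
head(ranef(fit))  # predicted u_0i and u_1i for the first subjects
```

The two columns returned by ranef() correspond to the subject-specific intercepts and slopes that PRC uses as summaries of the trajectories.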
In the second step we compute the predicted random intercepts and random slopes for the LMMs fitted in step 1:
## Computing the predicted random effects on the original dataset...
## ...done
## Bootstrap procedure started
## This computation will be run in parallel, using 2 cores
## Bootstrap procedure finished
## Computation of step 2: finished :)
summarize_lmms returns as output a list that contains, among other elements, a matrix ranef.orig with the predicted random effects for the LMMs fitted in step 1:
## [1] "boot.ids" "call" "n.boots" "ranef.boot.train"
## [5] "ranef.boot.valid" "ranef.orig"
## marker1_b_int marker1_b_age marker2_b_int marker2_b_age
## 1 0.468791151 0.50257974 0.4806176 0.9897538
## 2 -0.651187418 -1.18949489 0.3094290 -1.2798246
## 3 0.541710892 1.35639622 -1.1855801 -0.4835886
## 4 1.065842498 0.82801424 -0.4332657 0.1949000
## 5 0.002714736 -0.09192979 -0.4224219 -1.4788431
For more details about the arguments of summarize_lmms and its outputs, see the help page: ?summarize_lmms.
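To make the structure of ranef.orig more concrete, the following sketch builds a similar matrix by hand: it fits one LMM per biomarker with nlme::lme on toy data and binds the predicted random effects column-wise, using the same naming pattern as in the output above (marker1_b_int, marker1_b_age, ...). The data, the number of biomarkers and the optimizer setting are assumptions made for illustration; summarize_lmms may proceed differently internally.

```r
library(nlme)

set.seed(1)
n = 60
t.values = c(0, 0.5, 1, 2)
toy = data.frame(id = rep(1:n, each = length(t.values)),
                 age = rep(runif(n, 3, 5), each = length(t.values)) +
                   rep(t.values, times = n))

# two toy biomarkers, each with its own random intercepts and slopes
for (k in 1:2) {
  u0 = rnorm(n, sd = 1.5)
  u1 = rnorm(n, sd = 0.8)
  toy[[paste0('marker', k)]] = 2 + u0[toy$id] + (1 + u1[toy$id]) * toy$age +
    rnorm(nrow(toy), sd = 0.7)
}

# fit one LMM per biomarker and collect the predicted random effects
ranef.list = lapply(c('marker1', 'marker2'), function(y) {
  fit = lme(fixed = as.formula(paste(y, '~ age')), random = ~ age | id,
            data = toy, control = lmeControl(opt = 'optim'))
  re = ranef(fit)
  names(re) = paste(y, c('b_int', 'b_age'), sep = '_')
  re
})

# one row per subject, two columns per biomarker
ranef.wide = do.call(cbind, ranef.list)
head(ranef.wide)
```

With \(p\) biomarkers this yields \(2p\) summaries per subject, which is why the penalized Cox model in step 3 has to cope with many predictors.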
Lastly, in the third step of PRC-LMM we estimate a penalized Cox model where we employ as predictors baseline age and all the summaries (predicted random effects) computed in step 2:
# step 3: penalized Cox model with baseline age and the summaries from step 2
step3 = fit_prclmm(object = step2, surv.data = simdata$surv.data,
                   baseline.covs = ~ baseline.age,
                   penalty = 'ridge', n.cores = n.cores)
## Estimated penalized Cox model on the original dataset...
## ...done
## Bootstrap procedure started
## This computation will be run in parallel, using 2 cores
## Bootstrap procedure finished
## Computation of step 3: finished :)
Note that here we have specified penalty = 'ridge', but alternatively one may also use the elasticnet or lasso penalties. Moreover, by default the predicted random effects are standardized when included in the penalized Cox model (if you don’t want to perform such standardization, set standardize = FALSE).
fit_prclmm returns as output a list that contains, among other elements, the fitted penalized Cox model pcox.orig, which is a cv.glmnet object (from the glmnet package):
## [1] "boot.ids" "call" "n.boots" "pcox.boot" "pcox.orig" "surv.data"
## [1] "cv.glmnet"
## Loading required package: Matrix
## Loaded glmnet 4.0-2
## baseline.age marker1_b_int marker1_b_age marker2_b_int marker2_b_age
## 1 0.001253804 -0.02091294 -0.005629194 0.05632732 0.1530983
## marker3_b_int marker3_b_age marker4_b_int marker4_b_age marker5_b_int
## 1 0.103437 0.06736506 -0.0103572 -0.0005451742 -0.3558578
## marker5_b_age marker6_b_int marker6_b_age marker7_b_int marker7_b_age
## 1 -0.074645 2.170329 0.02189938 0.03473504 0.01001565
## marker8_b_int marker8_b_age marker9_b_int marker9_b_age marker10_b_int
## 1 -0.3016531 0.02321143 -0.1515156 -0.03053159 0.02916538
## marker10_b_age
## 1 0.001877107
For more details about the arguments of fit_prclmm and its outputs, see the help page: ?fit_prclmm.
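Since pcox.orig is a cv.glmnet object, it may help to see what a cross-validated penalized Cox model looks like when fitted directly with glmnet. The sketch below uses a toy predictor matrix in place of the predicted random effects; the data-generating values and the choice of lambda.min when extracting the coefficients are assumptions made for illustration, not a description of the internals of fit_prclmm.

```r
library(glmnet)

set.seed(1234)
n = 100
# toy matrix of predictors standing in for the predicted random effects
x = matrix(rnorm(n * 6), nrow = n,
           dimnames = list(NULL, paste0('summary', 1:6)))

# toy survival outcome that depends on the first two predictors
t.event = rexp(n, rate = 0.3 * exp(0.8 * x[, 1] - 0.8 * x[, 2]))
t.cens = runif(n, min = 1, max = 10)
y = cbind(time = pmin(t.event, t.cens),
          status = as.numeric(t.event <= t.cens))

# alpha = 0 gives the ridge penalty, alpha = 1 the lasso,
# and values in between the elastic net
cvfit = cv.glmnet(x = x, y = y, family = 'cox', alpha = 0,
                  standardize = TRUE)

# coefficients at the value of lambda selected by cross-validation
coef(cvfit, s = 'lambda.min')
```

With the ridge penalty all coefficients are shrunk towards zero but none is set exactly to zero, which matches the dense coefficient vector shown in the output above.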
After fitting the model, you will probably want to obtain predicted survival probabilities for each individual at several time points. This can be done through the function survpred_prclmm, which takes as inputs the outputs of step 2 and step 3, along with the time points at which to compute the survival probabilities:
## id S(1) S(2) S(3)
## 1 1 0.4685893 0.1666528 0.08151238
## 2 2 0.7506514 0.5076432 0.38729468
## 3 3 0.7088022 0.4432699 0.32036642
## 4 4 0.5153831 0.2087025 0.11167056
## 5 5 0.6207733 0.3239896 0.20662124
## 6 6 0.5630938 0.2572870 0.14965965
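To illustrate what predicted survival probabilities from a Cox model are, here is a small sketch with the survival package. For a subject with covariates \(x\), a Cox model yields \(S(t \mid x) = S_0(t)^{\exp(x^\top \beta)}\), where \(S_0(t)\) is the baseline survival function. Note that the sketch uses an ordinary (unpenalized) Cox model on toy data in place of the penalized model fitted by pencal, and it is an assumption that survpred_prclmm follows the same logic.

```r
library(survival)

set.seed(1234)
n = 200
x1 = rnorm(n)
t.event = rexp(n, rate = 0.3 * exp(0.7 * x1))
t.cens = runif(n, min = 1, max = 10)
toy = data.frame(time = pmin(t.event, t.cens),
                 event = as.numeric(t.event <= t.cens), x1 = x1)

# unpenalized Cox model (stand-in for the penalized Cox model of step 3)
cox = coxph(Surv(time, event) ~ x1, data = toy)

# predicted survival curves for three new subjects
new.subj = data.frame(x1 = c(-1, 0, 1))
curves = survfit(cox, newdata = new.subj)

# S(1), S(2), S(3): rows are time points, columns are subjects
S = summary(curves, times = c(1, 2, 3), extend = TRUE)$surv
S
```

As in the table above, each subject's predicted survival probability decreases as the prediction time increases.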
To accurately quantify the predictive performance of the fitted PRC-LMM, we need to resort to some form of internal validation strategy (e.g., bootstrap or cross-validation).
In pencal the internal validation is performed through a Cluster Bootstrap Optimism Correction Procedure (CBOCP) that yields optimism-corrected estimates of the concordance (C) index and of the time-dependent AUC.
Most of the steps that the CBOCP requires are directly computed by the functions fit_lmms, summarize_lmms and fit_prclmm whenever the argument n.boots of fit_lmms is set equal to an integer > 0 (in other words: most of the computations needed for the CBOCP have already been performed in the code chunks executed above, so we are almost done!).
To gather the results of the CBOCP we can use the function performance_prclmm:
## Computation of optimism correction started
## This computation will be run in parallel, using 2 cores
## Computation of the optimism correction: finished :)
## n.boots C.naive cb.opt.corr C.adjusted
## 1 10 0.7855 -0.0397 0.7458
## pred.time tdAUC.naive cb.opt.corr tdAUC.adjusted
## 1 1 0.8874 -0.0386 0.8488
## 2 2 0.8337 -0.0442 0.7895
## 3 3 0.8303 -0.0674 0.7629
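The relation between the columns of the two tables above can be verified by hand: each optimism-corrected estimate equals the naive estimate plus the (negative) optimism correction. The short chunk below simply redoes this arithmetic with the values printed above; the object names are chosen here for illustration.

```r
# values copied from the output of performance_prclmm shown above
C.naive = 0.7855
C.opt.corr = -0.0397
C.adjusted = C.naive + C.opt.corr
C.adjusted

tdAUC = data.frame(pred.time = 1:3,
                   tdAUC.naive = c(0.8874, 0.8337, 0.8303),
                   cb.opt.corr = c(-0.0386, -0.0442, -0.0674))
tdAUC$tdAUC.adjusted = tdAUC$tdAUC.naive + tdAUC$cb.opt.corr
tdAUC
```

The negative corrections indicate that the naive estimates, which are computed on the same data used to fit the model, are somewhat optimistic.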
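From the results above we can see that:
- the naive estimate of the C index is 0.7855, whereas the optimism-corrected estimate is 0.7458;
- the naive estimates of the time-dependent AUC at prediction times 1, 2 and 3 are 0.8874, 0.8337 and 0.8303, whereas the optimism-corrected estimates are 0.8488, 0.7895 and 0.7629.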
The aim of this vignette is to provide a quick-start introduction to the R package pencal. Here I have focused on the fundamental aspects that one needs to know to use the package.
Further details, functions and examples can be found in the manual of the package.