MachineShop is a meta-package for statistical and machine learning with a common interface for model fitting, prediction, performance assessment, and presentation of results. Support is provided for predictive modeling of numerical, categorical, and censored time-to-event outcomes and for resample (bootstrap and cross-validation) estimation of model performance. This vignette introduces the package interface with a survival data analysis example, followed by applications to other types of response variables, supported methods of model specification and data preprocessing, and a list of all currently available models.
The lung dataset from the survival package (Therneau 2015) contains time, in days, to death or censoring for advanced lung cancer patients from the North Central Cancer Treatment Group. Also provided are potential predictors of the survival outcomes. We begin by loading the MachineShop and survival packages required for the analysis as well as the magrittr package (Bache and Wickham 2014) for its pipe (%>%) operator to simplify some of the code syntax. The dataset is split into a training set to which a survival model will be fit and a test set on which to make predictions. A global formula, fo, relates the predictors on the right-hand side to the survival outcome on the left and will be used in all of the survival models in this vignette example.
## Load libraries for the survival analysis
library(MachineShop)
library(survival)
library(magrittr)
## Lung cancer dataset
head(lung)
#> inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 1 3 306 2 74 1 1 90 100 1175 NA
#> 2 3 455 2 68 1 0 90 90 1225 15
#> 3 3 1010 1 56 1 0 90 90 NA 15
#> 4 5 210 2 57 1 1 90 60 1150 11
#> 5 1 883 2 60 1 0 100 90 NA 0
#> 6 12 1022 1 74 1 1 50 80 513 0
## Create training and test sets (first two-thirds of rows vs. the remainder)
n <- nrow(lung) * 2 / 3
train <- head(lung, n)
test <- tail(lung, -n)
## Global formula for the analysis
fo <- Surv(time, status) ~ age + sex + ph.ecog + ph.karno + pat.karno +
meal.cal + wt.loss
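As a small illustration of the splitting step, the following base R sketch partitions a hypothetical toy data frame (dat is not part of the analysis) into disjoint training and test rows. It relies on head(x, n) keeping the first n rows and tail(x, -n) dropping the first n rows; note that head(x, -n) would instead drop the last n rows.

```r
## Toy data frame standing in for a real dataset (hypothetical)
dat <- data.frame(id = 1:9, x = (1:9) / 10)
n <- floor(nrow(dat) * 2 / 3)

## First n rows for training; all rows except the first n for testing
train_rows <- head(dat, n)
test_rows <- tail(dat, -n)

## The two sets share no rows and together cover the whole dataset
intersect(train_rows$id, test_rows$id)
nrow(train_rows) + nrow(test_rows) == nrow(dat)
```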
Generalized boosted regression models are a tree-based ensemble method that can be applied to survival outcomes. They are available in MachineShop through the function GBMModel. A call to the function creates an instance of the model containing any user-specified model parameters and internal machinery for model fitting, prediction, and performance assessment. Created models can be supplied to the fit function to estimate a relationship (fo) between predictors and an outcome based on a set of data (train). The importance of variables in a model fit is estimated with the varimp function and plotted with plot. Variable importance is a measure of the relative importance of predictors in a model and has a default range of 0 to 100, where 0 corresponds to the least important variables and 100 to the most.
## Fit a generalized boosted model
gbmfit <- fit(fo, data = train, model = GBMModel)
## Predictor variable importance
(vi <- varimp(gbmfit))
#> Overall
#> wt.loss 100.0000000
#> meal.cal 39.4443729
#> pat.karno 10.8370000
#> ph.ecog 9.7782767
#> sex 6.2603557
#> age 0.3483733
#> ph.karno 0.0000000
plot(vi)
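The 0 to 100 range is consistent with a min-max rescaling of raw importance values. The sketch below shows that rescaling in base R under this assumption; the raw values are hypothetical, and the package's internal computation is not shown in this vignette.

```r
## Hypothetical raw importance values (not taken from the fitted model)
raw <- c(wt.loss = 12.4, meal.cal = 5.1, age = 0.9, ph.karno = 0.2)

## Min-max rescaling: least important maps to 0, most important to 100
scaled <- 100 * (raw - min(raw)) / (max(raw) - min(raw))
sort(scaled, decreasing = TRUE)
```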
From the model fit, predictions are obtained at 180, 360, and 540 days as survival probabilities (type = "prob") and as 0-1 death indicators (default: type = "response").
## Predict survival probabilities and outcomes at specified follow-up times
times <- c(180, 360, 540)
predict(gbmfit, newdata = test, times = times, type = "prob") %>% head
#> [,1] [,2] [,3]
#> [1,] 0.9227150 0.8436564 0.7558962
#> [2,] 0.9245074 0.8471239 0.7610171
#> [3,] 0.9408301 0.8790474 0.8087967
#> [4,] 0.8860744 0.7744102 0.6565031
#> [5,] 0.9403041 0.8780090 0.8072246
#> [6,] 0.9109190 0.8210223 0.7228045
predict(gbmfit, newdata = test, times = times) %>% head
#> [,1] [,2] [,3]
#> [1,] 0 0 0
#> [2,] 0 0 0
#> [3,] 0 0 0
#> [4,] 0 0 0
#> [5,] 0 0 0
#> [6,] 0 0 0
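One way to relate the two prediction types is to threshold the survival probabilities. The sketch below assumes a cutoff of 0.5, with death predicted when the survival probability does not exceed it; this rule is an assumption for illustration, and the third row of probabilities is hypothetical.

```r
## Survival probabilities at 180, 360, and 540 days
## (rows 1-2 copied from the output above; row 3 is hypothetical)
surv_prob <- rbind(c(0.9227150, 0.8436564, 0.7558962),
                   c(0.8860744, 0.7744102, 0.6565031),
                   c(0.45, 0.30, 0.10))
colnames(surv_prob) <- paste0("t", c(180, 360, 540))

## Assumed rule: predict death (1) when survival probability <= 0.5
death <- (surv_prob <= 0.5) * 1
death
```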
A call to modelmetrics with observed and predicted outcomes will produce model performance metrics. The metrics produced will depend on the type of the observed variable. In the case of a Surv variable, the metrics are area under the ROC curve (Heagerty, Lumley, and Pepe 2004) and Brier score (Graf et al. 1999) at the specified times and their time-integrated averages.
## Model performance metrics
obs <- response(fo, test)
pred <- predict(gbmfit, newdata = test, times = times, type = "prob")
modelmetrics(obs, pred, times = times)
#> ROC Brier ROCTime.1 ROCTime.2 ROCTime.3 BrierTime.1
#> 1 0.6959056 0.3454537 0.7112623 0.6578397 0.7186147 0.2846098
#> BrierTime.2 BrierTime.3
#> 1 0.3622301 0.3895211
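To give a sense of what a Brier score at a single follow-up time measures, here is a simplified base R sketch on hypothetical data in which every death time is observed. It deliberately omits the censoring weights of the Graf et al. (1999) estimator, so it is not the computation performed by modelmetrics.

```r
## Hypothetical, fully observed death times (no censoring)
time <- c(100, 200, 400, 600)

## Hypothetical predicted probabilities of surviving beyond 360 days
surv_prob_360 <- c(0.2, 0.4, 0.7, 0.9)

## Observed survival status at 360 days: 1 = still alive, 0 = dead
alive_360 <- as.numeric(time > 360)

## Brier score: mean squared difference between status and prediction
brier_360 <- mean((alive_360 - surv_prob_360)^2)
brier_360
```

Lower values indicate better agreement between predicted probabilities and observed statuses.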
The performance of a model can be estimated with resampling methods that simulate repeated training and test set fits and predictions. Performance metrics are computed on each resample to produce an empirical distribution for inference. Resampling is controlled in MachineShop with resampling control functions, such as the CVControl function for cross-validation used below.
In our example, performance of models to predict survival at 180, 360, and 540 days will be estimated with five repeats of 10-fold cross-validation. The variable metrics is defined to reduce the printed and plotted output in this vignette to only the time-integrated ROC and Brier metrics. Such subsetting of output would not be done in practice if all metrics are of interest.
## Control parameters for repeated K-fold cross-validation
control <- CVControl(
folds = 10,
repeats = 5,
surv_times = c(180, 360, 540)
)
## Metrics of interest
metrics <- c("ROC", "Brier")
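The resampling scheme itself can be sketched in base R. The example below generates the test and training indices for five repeats of 10-fold cross-validation on a hypothetical dataset of 30 cases; it illustrates the idea only and is not how CVControl is implemented.

```r
set.seed(123)   # for reproducible fold assignments
n <- 30         # hypothetical number of cases
folds <- 10
repeats <- 5

## For each repeat, randomly assign every case to one of the folds
fold_ids <- lapply(seq_len(repeats), function(r) {
  sample(rep(seq_len(folds), length.out = n))
})

## Each fold of each repeat serves once as a test set ...
test_sets <- unlist(
  lapply(fold_ids, function(ids) split(seq_len(n), ids)),
  recursive = FALSE
)

## ... with the remaining cases forming the matching training set
train_sets <- lapply(test_sets, function(test) setdiff(seq_len(n), test))

length(test_sets)   # number of resamples: folds * repeats
```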
Resampling of a single model is performed with the resample function applied to a model object (e.g. GBMModel()) and a control object like the one defined previously (control). Summary statistics and plots can be obtained with the summary and plot functions.
## Resample estimation
(perf <- resample(fo, data = lung, model = GBMModel, control = control))
#> An object of class "Resamples"
#>
#> metrics: ROC, Brier, ROCTime.1, ROCTime.2, ROCTime.3, BrierTime.1, BrierTime.2, BrierTime.3
#>
#> method: Repeated 10-Fold CV
#>
#> resamples: 50
summary(perf)
#> Mean Median SD Min Max NA
#> ROC 0.6267948 0.6307367 0.10759045 0.29833578 0.8868144 0.00
#> Brier 0.3735395 0.3777492 0.06289406 0.24572389 0.4755521 0.02
#> ROCTime.1 0.6547395 0.6477513 0.13877138 0.37500000 0.9333333 0.00
#> ROCTime.2 0.6230899 0.6160392 0.13589783 0.21214427 0.8961790 0.00
#> ROCTime.3 0.6025551 0.6142401 0.14464529 0.07810162 0.8326104 0.00
#> BrierTime.1 0.2431842 0.2476103 0.07860418 0.07330819 0.4452448 0.00
#> BrierTime.2 0.4176099 0.4186352 0.08034191 0.25954089 0.5941881 0.00
#> BrierTime.3 0.4612867 0.4446287 0.09330402 0.29572566 0.6213501 0.02
plot(perf, metrics = metrics)
Resampled metrics from different models can be combined for comparison with the Resamples function. Names given on the left-hand side of the equal operators in the call to Resamples will be used as labels in output from the summary and plot functions. For these types of model comparisons, the same control structure should be used in all associated calls to resample to ensure that the resulting model metrics are computed on the same resampled training and test sets.
## Resample estimation
gbmperf1 <- resample(fo, data = lung, model = GBMModel(n.trees = 25), control = control)
gbmperf2 <- resample(fo, data = lung, model = GBMModel(n.trees = 50), control = control)
gbmperf3 <- resample(fo, data = lung, model = GBMModel(n.trees = 100), control = control)
## Combine resamples for comparison
(perf <- Resamples(GBM1 = gbmperf1, GBM2 = gbmperf2, GBM3 = gbmperf3))
#> An object of class "Resamples"
#>
#> models: GBM1, GBM2, GBM3
#>
#> metrics: ROC, Brier, ROCTime.1, ROCTime.2, ROCTime.3, BrierTime.1, BrierTime.2, BrierTime.3
#>
#> method: Repeated 10-Fold CV
#>
#> resamples: 50
summary(perf)[, , metrics]
#> , , ROC
#>
#> Mean Median SD Min Max NA
#> GBM1 0.6208897 0.6284438 0.1143623 0.2576244 0.8863976 0
#> GBM2 0.6264682 0.6269215 0.1107560 0.2831705 0.8925768 0
#> GBM3 0.6267948 0.6307367 0.1075904 0.2983358 0.8868144 0
#>
#> , , Brier
#>
#> Mean Median SD Min Max NA
#> GBM1 0.3134875 0.3145324 0.04911131 0.2180042 0.4255814 0.02
#> GBM2 0.3500086 0.3383746 0.05953035 0.2183535 0.4563990 0.02
#> GBM3 0.3735395 0.3777492 0.06289406 0.2457239 0.4755521 0.02
plot(perf, metrics = metrics)
plot(perf, metrics = metrics, type = "density")
plot(perf, metrics = metrics, type = "errorbar")
plot(perf, metrics = metrics, type = "violin")
Pairwise model differences for each metric can be calculated with the diff function applied to results from a call to Resamples. The differences can be summarized descriptively with the summary and plot functions and assessed for statistical significance with the t.test function.
## Pairwise model comparisons
(perfdiff <- diff(perf))
#> An object of class "ResamplesDiff"
#>
#> models: GBM1 - GBM2, GBM1 - GBM3, GBM2 - GBM3
#>
#> metrics: ROC, Brier, ROCTime.1, ROCTime.2, ROCTime.3, BrierTime.1, BrierTime.2, BrierTime.3
#>
#> method: Repeated 10-Fold CV
#>
#> resamples: 50
summary(perfdiff)[, , metrics]
#> , , ROC
#>
#> Mean Median SD Min Max NA
#> GBM1 - GBM2 -0.0055784617 -0.006718618 0.04574029 -0.13181536 0.1330280 0
#> GBM1 - GBM3 -0.0059051071 -0.002523448 0.05806972 -0.17844529 0.1253541 0
#> GBM2 - GBM3 -0.0003266453 0.002907231 0.03992677 -0.08206796 0.1026186 0
#>
#> , , Brier
#>
#> Mean Median SD Min Max NA
#> GBM1 - GBM2 -0.03652101 -0.03199356 0.03060978 -0.1226825 0.020745129 0.02
#> GBM1 - GBM3 -0.06005199 -0.05578357 0.04065655 -0.1866302 0.009489262 0.02
#> GBM2 - GBM3 -0.02353098 -0.02496482 0.03496532 -0.1155204 0.051309803 0.02
plot(perfdiff, metrics = metrics)
t.test(perfdiff)[, , metrics]
#> , , ROC
#>
#> GBM1 GBM2 GBM3
#> GBM1 NA -0.005578462 -0.0059051071
#> GBM2 1 NA -0.0003266453
#> GBM3 1 1.000000000 NA
#>
#> , , Brier
#>
#> GBM1 GBM2 GBM3
#> GBM1 NA -3.652101e-02 -0.06005199
#> GBM2 1.301632e-10 NA -0.02353098
#> GBM3 2.519127e-13 2.141914e-05 NA
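The comparison of two models on the same resamples amounts to a paired test on the per-resample metric differences. The base R sketch below uses simulated, hypothetical ROC values; the Holm adjustment shown for multiple comparisons is an assumption, since the adjustment used by the package is not stated here.

```r
set.seed(1)
## Hypothetical ROC values for two models on the same 50 resamples
roc_a <- rnorm(50, mean = 0.62, sd = 0.10)
roc_b <- roc_a + rnorm(50, mean = 0.01, sd = 0.04)

## Paired t-test on the per-resample differences
d <- roc_a - roc_b
res <- t.test(roc_a, roc_b, paired = TRUE)
c(mean_diff = mean(d), p_value = res$p.value)

## Adjusting several pairwise p-values (hypothetical values) with Holm
p_adj <- p.adjust(c(0.004, 0.03, 0.20), method = "holm")
p_adj
```

Because cross-validation resamples share training data, the independence assumption of the t-test holds only approximately, so such p-values are best read as rough guides.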
Modelling functions may have arguments that define parameters in their model fitting algorithms. For example, GBMModel has arguments n.trees, interaction.depth, and n.minobsinnode that define the number of decision trees to fit, the maximum depth of variable interactions, and the minimum number of observations in the trees' terminal nodes. The tune function is available in MachineShop to fit a model over a grid of parameters and return the model whose parameters provide the optimal fit. Note that the function name GBMModel, and not the function call GBMModel(), is supplied as the model argument to tune. Summary statistics and plots of performance across all tuning parameters are available with the summary and plot functions.
## Tune over a grid of model parameters
(gbmtune <- tune(fo, data = lung, model = GBMModel,
grid = expand.grid(n.trees = c(25, 50, 100),
interaction.depth = 1:3,
n.minobsinnode = c(5, 10)),
control = control))
#> An object of class "MLModelTune"
#>
#> name: GBMModel
#>
#> packages: gbm
#>
#> types: factor, numeric, Surv
#>
#> params:
#> $n.trees
#> [1] 100
#>
#> $interaction.depth
#> [1] 1
#>
#> $n.minobsinnode
#> [1] 5
#>
#> $shrinkage
#> [1] 0.1
#>
#> $bag.fraction
#> [1] 0.5
#>
#> grid:
#> n.trees interaction.depth n.minobsinnode
#> 1 25 1 5
#> 2 50 1 5
#> 3 100 1 5
#> 4 25 2 5
#> 5 50 2 5
#> 6 100 2 5
#> 7 25 3 5
#> 8 50 3 5
#> 9 100 3 5
#> 10 25 1 10
#> 11 50 1 10
#> 12 100 1 10
#> 13 25 2 10
#> 14 50 2 10
#> 15 100 2 10
#> 16 25 3 10
#> 17 50 3 10
#> 18 100 3 10
#>
#> resamples:
#> An object of class "Resamples"
#>
#> models: Model1, Model2, Model3, Model4, Model5, Model6, Model7, Model8, Model9, Model10, Model11, Model12, Model13, Model14, Model15, Model16, Model17, Model18
#>
#> metrics: ROC, Brier, ROCTime.1, ROCTime.2, ROCTime.3, BrierTime.1, BrierTime.2, BrierTime.3
#>
#> method: Repeated 10-Fold CV
#>
#> resamples: 50
#>
#> selected: Model3 (ROC)
summary(gbmtune)[, , metrics]
#> , , ROC
#>
#> Mean Median SD Min Max NA
#> Model1 0.6475240 0.6529951 0.11671315 0.2879752 0.8751367 0
#> Model2 0.6505354 0.6461596 0.10758236 0.3029592 0.8807992 0
#> Model3 0.6528556 0.6571132 0.10787394 0.3283268 0.8860066 0
#> Model4 0.6517832 0.6535397 0.10278249 0.3096476 0.8595345 0
#> Model5 0.6456896 0.6373818 0.09997261 0.2897306 0.8446246 0
#> Model6 0.6463744 0.6501941 0.10308284 0.3241649 0.8942996 0
#> Model7 0.6428911 0.6472990 0.09984359 0.2928429 0.8355988 0
#> Model8 0.6428660 0.6509541 0.09600838 0.2909384 0.8213401 0
#> Model9 0.6335341 0.6419372 0.09803981 0.3237794 0.8305131 0
#> Model10 0.6208897 0.6284438 0.11436234 0.2576244 0.8863976 0
#> Model11 0.6264682 0.6269215 0.11075597 0.2831705 0.8925768 0
#> Model12 0.6267948 0.6307367 0.10759045 0.2983358 0.8868144 0
#> Model13 0.6321796 0.6209458 0.10071954 0.3235082 0.8556783 0
#> Model14 0.6336356 0.6393767 0.10794477 0.2791071 0.8818145 0
#> Model15 0.6313388 0.6276999 0.10502512 0.3147146 0.8697500 0
#> Model16 0.6314421 0.6286821 0.10192158 0.3279006 0.8789439 0
#> Model17 0.6279223 0.6121055 0.10352823 0.2877773 0.8334346 0
#> Model18 0.6242168 0.6300677 0.09928675 0.3058928 0.8255478 0
#>
#> , , Brier
#>
#> Mean Median SD Min Max NA
#> Model1 0.2541298 0.2404764 0.05242713 0.1830501 0.4749716 0.02
#> Model2 0.2931855 0.2809531 0.06986178 0.1827537 0.5229491 0.02
#> Model3 0.3218424 0.3157772 0.07806414 0.1817510 0.4654257 0.02
#> Model4 0.2943096 0.2943692 0.07439639 0.1575971 0.4806878 0.02
#> Model5 0.3291921 0.3158336 0.07284854 0.2156741 0.4847558 0.02
#> Model6 0.3577765 0.3467007 0.07903774 0.1873953 0.5556421 0.02
#> Model7 0.3101199 0.2972278 0.08925197 0.1596756 0.5528642 0.02
#> Model8 0.3391010 0.3256895 0.08915093 0.1790966 0.5723771 0.02
#> Model9 0.3512441 0.3302284 0.08963545 0.1789306 0.5639761 0.02
#> Model10 0.3134875 0.3145324 0.04911131 0.2180042 0.4255814 0.02
#> Model11 0.3500086 0.3383746 0.05953035 0.2183535 0.4563990 0.02
#> Model12 0.3735395 0.3777492 0.06289406 0.2457239 0.4755521 0.02
#> Model13 0.2755510 0.2649040 0.06558740 0.1578267 0.4730075 0.02
#> Model14 0.2766338 0.2677355 0.06644221 0.1646165 0.5146766 0.02
#> Model15 0.2794591 0.2672761 0.06935673 0.1746263 0.4539657 0.02
#> Model16 0.2695364 0.2584030 0.06157188 0.1722330 0.4021857 0.02
#> Model17 0.2756926 0.2707400 0.06401660 0.1767296 0.4567698 0.02
#> Model18 0.2773934 0.2589818 0.06720444 0.1688597 0.4609474 0.02
plot(gbmtune, type = "line", metrics = metrics)
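The logic of a grid search can be sketched in base R: build every parameter combination, score each one, and keep the best. In the sketch below, score is a hypothetical stand-in for a resampled performance estimate; in the package, that estimate comes from resampling each candidate model.

```r
## All combinations of the tuning parameters (n.trees varies fastest)
grid <- expand.grid(n.trees = c(25, 50, 100),
                    interaction.depth = 1:3,
                    n.minobsinnode = c(5, 10))

## Hypothetical performance function; higher is better
score <- function(p) {
  -abs(p$n.trees - 50) / 100 -
    0.1 * abs(p$interaction.depth - 2) -
    0.01 * p$n.minobsinnode
}

## Evaluate every row of the grid and select the best combination
scores <- vapply(seq_len(nrow(grid)),
                 function(i) score(grid[i, ]),
                 numeric(1))
best <- grid[which.max(scores), ]
best
```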
The value returned by tune contains an object produced by a call to the modelling function with the optimal tuning parameters. Thus, the value can be passed on to the fit function for model fitting to a set of data.
## Fit the tuned model
gbmfit <- fit(fo, data = lung, model = gbmtune)
(vi <- varimp(gbmfit))
#> Overall
#> wt.loss 100.00000
#> pat.karno 43.55028
#> meal.cal 38.52489
#> age 28.13958
#> ph.ecog 16.35750
#> sex 10.21999
#> ph.karno 0.00000
plot(vi)
Resampling is implemented with the foreach package (Microsoft and Weston 2017b) and will run in parallel if a compatible backend is loaded, such as that provided by the doParallel package (Microsoft and Weston 2017a).
library(doParallel)
registerDoParallel(cores = 4)
Categorical responses with two or more levels should be coded as a factor variable for analysis. The type of metrics returned will depend on the number of factor levels. Metrics for factors with two levels are Accuracy, Kappa, Brier, ROCAUC, PRAUC, Sensitivity, Specificity, and Index. Index is a tradeoff function of sensitivity and specificity, as defined by cutoff_index in the resampling control functions (default: Sensitivity + Specificity). The function allows for specification of tradeoffs (Perkins and Schisterman 2006) other than the default of Youden’s J statistic (Youden 1950).
Brier, ROCAUC, and PRAUC are computed directly on predicted class probabilities. The others are computed on predicted class membership. Memberships are defined to be in the second factor level if predicted probabilities are greater than a cutoff value defined in the resampling control functions (default: cutoff = 0.5).
### Pima Indians diabetes statuses (2 levels)
library(MASS)
perf <- resample(factor(type) ~ ., data = Pima.tr, model = GBMModel)
summary(perf)
#> Mean Median SD Min Max NA
#> Accuracy 0.7097494 0.7184211 0.09864029 0.55000000 0.8571429 0
#> Kappa 0.3472411 0.3282669 0.20598468 0.10000000 0.6590909 0
#> Brier 0.1864286 0.1717006 0.07133543 0.08129439 0.2865481 0
#> ROCAUC 0.7969911 0.7957875 0.12508751 0.59340659 0.9795918 0
#> PRAUC 0.5882814 0.6117009 0.14717694 0.36183021 0.8234127 0
#> Sensitivity 0.5476190 0.5357143 0.16723260 0.28571429 0.8333333 0
#> Specificity 0.7934066 0.8076923 0.13154213 0.53846154 1.0000000 0
#> Index 1.3410256 1.3104396 0.20038338 1.10989011 1.6373626 0
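The membership-based metrics can be illustrated in base R with hypothetical observations and predicted probabilities. The sketch applies a 0.5 cutoff for membership in the second factor level and computes the index as sensitivity plus specificity, matching the defaults described above.

```r
## Hypothetical observed classes and predicted probabilities of "Yes"
obs <- factor(c("No", "No", "No", "Yes", "Yes", "Yes", "No", "Yes"),
              levels = c("No", "Yes"))
prob_yes <- c(0.1, 0.6, 0.3, 0.8, 0.4, 0.9, 0.2, 0.7)

## Predicted membership in the second level if probability > cutoff
cutoff <- 0.5
pred <- factor(ifelse(prob_yes > cutoff, "Yes", "No"), levels = levels(obs))

## Confusion matrix and derived metrics
tab <- table(observed = obs, predicted = pred)
sens <- tab["Yes", "Yes"] / sum(tab["Yes", ])
spec <- tab["No", "No"] / sum(tab["No", ])
brier <- mean(((obs == "Yes") - prob_yes)^2)

c(Sensitivity = sens, Specificity = spec, Index = sens + spec, Brier = brier)
```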
Metrics for factors with three or more levels are Accuracy, Kappa, and MLogLoss.
MLogLoss is computed directly on predicted class probabilities. The others are computed on predicted class membership, defined as the factor level with the highest predicted probability.
### Iris flowers species (3 levels)
perf <- resample(factor(Species) ~ ., data = iris, model = GBMModel)
summary(perf)
#> Mean Median SD Min Max NA
#> Accuracy 0.9400000 0.9333333 0.04919099 0.866666667 1.0000000 0
#> Kappa 0.9100000 0.9000000 0.07378648 0.800000000 1.0000000 0
#> MLogLoss 0.2749864 0.1705714 0.25599219 0.004360594 0.6291907 0
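As a rough illustration of these multi-class metrics, the base R sketch below computes a multinomial log loss and accuracy from a hypothetical matrix of predicted class probabilities. It assumes the log loss is the mean negative natural log of the probability assigned to the observed class.

```r
## Hypothetical observed classes and predicted class probabilities
obs <- factor(c("a", "b", "c", "a"), levels = c("a", "b", "c"))
prob <- rbind(c(0.7, 0.2, 0.1),
              c(0.1, 0.8, 0.1),
              c(0.2, 0.3, 0.5),
              c(0.3, 0.4, 0.3))
colnames(prob) <- levels(obs)

## Probability assigned to each observed class
p_true <- prob[cbind(seq_along(obs), as.integer(obs))]

## Multinomial log loss (assumed definition: mean negative log-probability)
mlogloss <- -mean(log(p_true))

## Predicted membership: level with the highest predicted probability
pred <- factor(levels(obs)[max.col(prob, ties.method = "first")],
               levels = levels(obs))
accuracy <- mean(pred == obs)

c(Accuracy = accuracy, MLogLoss = mlogloss)
```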
Numerical responses should be coded as a numeric variable. Associated performance metrics are R2, RMSE, and MAE, illustrated below with Boston housing price data (Venables and Ripley 2002).
### Boston housing prices
library(MASS)
perf <- resample(medv ~ ., data = Boston, model = GBMModel)
summary(perf)
#> Mean Median SD Min Max NA
#> R2 0.8164724 0.8268029 0.07777862 0.6836031 0.9048452 0
#> RMSE 3.8012284 3.9477530 0.71651929 2.8476595 4.7160359 0
#> MAE 2.6372293 2.5795521 0.35453397 2.2049741 3.2084473 0
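The three numeric metrics can be computed by hand in base R. The sketch below uses hypothetical observed and predicted values and assumes R2 is defined as one minus the ratio of residual to total sum of squares; the package's exact definition is not given in this section.

```r
## Hypothetical observed and predicted responses
obs <- c(24.0, 21.6, 34.7, 33.4)
pred <- c(25.0, 20.6, 32.7, 35.4)
err <- obs - pred

## Root mean squared error and mean absolute error
rmse <- sqrt(mean(err^2))
mae <- mean(abs(err))

## R2 (assumed definition): 1 - residual SS / total SS
r2 <- 1 - sum(err^2) / sum((obs - mean(obs))^2)

c(R2 = r2, RMSE = rmse, MAE = mae)
```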
Survival responses should be coded as a Surv variable. In addition to the ROC and Brier survival metrics described earlier in the vignette, the concordance index (Harrell et al. 1982) can be obtained if follow-up times are not specified for the prediction.
## Censored lung cancer survival times
library(survival)
perf <- resample(Surv(time, status) ~ ., data = lung, model = GBMModel)
summary(perf)
#> Mean Median SD Min Max NA
#> CIndex 0.6223534 0.6098921 0.06900829 0.4901961 0.7169811 0
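The concordance index can be understood as the proportion of usable pairs of subjects in which the subject who dies earlier has the higher predicted risk. The following base R function, cindex, is a hypothetical and deliberately simple pairwise implementation for illustration; here status is coded 1 for death and 0 for censoring, and ties in time are ignored.

```r
## Naive concordance index over all usable pairs (illustrative only)
cindex <- function(time, status, risk) {
  concordant <- 0
  usable <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      ## A pair is usable when subject i has an observed death before time j
      if (status[i] == 1 && time[i] < time[j]) {
        usable <- usable + 1
        if (risk[i] > risk[j]) {
          concordant <- concordant + 1
        } else if (risk[i] == risk[j]) {
          concordant <- concordant + 0.5
        }
      }
    }
  }
  concordant / usable
}

## Hypothetical data: the second subject is censored
time <- c(5, 10, 15, 20)
status <- c(1, 0, 1, 1)
risk <- c(0.9, 0.5, 0.1, 0.2)
cindex(time, status, risk)
```

A value of 0.5 corresponds to random risk ordering and 1 to perfect ordering.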
Model specification here refers to the relationship between the response and predictor variables and the data used to estimate it. Three main types of specification are supported by the fit, resample, and tune functions: formulas, model frames, and recipes.
Models may be specified with the traditional formula and data frame pair, as was done in the previous examples. In this specification, in-line functions, interactions, and . substitution of variables not already appearing in the formula may be included.
## Formula specification
gbmfit <- fit(medv ~ ., data = Boston, model = GBMModel)
varimp(gbmfit)
#> Overall
#> lstat 100.0000000
#> rm 78.7690494
#> nox 10.1299088
#> dis 9.6524457
#> ptratio 7.9191006
#> crim 7.6073951
#> tax 1.6476096
#> black 0.8751824
#> chas 0.7372595
#> age 0.2985201
#> zn 0.0000000
#> indus 0.0000000
#> rad 0.0000000
The second specification is similar to the first, except the formula and data frame pair are given in a model.frame. The model frame approach has a few subtle advantages. One is that cases with missing values on any of the response or predictor variables are excluded from the model frame by default. This is often desirable for models that cannot handle missing values. Note, however, that some models like GBMModel do accommodate missing values. For those, missing values can be retained in the model frame by setting its argument na.action = NULL.
## Model frame specification
mf <- model.frame(medv ~ ., data = Boston)
gbmfit <- fit(mf, model = GBMModel)
varimp(gbmfit)
#> Overall
#> lstat 100.0000000
#> rm 86.2898780
#> nox 10.3718501
#> dis 8.8483151
#> ptratio 7.8626588
#> crim 6.2201116
#> chas 1.4484297
#> tax 1.4133295
#> black 0.8554703
#> rad 0.4481856
#> age 0.2085257
#> zn 0.0000000
#> indus 0.0000000
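The effect of na.action on a model frame can be checked directly in base R. The sketch below uses the built-in airquality dataset, chosen here only because it contains missing values; it is not part of the vignette's analyses.

```r
## Default behavior: rows with any missing value are dropped
mf_default <- model.frame(Ozone ~ ., data = airquality)

## na.action = NULL: rows with missing values are retained
mf_keep <- model.frame(Ozone ~ ., data = airquality, na.action = NULL)

c(all_rows = nrow(airquality),
  default = nrow(mf_default),
  retained = nrow(mf_keep))
```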
Another advantage is that case weights can be included in the model frame and will be passed on to the model fitting functions in MachineShop.
## Model frame specification with case weights
mf <- model.frame(ncases / (ncases + ncontrols) ~ agegp + tobgp + alcgp,
data = esoph, weights = ncases + ncontrols)
gbmfit <- fit(mf, model = GBMModel)
varimp(gbmfit)
#> Overall
#> alcgp 100.00000
#> agegp 82.88254
#> tobgp 0.00000
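For readers curious about where the weights end up, this base R sketch shows that model.frame stores them in a column named (weights), which model.weights extracts. The reduced formula with a single predictor is used only to keep the example short.

```r
## Model frame with case weights (single predictor for brevity)
mf <- model.frame(ncases / (ncases + ncontrols) ~ agegp,
                  data = esoph, weights = ncases + ncontrols)

## The weights are stored as an extra column of the model frame
names(mf)

## Extract the case weights
w <- model.weights(mf)
head(w)
```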
The recipes package (Kuhn and Wickham 2018) provides a framework for defining predictor and response variables and preprocessing steps to be applied to them prior to model fitting. Using recipes helps to ensure that estimation of predictive performance accounts for all modeling steps. They are also a very convenient way of consistently applying preprocessing to new data. Recipes currently support factor and numeric responses, but not generally Surv.
## Recipe specification
library(recipes)
rec <- recipe(medv ~ ., data = Boston) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors()) %>%
step_pca(all_predictors())
gbmfit <- fit(rec, model = GBMModel)
varimp(gbmfit)
#> Overall
#> PC1 100.00000
#> PC3 59.74319
#> PC5 14.51863
#> PC4 14.42464
#> PC2 0.00000
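The idea of estimating preprocessing on training data and then applying it unchanged to new data can be sketched with base R's prcomp. This is an analogue of the centering, scaling, and PCA steps above, not the recipes API, and the split of the built-in iris data is hypothetical.

```r
## Hypothetical split of predictors into training and new data
idx <- seq(1, nrow(iris), by = 2)
train_x <- iris[idx, 1:4]
new_x <- iris[-idx, 1:4]

## Estimate centering, scaling, and principal components on training data
pca <- prcomp(train_x, center = TRUE, scale. = TRUE)
train_pc <- pca$x

## Apply the same training-derived transformation to the new data
new_pc <- predict(pca, newdata = new_x)
head(new_pc)
```

Because the centers, scales, and rotation come from the training data only, no information from the new data leaks into the preprocessing.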
Currently available model functions are summarized in the table below according to the types of response variables with which each model can be used. The package additionally supplies a generic MLModel function for users to create their own custom models.
Model | factor | numeric | ordered | Surv
---|---|---|---|---
C50Model | x | | |
CForestModel | x | x | | x
CoxModel | | | | x
CoxStepAICModel | | | | x
GLMModel | x | x | |
GLMStepAICModel | x | x | |
GBMModel | x | x | | x
GLMNetModel | x | x | | x
NNetModel | x | x | |
PLSModel | x | x | |
POLRModel | | | x |
RandomForestModel | x | x | |
SurvRegModel | | | | x
SurvRegStepAICModel | | | | x
SVMModel | x | x | |
Bache, Stefan Milton, and Hadley Wickham. 2014. Magrittr: A Forward-Pipe Operator for R. https://CRAN.R-project.org/package=magrittr.
Graf, E, C Schmoor, W Sauerbrei, and M Schumacher. 1999. “Assessment and Comparison of Prognostic Classification Schemes for Survival Data.” Statistics in Medicine 18 (17–18): 2529–45.
Harrell, FE, RM Califf, DB Pryor, KL Lee, and RA Rosati. 1982. “Evaluating the Yield of Medical Tests.” JAMA 247 (18): 2543–6.
Heagerty, PJ, T Lumley, and MS Pepe. 2004. “Time-Dependent ROC Curves for Censored Survival Data and a Diagnostic Marker.” Biometrics 56 (2): 337–44.
Kuhn, Max, and Hadley Wickham. 2018. Recipes: Preprocessing Tools to Create Design Matrices. https://CRAN.R-project.org/package=recipes.
Microsoft, and Steve Weston. 2017a. DoParallel: Foreach Parallel Adaptor for the ’Parallel’ Package. https://CRAN.R-project.org/package=doParallel.
———. 2017b. Foreach: Provides Foreach Looping Construct for R. https://CRAN.R-project.org/package=foreach.
Perkins, Neil J., and Enrique F. Schisterman. 2006. “The Inconsistency of ‘Optimal’ Cutpoints Obtained Using Two Criteria Based on the Receiver Operating Characteristic Curve.” American Journal of Epidemiology 163 (7): 670–75.
Therneau, Terry M. 2015. A Package for Survival Analysis in S. https://CRAN.R-project.org/package=survival.
Venables, WN, and BD Ripley. 2002. Modern Applied Statistics with S. Fourth. New York: Springer. http://www.stats.ox.ac.uk/pub/MASS4.
Youden, WJ. 1950. “Index for Rating Diagnostic Tests.” Cancer 3 (1): 32–35.