Introduction to the MachineShop Package

Brian J Smith

2018-10-02

The MachineShop Package

MachineShop is a meta-package for statistical and machine learning with a common interface for model fitting, prediction, performance assessment, and presentation of results. Support is provided for predictive modeling of numerical, categorical, and censored time-to-event outcomes and for resample (bootstrap and cross-validation) estimation of model performance. This vignette introduces the package interface with a survival data analysis example, followed by applications to other types of response variables, supported methods of model specification and data preprocessing, and a list of all currently available models.

Model Fitting and Prediction

The lung dataset from the survival package (Therneau 2015) contains time, in days, to death or censoring for advanced lung cancer patients from the North Central Cancer Treatment Group, along with potential predictors of the survival outcomes. We begin by loading the MachineShop and survival packages required for the analysis, as well as the magrittr package (Bache and Wickham 2014) for its pipe (%>%) operator to simplify some of the code syntax. The dataset is split into a training set to which a survival model will be fit and a test set on which to make predictions. A global formula fo relates the predictors on the right-hand side to the survival outcome on the left and will be used in all of the survival models fit in this vignette.

## Load libraries for the survival analysis
library(MachineShop)
library(survival)
library(magrittr)

## Lung cancer dataset
head(lung)
#>   inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
#> 1    3  306      2  74   1       1       90       100     1175      NA
#> 2    3  455      2  68   1       0       90        90     1225      15
#> 3    3 1010      1  56   1       0       90        90       NA      15
#> 4    5  210      2  57   1       1       90        60     1150      11
#> 5    1  883      2  60   1       0      100        90       NA       0
#> 6   12 1022      1  74   1       1       50        80      513       0

## Create training and test sets
n <- nrow(lung) * 2 / 3
train <- head(lung, n)
test <- tail(lung, -n)

## Global formula for the analysis
fo <- Surv(time, status) ~ age + sex + ph.ecog + ph.karno + pat.karno +
                           meal.cal + wt.loss

Generalized boosted regression models are a tree-based ensemble method that can be applied to survival outcomes. They are available in MachineShop with the function GBMModel. A call to the function creates an instance of the model containing any user-specified model parameters and the internal machinery for model fitting, prediction, and performance assessment. Created models can be supplied to the fit function to estimate a relationship (fo) between predictors and an outcome based on a set of data (train). The importance of variables in a model fit is estimated with the varimp function and plotted with plot. Variable importance is a measure of the relative importance of predictors in a model and has a default range of 0 to 100, where 0 corresponds to the least important variables and 100 to the most.

## Fit a generalized boosted model
gbmfit <- fit(fo, data = train, model = GBMModel)

## Predictor variable importance
(vi <- varimp(gbmfit))
#>               Overall
#> wt.loss   100.0000000
#> meal.cal   39.4443729
#> pat.karno  10.8370000
#> ph.ecog     9.7782767
#> sex         6.2603557
#> age         0.3483733
#> ph.karno    0.0000000

plot(vi)

From the model fit, predictions are obtained at 180, 360, and 540 days as survival probabilities (type = "prob") and as 0-1 death indicators (default: type = "response").

## Predict survival probabilities and outcomes at specified follow-up times
times <- c(180, 360, 540)
predict(gbmfit, newdata = test, times = times, type = "prob") %>% head
#>           [,1]      [,2]      [,3]
#> [1,] 0.9227150 0.8436564 0.7558962
#> [2,] 0.9245074 0.8471239 0.7610171
#> [3,] 0.9408301 0.8790474 0.8087967
#> [4,] 0.8860744 0.7744102 0.6565031
#> [5,] 0.9403041 0.8780090 0.8072246
#> [6,] 0.9109190 0.8210223 0.7228045

predict(gbmfit, newdata = test, times = times) %>% head
#>      [,1] [,2] [,3]
#> [1,]    0    0    0
#> [2,]    0    0    0
#> [3,]    0    0    0
#> [4,]    0    0    0
#> [5,]    0    0    0
#> [6,]    0    0    0

A call to modelmetrics with observed and predicted outcomes will produce model performance metrics. The metrics produced depend on the type of the observed variable. In the case of a Surv variable, the metrics are area under the ROC curve (Heagerty, Lumley, and Pepe 2004) and Brier score (Graf et al. 1999) at the specified times and their time-integrated averages.

## Model performance metrics
obs <- response(fo, test)
pred <- predict(gbmfit, newdata = test, times = times, type = "prob")
modelmetrics(obs, pred, times = times)
#>         ROC     Brier ROCTime.1 ROCTime.2 ROCTime.3 BrierTime.1
#> 1 0.6959056 0.3454537 0.7112623 0.6578397 0.7186147   0.2846098
#>   BrierTime.2 BrierTime.3
#> 1   0.3622301   0.3895211

Resample Estimation of Model Performance

The performance of a model can be estimated with resampling methods that simulate repeated training and test set fits and prediction. Performance metrics are computed on each resample to produce an empirical distribution for inference. Resampling is controlled in MachineShop with the following functions, illustrated in the example after the list:

BootControl
Simple bootstrap resampling. Models are fit with bootstrap resampled training sets and used to predict the full data set.
CVControl
Repeated K-fold cross-validation. The full data set is repeatedly partitioned into K-folds. Within a partitioning, prediction is performed on each of the K folds with models fit on all remaining folds.
OOBControl
Out-of-bag bootstrap resampling. Models are fit with bootstrap resampled training sets and used to predict the unsampled cases.
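
Each control function returns an object that can be passed to the control argument of resample, as done later in this vignette. A minimal construction example is given below; the samples argument (the number of bootstrap resamples) is an assumption included for illustration.

## Construct control objects for each resampling method
boot_control <- BootControl(samples = 25)
cv_control <- CVControl(folds = 10, repeats = 5)
oob_control <- OOBControl(samples = 25)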

In our example, performance of models to predict survival at 180, 360, and 540 days will be estimated with five repeats of 10-fold cross-validation. A metrics variable is defined to restrict the printed and plotted output in this vignette to the time-integrated ROC and Brier metrics. Such subsetting of output would not be done in practice if all metrics are of interest.

## Control parameters for repeated K-fold cross-validation
control <- CVControl(
  folds = 10,
  repeats = 5,
  surv_times = c(180, 360, 540)
)

## Metrics of interest
metrics <- c("ROC", "Brier")

Single Model

Resampling of a single model is performed with the resample function applied to a model object (e.g. GBMModel()) and a control object like the one defined previously (control). Summary statistics and plots can be obtained with the summary and plot functions.

## Resample estimation
(perf <- resample(fo, data = lung, model = GBMModel, control = control))
#> An object of class "Resamples"
#> 
#> metrics: ROC, Brier, ROCTime.1, ROCTime.2, ROCTime.3, BrierTime.1, BrierTime.2, BrierTime.3
#> 
#> method: Repeated 10-Fold CV
#> 
#> resamples: 50

summary(perf)
#>                  Mean    Median         SD        Min       Max   NA
#> ROC         0.6267948 0.6307367 0.10759045 0.29833578 0.8868144 0.00
#> Brier       0.3735395 0.3777492 0.06289406 0.24572389 0.4755521 0.02
#> ROCTime.1   0.6547395 0.6477513 0.13877138 0.37500000 0.9333333 0.00
#> ROCTime.2   0.6230899 0.6160392 0.13589783 0.21214427 0.8961790 0.00
#> ROCTime.3   0.6025551 0.6142401 0.14464529 0.07810162 0.8326104 0.00
#> BrierTime.1 0.2431842 0.2476103 0.07860418 0.07330819 0.4452448 0.00
#> BrierTime.2 0.4176099 0.4186352 0.08034191 0.25954089 0.5941881 0.00
#> BrierTime.3 0.4612867 0.4446287 0.09330402 0.29572566 0.6213501 0.02

plot(perf, metrics = metrics)

Model Comparisons

Resampled metrics from different models can be combined for comparison with the Resamples function. Names given on the left-hand side of the equals operators in the call to Resamples will be used as labels in output from the summary and plot functions. For these types of model comparisons, the same control structure should be used in all associated calls to resample to ensure that the resulting model metrics are computed on the same resampled training and test sets.

## Resample estimation
gbmperf1 <- resample(fo, data = lung, model = GBMModel(n.trees = 25), control = control)
gbmperf2 <- resample(fo, data = lung, model = GBMModel(n.trees = 50), control = control)
gbmperf3 <- resample(fo, data = lung, model = GBMModel(n.trees = 100), control = control)

## Combine resamples for comparison
(perf <- Resamples(GBM1 = gbmperf1, GBM2 = gbmperf2, GBM3 = gbmperf3))
#> An object of class "Resamples"
#> 
#> models: GBM1, GBM2, GBM3
#> 
#> metrics: ROC, Brier, ROCTime.1, ROCTime.2, ROCTime.3, BrierTime.1, BrierTime.2, BrierTime.3
#> 
#> method: Repeated 10-Fold CV
#> 
#> resamples: 50

summary(perf)[, , metrics]
#> , , ROC
#> 
#>           Mean    Median        SD       Min       Max NA
#> GBM1 0.6208897 0.6284438 0.1143623 0.2576244 0.8863976  0
#> GBM2 0.6264682 0.6269215 0.1107560 0.2831705 0.8925768  0
#> GBM3 0.6267948 0.6307367 0.1075904 0.2983358 0.8868144  0
#> 
#> , , Brier
#> 
#>           Mean    Median         SD       Min       Max   NA
#> GBM1 0.3134875 0.3145324 0.04911131 0.2180042 0.4255814 0.02
#> GBM2 0.3500086 0.3383746 0.05953035 0.2183535 0.4563990 0.02
#> GBM3 0.3735395 0.3777492 0.06289406 0.2457239 0.4755521 0.02

plot(perf, metrics = metrics)

plot(perf, metrics = metrics, type = "density")

plot(perf, metrics = metrics, type = "errorbar")

plot(perf, metrics = metrics, type = "violin")

Pairwise model differences for each metric can be calculated with the diff function applied to results from a call to Resamples. The differences can be summarized descriptively with the summary and plot functions and assessed for statistical significance with the t.test function.

## Pairwise model comparisons
(perfdiff <- diff(perf))
#> An object of class "ResamplesDiff"
#> 
#> models: GBM1 - GBM2, GBM1 - GBM3, GBM2 - GBM3
#> 
#> metrics: ROC, Brier, ROCTime.1, ROCTime.2, ROCTime.3, BrierTime.1, BrierTime.2, BrierTime.3
#> 
#> method: Repeated 10-Fold CV
#> 
#> resamples: 50

summary(perfdiff)[, , metrics]
#> , , ROC
#> 
#>                      Mean       Median         SD         Min       Max NA
#> GBM1 - GBM2 -0.0055784617 -0.006718618 0.04574029 -0.13181536 0.1330280  0
#> GBM1 - GBM3 -0.0059051071 -0.002523448 0.05806972 -0.17844529 0.1253541  0
#> GBM2 - GBM3 -0.0003266453  0.002907231 0.03992677 -0.08206796 0.1026186  0
#> 
#> , , Brier
#> 
#>                    Mean      Median         SD        Min         Max   NA
#> GBM1 - GBM2 -0.03652101 -0.03199356 0.03060978 -0.1226825 0.020745129 0.02
#> GBM1 - GBM3 -0.06005199 -0.05578357 0.04065655 -0.1866302 0.009489262 0.02
#> GBM2 - GBM3 -0.02353098 -0.02496482 0.03496532 -0.1155204 0.051309803 0.02

plot(perfdiff, metrics = metrics)

t.test(perfdiff)[, , metrics]
#> , , ROC
#> 
#>      GBM1         GBM2          GBM3
#> GBM1   NA -0.005578462 -0.0059051071
#> GBM2    1           NA -0.0003266453
#> GBM3    1  1.000000000            NA
#> 
#> , , Brier
#> 
#>              GBM1          GBM2        GBM3
#> GBM1           NA -3.652101e-02 -0.06005199
#> GBM2 1.301632e-10            NA -0.02353098
#> GBM3 2.519127e-13  2.141914e-05          NA

Model Tuning

Modelling functions may have arguments that define parameters in their model fitting algorithms. For example, GBMModel has arguments n.trees, interaction.depth, and n.minobsinnode that define the number of decision trees to fit, the maximum depth of variable interactions, and the minimum number of observations in the trees' terminal nodes. The tune function is available in MachineShop to fit a model over a grid of parameters and return the model whose parameters provide the optimal fit. Note that the function name GBMModel, and not the function call GBMModel(), is supplied as the first argument to tune. Summary statistics and plots of performance across all tuning parameters are available with the summary and plot functions.

## Tune over a grid of model parameters
(gbmtune <- tune(fo, data = lung, model = GBMModel,
                 grid = expand.grid(n.trees = c(25, 50, 100),
                                    interaction.depth = 1:3,
                                    n.minobsinnode = c(5, 10)),
                 control = control))
#> An object of class "MLModelTune"
#> 
#> name: GBMModel
#> 
#> packages: gbm
#> 
#> types: factor, numeric, Surv
#> 
#> params:
#> $n.trees
#> [1] 100
#> 
#> $interaction.depth
#> [1] 1
#> 
#> $n.minobsinnode
#> [1] 5
#> 
#> $shrinkage
#> [1] 0.1
#> 
#> $bag.fraction
#> [1] 0.5
#> 
#> grid:
#>    n.trees interaction.depth n.minobsinnode
#> 1       25                 1              5
#> 2       50                 1              5
#> 3      100                 1              5
#> 4       25                 2              5
#> 5       50                 2              5
#> 6      100                 2              5
#> 7       25                 3              5
#> 8       50                 3              5
#> 9      100                 3              5
#> 10      25                 1             10
#> 11      50                 1             10
#> 12     100                 1             10
#> 13      25                 2             10
#> 14      50                 2             10
#> 15     100                 2             10
#> 16      25                 3             10
#> 17      50                 3             10
#> 18     100                 3             10
#> 
#> resamples:
#> An object of class "Resamples"
#> 
#> models: Model1, Model2, Model3, Model4, Model5, Model6, Model7, Model8, Model9, Model10, Model11, Model12, Model13, Model14, Model15, Model16, Model17, Model18
#> 
#> metrics: ROC, Brier, ROCTime.1, ROCTime.2, ROCTime.3, BrierTime.1, BrierTime.2, BrierTime.3
#> 
#> method: Repeated 10-Fold CV
#> 
#> resamples: 50
#> 
#> selected: Model3 (ROC)

summary(gbmtune)[, , metrics]
#> , , ROC
#> 
#>              Mean    Median         SD       Min       Max NA
#> Model1  0.6475240 0.6529951 0.11671315 0.2879752 0.8751367  0
#> Model2  0.6505354 0.6461596 0.10758236 0.3029592 0.8807992  0
#> Model3  0.6528556 0.6571132 0.10787394 0.3283268 0.8860066  0
#> Model4  0.6517832 0.6535397 0.10278249 0.3096476 0.8595345  0
#> Model5  0.6456896 0.6373818 0.09997261 0.2897306 0.8446246  0
#> Model6  0.6463744 0.6501941 0.10308284 0.3241649 0.8942996  0
#> Model7  0.6428911 0.6472990 0.09984359 0.2928429 0.8355988  0
#> Model8  0.6428660 0.6509541 0.09600838 0.2909384 0.8213401  0
#> Model9  0.6335341 0.6419372 0.09803981 0.3237794 0.8305131  0
#> Model10 0.6208897 0.6284438 0.11436234 0.2576244 0.8863976  0
#> Model11 0.6264682 0.6269215 0.11075597 0.2831705 0.8925768  0
#> Model12 0.6267948 0.6307367 0.10759045 0.2983358 0.8868144  0
#> Model13 0.6321796 0.6209458 0.10071954 0.3235082 0.8556783  0
#> Model14 0.6336356 0.6393767 0.10794477 0.2791071 0.8818145  0
#> Model15 0.6313388 0.6276999 0.10502512 0.3147146 0.8697500  0
#> Model16 0.6314421 0.6286821 0.10192158 0.3279006 0.8789439  0
#> Model17 0.6279223 0.6121055 0.10352823 0.2877773 0.8334346  0
#> Model18 0.6242168 0.6300677 0.09928675 0.3058928 0.8255478  0
#> 
#> , , Brier
#> 
#>              Mean    Median         SD       Min       Max   NA
#> Model1  0.2541298 0.2404764 0.05242713 0.1830501 0.4749716 0.02
#> Model2  0.2931855 0.2809531 0.06986178 0.1827537 0.5229491 0.02
#> Model3  0.3218424 0.3157772 0.07806414 0.1817510 0.4654257 0.02
#> Model4  0.2943096 0.2943692 0.07439639 0.1575971 0.4806878 0.02
#> Model5  0.3291921 0.3158336 0.07284854 0.2156741 0.4847558 0.02
#> Model6  0.3577765 0.3467007 0.07903774 0.1873953 0.5556421 0.02
#> Model7  0.3101199 0.2972278 0.08925197 0.1596756 0.5528642 0.02
#> Model8  0.3391010 0.3256895 0.08915093 0.1790966 0.5723771 0.02
#> Model9  0.3512441 0.3302284 0.08963545 0.1789306 0.5639761 0.02
#> Model10 0.3134875 0.3145324 0.04911131 0.2180042 0.4255814 0.02
#> Model11 0.3500086 0.3383746 0.05953035 0.2183535 0.4563990 0.02
#> Model12 0.3735395 0.3777492 0.06289406 0.2457239 0.4755521 0.02
#> Model13 0.2755510 0.2649040 0.06558740 0.1578267 0.4730075 0.02
#> Model14 0.2766338 0.2677355 0.06644221 0.1646165 0.5146766 0.02
#> Model15 0.2794591 0.2672761 0.06935673 0.1746263 0.4539657 0.02
#> Model16 0.2695364 0.2584030 0.06157188 0.1722330 0.4021857 0.02
#> Model17 0.2756926 0.2707400 0.06401660 0.1767296 0.4567698 0.02
#> Model18 0.2773934 0.2589818 0.06720444 0.1688597 0.4609474 0.02

plot(gbmtune, type = "line", metrics = metrics)

The value returned by tune contains an object produced by a call to the modelling function with the optimal tuning parameters. Thus, the value can be passed on to the fit function for model fitting to a set of data.

## Fit the tuned model
gbmfit <- fit(fo, data = lung, model = gbmtune)
(vi <- varimp(gbmfit))
#>             Overall
#> wt.loss   100.00000
#> pat.karno  43.55028
#> meal.cal   38.52489
#> age        28.13958
#> ph.ecog    16.35750
#> sex        10.21999
#> ph.karno    0.00000

plot(vi)

Parallel Computing

Resampling is implemented with the foreach package (Microsoft and Weston 2017b) and will run in parallel if a compatible backend is loaded, such as that provided by the doParallel package (Microsoft and Weston 2017a).

library(doParallel)
registerDoParallel(cores = 4)

Response Variable Types

Categorical

Categorical responses with two or more levels should be coded as a factor variable for analysis. The metrics returned will depend on the number of factor levels. Metrics for factors with two levels are as follows.

Accuracy
Proportion of correctly classified responses.
Kappa
Cohen’s kappa statistic measuring relative agreement between observed and predicted classifications.
Brier
Brier score.
ROCAUC
Area under the ROC curve.
PRAUC
Area under the precision-recall curve.
Sensitivity
Proportion of correctly classified values in the second factor level.
Specificity
Proportion of correctly classified values in the first factor level.
Index
A tradeoff function of sensitivity and specificity as defined by cutoff_index in the resampling control functions (default: Sensitivity + Specificity). The function allows for specification of tradeoffs (Perkins and Schisterman 2006) other than the default of Youden’s J statistic (Youden 1950).

Brier, ROCAUC, and PRAUC are computed directly on predicted class probabilities. The others are computed on predicted class membership. Memberships are defined to be in the second factor level if predicted probabilities are greater than a cutoff value defined in the resampling control functions (default: cutoff = 0.5).
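
As a minimal sketch of how these definitions translate into computations, the following code applies the default cutoff to hypothetical observed classes and predicted probabilities; it is illustrative only and not the package's internal implementation.

## Hypothetical observed classes and predicted second-level probabilities
observed <- factor(c("No", "Yes", "Yes", "No", "Yes"), levels = c("No", "Yes"))
predprob <- c(0.20, 0.80, 0.45, 0.60, 0.90)

## Class membership at the default cutoff of 0.5
predicted <- factor(ifelse(predprob > 0.5, "Yes", "No"), levels = c("No", "Yes"))

## Sensitivity, specificity, and the default Index tradeoff
sens <- mean(predicted[observed == "Yes"] == "Yes")
spec <- mean(predicted[observed == "No"] == "No")
c(Sensitivity = sens, Specificity = spec, Index = sens + spec)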

### Pima Indians diabetes statuses (2 levels)
library(MASS)
perf <- resample(factor(type) ~ ., data = Pima.tr, model = GBMModel)
summary(perf)
#>                  Mean    Median         SD        Min       Max NA
#> Accuracy    0.7097494 0.7184211 0.09864029 0.55000000 0.8571429  0
#> Kappa       0.3472411 0.3282669 0.20598468 0.10000000 0.6590909  0
#> Brier       0.1864286 0.1717006 0.07133543 0.08129439 0.2865481  0
#> ROCAUC      0.7969911 0.7957875 0.12508751 0.59340659 0.9795918  0
#> PRAUC       0.5882814 0.6117009 0.14717694 0.36183021 0.8234127  0
#> Sensitivity 0.5476190 0.5357143 0.16723260 0.28571429 0.8333333  0
#> Specificity 0.7934066 0.8076923 0.13154213 0.53846154 1.0000000  0
#> Index       1.3410256 1.3104396 0.20038338 1.10989011 1.6373626  0

Metrics for factors with three or more levels are as described below.

Accuracy
Proportion of correctly classified responses.
Kappa
Cohen’s kappa statistic measuring relative agreement between observed and predicted classifications.
WeightedKappa
Weighted Cohen’s kappa with equally spaced weights. This metric is only computed for ordered factor responses.
MLogLoss
Multinomial logistic loss or cross entropy loss.

MLogLoss is computed directly on predicted class probabilities. The others are computed on predicted class membership, defined as the factor level with the highest predicted probability.
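
For illustration, multinomial log loss and highest-probability class membership could be computed by hand as in the following sketch with hypothetical probabilities; this is not the package's internal code.

## Hypothetical observed classes and predicted class probabilities
observed <- factor(c("setosa", "versicolor", "virginica"))
probs <- rbind(c(0.90, 0.05, 0.05),
               c(0.10, 0.70, 0.20),
               c(0.05, 0.25, 0.70))
colnames(probs) <- levels(observed)

## Class membership as the level with the highest predicted probability
predicted <- factor(levels(observed)[max.col(probs)], levels = levels(observed))

## Multinomial log loss of the observed classes
-mean(log(probs[cbind(seq_along(observed), as.integer(observed))]))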

### Iris flowers species (3 levels)
perf <- resample(factor(Species) ~ ., data = iris, model = GBMModel)
summary(perf)
#>               Mean    Median         SD         Min       Max NA
#> Accuracy 0.9400000 0.9333333 0.04919099 0.866666667 1.0000000  0
#> Kappa    0.9100000 0.9000000 0.07378648 0.800000000 1.0000000  0
#> MLogLoss 0.2749864 0.1705714 0.25599219 0.004360594 0.6291907  0

Numerical

Numerical responses should be coded as a numeric variable. Associated performance metrics are as defined below and illustrated with Boston housing price data (Venables and Ripley 2002).

R2
One minus the ratio of residual to total sums of squares, \[R^2 = 1 - \frac{\sum_{i=1}^n(y_i - \hat{y}_i)^2}{\sum_{i=1}^n(y_i - \bar{y})^2},\] where \(y_i\) and \(\hat{y}_i\) are the \(n\) observed and predicted responses and \(\bar{y}\) is the mean of the observed responses.
RMSE
Root mean square error, \[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2}.\]
MAE
Median absolute error, \[MAE = \operatorname{median}(|y_i - \hat{y}_i|).\]

### Boston housing prices
library(MASS)
perf <- resample(medv ~ ., data = Boston, model = GBMModel)
summary(perf)
#>           Mean    Median         SD       Min       Max NA
#> R2   0.8164724 0.8268029 0.07777862 0.6836031 0.9048452  0
#> RMSE 3.8012284 3.9477530 0.71651929 2.8476595 4.7160359  0
#> MAE  2.6372293 2.5795521 0.35453397 2.2049741 3.2084473  0
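
These metrics can also be computed directly from vectors of observed and predicted responses, as in the following sketch with hypothetical values.

## Manual computation of the numeric metrics for hypothetical values
y <- c(24.0, 21.6, 34.7, 33.4, 36.2)     ## observed responses
yhat <- c(25.1, 20.9, 33.0, 34.8, 35.0)  ## predicted responses

c(R2 = 1 - sum((y - yhat)^2) / sum((y - mean(y))^2),
  RMSE = sqrt(mean((y - yhat)^2)),
  MAE = median(abs(y - yhat)))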

Survival

Survival responses should be coded as a Surv variable. In addition to the ROC and Brier survival metrics described earlier in the vignette, the concordance index (Harrell et al. 1982) can be obtained if follow-up times are not specified for the prediction.

## Censored lung cancer survival times
library(survival)
perf <- resample(Surv(time, status) ~ ., data = lung, model = GBMModel)
summary(perf)
#>             Mean    Median         SD       Min       Max NA
#> CIndex 0.6223534 0.6098921 0.06900829 0.4901961 0.7169811  0

Model Specifications

Model specification here refers to the relationship between the response and predictor variables and the data used to estimate it. Three main types of specification are supported by the fit, resample, and tune functions: formulas, model frames, and recipes.

Formulas

Models may be specified with the traditional formula and data frame pair, as was done in the previous examples. In this specification, in-line functions, interactions, and . substitution of variables not already appearing in the formula may be included.
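
For illustration, the following formulas sketch these constructs; they are shown as formula objects only.

## Illustrative formulas with an in-line function, an interaction, and
## . substitution of the remaining variables
medv ~ log(lstat) + rm        ## in-line transformation of a predictor
medv ~ lstat + rm + lstat:rm  ## interaction between two predictors
medv ~ .                      ## all other variables as predictors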

## Formula specification
gbmfit <- fit(medv ~ ., data = Boston, model = GBMModel)
varimp(gbmfit)
#>             Overall
#> lstat   100.0000000
#> rm       78.7690494
#> nox      10.1299088
#> dis       9.6524457
#> ptratio   7.9191006
#> crim      7.6073951
#> tax       1.6476096
#> black     0.8751824
#> chas      0.7372595
#> age       0.2985201
#> zn        0.0000000
#> indus     0.0000000
#> rad       0.0000000

Model Frames

The second specification is similar to the first, except that the formula and data frame pair are given in a model.frame. The model frame approach has a few subtle advantages. One is that cases with missing values on any of the response or predictor variables are excluded from the model frame by default. This is often desirable for models that cannot handle missing values. Note, however, that some models like GBMModel do accommodate missing values. For those, missing values can be retained in the model frame by setting its argument na.action = NULL.
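
For example, cases with missing values in the lung dataset could be retained as sketched below, with fo and lung as defined in the earlier survival example.

## Model frame retaining cases with missing values
mf_na <- model.frame(fo, data = lung, na.action = NULL)
gbmfit_na <- fit(mf_na, model = GBMModel)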

## Model frame specification
mf <- model.frame(medv ~ ., data = Boston)
gbmfit <- fit(mf, model = GBMModel)
varimp(gbmfit)
#>             Overall
#> lstat   100.0000000
#> rm       86.2898780
#> nox      10.3718501
#> dis       8.8483151
#> ptratio   7.8626588
#> crim      6.2201116
#> chas      1.4484297
#> tax       1.4133295
#> black     0.8554703
#> rad       0.4481856
#> age       0.2085257
#> zn        0.0000000
#> indus     0.0000000

Another advantage is that case weights can be included in the model frame and will be passed on to the model fitting functions in MachineShop.

## Model frame specification with case weights
mf <- model.frame(ncases / (ncases + ncontrols) ~ agegp + tobgp + alcgp,
                  data = esoph, weights = ncases + ncontrols)
gbmfit <- fit(mf, model = GBMModel)
varimp(gbmfit)
#>         Overall
#> alcgp 100.00000
#> agegp  82.88254
#> tobgp   0.00000

Recipes

The recipes package (Kuhn and Wickham 2018) provides a framework for defining predictor and response variables and preprocessing steps to be applied to them prior to model fitting. Using recipes helps to ensure that estimation of predictive performance accounts for all modeling steps. They are also a very convenient way of consistently applying preprocessing to new data. Recipes currently support factor and numeric responses, but not generally Surv.

## Recipe specification
library(recipes)
rec <- recipe(medv ~ ., data = Boston) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(all_predictors())
gbmfit <- fit(rec, model = GBMModel)
varimp(gbmfit)
#>       Overall
#> PC1 100.00000
#> PC3  59.74319
#> PC5  14.51863
#> PC4  14.42464
#> PC2   0.00000

Available Models

Currently available model functions are summarized in the table below according to the types of response variables with which each model can be used. The package additionally supplies a generic MLModel function for users to create their own custom models.

                         Response Variable Types
                      factor  numeric  ordered  Surv
C50Model                x
CForestModel            x        x               x
CoxModel                                          x
CoxStepAICModel                                   x
GLMModel                x        x
GLMStepAICModel         x        x
GBMModel                x        x                x
GLMNetModel             x        x                x
NNetModel               x        x
PLSModel                x        x
POLRModel                                x
RandomForestModel       x        x
SurvRegModel                                      x
SurvRegStepAICModel                               x
SVMModel                x        x

References

Bache, Stefan Milton, and Hadley Wickham. 2014. Magrittr: A Forward-Pipe Operator for R. https://CRAN.R-project.org/package=magrittr.

Graf, E, C Schmoor, W Sauerbrei, and M Schumacher. 1999. “Assessment and Comparison of Prognostic Classification Schemes for Survival Data.” Statistics in Medicine 18 (17–18): 2529–45.

Harrell, FE, RM Califf, DB Pryor, KL Lee, and RA Rosati. 1982. “Evaluating the Yield of Medical Tests.” JAMA 247 (18): 2543–6.

Heagerty, PJ, T Lumley, and MS Pepe. 2004. “Time-Dependent ROC Curves for Censored Survival Data and a Diagnostic Marker.” Biometrics 56 (2): 337–44.

Kuhn, Max, and Hadley Wickham. 2018. Recipes: Preprocessing Tools to Create Design Matrices. https://CRAN.R-project.org/package=recipes.

Microsoft, and Steve Weston. 2017a. DoParallel: Foreach Parallel Adaptor for the ’Parallel’ Package. https://CRAN.R-project.org/package=doParallel.

———. 2017b. Foreach: Provides Foreach Looping Construct for R. https://CRAN.R-project.org/package=foreach.

Perkins, Neil J., and Enrique F. Schisterman. 2006. “The Inconsistency of ‘Optimal’ Cutpoints Obtained Using Two Criteria Based on the Receiver Operating Characteristic Curve.” American Journal of Epidemiology 163 (7): 670–75.

Therneau, Terry M. 2015. A Package for Survival Analysis in S. https://CRAN.R-project.org/package=survival.

Venables, WN, and BD Ripley. 2002. Modern Applied Statistics with S. Fourth. New York: Springer. http://www.stats.ox.ac.uk/pub/MASS4.

Youden, WJ. 1950. “Index for Rating Diagnostic Tests.” Cancer 3 (1): 32–35.