mice and psfmi
This vignette shows how to develop, internally validate, and externally validate a (logistic) regression prediction model with mice and psfmi.
You can install the released version of psfmi with:
install.packages("psfmi")
And the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("mwheymans/psfmi")
You can install the released version of mice with:
install.packages("mice")
Back to Steps
I generated 5 imputed datasets to handle the missing values in the lbp_orig dataset that is
included in the psfmi
package. Depending on the amount of missing data, the number of imputed
datasets may be increased (see the online book Applied Missing Data Analysis).
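Before running the imputation, it can be helpful to inspect how much data is missing per variable, which also helps when choosing the number of imputed datasets. A minimal sketch using standard mice and base R functions (an optional check, not part of the original workflow):
library(psfmi)   # contains the lbp_orig dataset
library(mice)
# Proportion of missing values per variable and the overall missing data pattern
colMeans(is.na(lbp_orig))
md.pattern(lbp_orig)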
library(mice)
library(psfmi)
imp <- mice(lbp_orig, m=5, maxit=10)
##
## iter imp variable
## 1 1 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 1 2 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 1 3 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 1 4 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 1 5 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 2 1 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 2 2 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 2 3 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 2 4 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 2 5 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 3 1 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 3 2 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 3 3 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 3 4 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 3 5 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 4 1 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 4 2 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 4 3 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 4 4 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 4 5 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 5 1 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 5 2 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 5 3 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 5 4 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 5 5 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 6 1 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 6 2 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 6 3 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 6 4 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 6 5 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 7 1 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 7 2 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 7 3 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 7 4 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 7 5 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 8 1 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 8 2 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 8 3 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 8 4 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 8 5 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 9 1 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 9 2 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 9 3 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 9 4 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 9 5 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 10 1 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 10 2 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 10 3 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 10 4 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
## 10 5 Carrying Pain Tampascale Function Radiation Age Satisfaction JobControl JobDemands SocialSupport
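After imputation it is good practice to check whether the algorithm converged and whether the imputed values look plausible. A short optional sketch using the standard mice diagnostic plots:
# Trace plots of the chain means and standard deviations of the imputed variables
plot(imp)
# Compare the distributions of observed and imputed values
densityplot(imp)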
Use the complete function of the mice package to extract the completed datasets. By setting action = "long" and include = FALSE, the imputed datasets are stacked under each other to form one long dataset, and the original dataset (that contains the missing values) is not included in that long dataset.
library(mice)
data_comp <- complete(imp, action = "long", include = FALSE)
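As a quick optional check, the long dataset should contain five stacked copies of the data (identified by the .imp column that complete adds) and no remaining missing values:
table(data_comp$.imp)   # number of rows per imputed dataset
anyNA(data_comp)        # FALSE when all missing values were imputed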
Back to Steps
Use the psfmi_lr function in the psfmi package to perform backward selection over the 5 multiply imputed datasets with a p-value of 0.157 and D1 as the selection method. Note that with this function it is possible to include restricted cubic spline terms for non-linear relationships during backward selection, and that predictors can be forced into the model during backward selection by including them in the setting keep.predictors. When the setting direction is changed to "FW", forward selection is performed.
library(psfmi)
pool_lr <- psfmi_lr(data=data_comp, nimp=5, impvar=".imp", Outcome="Chronic",
predictors=c("Gender", "Age", "JobControl", "Tampascale", "Pain", "Radiation",
"JobDemands", "SocialSupport", "Smoking"), cat.predictors = c("Satisfaction", "Carrying"),
spline.predictors = "Function", nknots = 3,
keep.predictors = "Radiation", p.crit = 0.157, method="D1", direction = "BW")
## Removed at Step 1 is - Gender
## Removed at Step 2 is - JobDemands
## Removed at Step 3 is - Smoking
## Removed at Step 4 is - Age
## Removed at Step 5 is - rcs(Function,3)
## Removed at Step 6 is - JobControl
## Removed at Step 7 is - SocialSupport
## Removed at Step 8 is - Tampascale
##
## Selection correctly terminated,
## No more variables removed from the model
pool_lr$RR_model_final
## $`Step 9`
## term estimate std.error statistic df p.value
## 1 (Intercept) -5.4555864 1.0533349 -5.1793466 80.89631 1.596456e-06
## 2 Pain 0.9066812 0.1912409 4.7410420 36.76855 3.169734e-05
## 3 Radiation 0.4783856 0.4964348 0.9636423 130.77651 3.370038e-01
## 4 factor(Satisfaction)2 -0.5236336 0.6191642 -0.8457104 62.08012 4.009623e-01
## 5 factor(Satisfaction)3 -2.7018731 0.7981299 -3.3852547 95.26055 1.034237e-03
## 6 factor(Carrying)2 1.2435674 0.6128183 2.0292596 92.09083 4.531842e-02
## 7 factor(Carrying)3 1.7770774 0.7317224 2.4286223 40.12301 1.973018e-02
## OR lower.EXP upper.EXP
## 1 0.004272371 0.000542066 0.0336733
## 2 2.476091328 1.702076603 3.6020872
## 3 1.613467475 0.609798549 4.2690775
## 4 0.592364189 0.176013335 1.9935724
## 5 0.067079751 0.014034907 0.3206073
## 6 3.467963196 1.043357183 11.5269909
## 7 5.912550841 1.409029900 24.8101601
pool_lr$multiparm_final
## $`Step 9`
## p-values D1 F-statistic
## Pain 1.532075e-05 22.4774796
## Radiation 3.353720e-01 0.9286065
## factor(Satisfaction) 2.483896e-03 6.1898977
## factor(Carrying) 3.699533e-02 3.4226311
The same result can be obtained by using the formula setting in the psfmi_lr function.
library(psfmi)
pool_lr <- psfmi_lr(data=data_comp, nimp=5, impvar=".imp",
formula = Chronic ~ Gender + Age + JobControl + Tampascale + Pain + Radiation +
JobDemands + SocialSupport + Smoking + factor(Satisfaction) + factor(Carrying) +
rcs(Function, 3), keep.predictors = "Radiation",
p.crit = 0.157, method="D1", direction = "BW")
## Removed at Step 1 is - Gender
## Removed at Step 2 is - JobDemands
## Removed at Step 3 is - Smoking
## Removed at Step 4 is - Age
## Removed at Step 5 is - rcs(Function,3)
## Removed at Step 6 is - JobControl
## Removed at Step 7 is - SocialSupport
## Removed at Step 8 is - Tampascale
##
## Selection correctly terminated,
## No more variables removed from the model
pool_lr$RR_model_final
## $`Step 9`
## term estimate std.error statistic df p.value
## 1 (Intercept) -5.4555864 1.0533349 -5.1793466 80.89631 1.596456e-06
## 2 Pain 0.9066812 0.1912409 4.7410420 36.76855 3.169734e-05
## 3 Radiation 0.4783856 0.4964348 0.9636423 130.77651 3.370038e-01
## 4 factor(Satisfaction)2 -0.5236336 0.6191642 -0.8457104 62.08012 4.009623e-01
## 5 factor(Satisfaction)3 -2.7018731 0.7981299 -3.3852547 95.26055 1.034237e-03
## 6 factor(Carrying)2 1.2435674 0.6128183 2.0292596 92.09083 4.531842e-02
## 7 factor(Carrying)3 1.7770774 0.7317224 2.4286223 40.12301 1.973018e-02
## OR lower.EXP upper.EXP
## 1 0.004272371 0.000542066 0.0336733
## 2 2.476091328 1.702076603 3.6020872
## 3 1.613467475 0.609798549 4.2690775
## 4 0.592364189 0.176013335 1.9935724
## 5 0.067079751 0.014034907 0.3206073
## 6 3.467963196 1.043357183 11.5269909
## 7 5.912550841 1.409029900 24.8101601
pool_lr$multiparm_final
## $`Step 9`
## p-values D1 F-statistic
## Pain 1.532075e-05 22.4774796
## Radiation 3.353720e-01 0.9286065
## factor(Satisfaction) 2.483896e-03 6.1898977
## factor(Carrying) 3.699533e-02 3.4226311
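The names of the predictors that remain in the final model are also stored in the returned object; this element is reused below when the model is externally validated:
# Predictors retained after backward selection (used later in mivalext_lr)
pool_lr$predictors_final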
Back to Steps
To internally validate the model we use the psfmi_perform function. With this function five different methods can be used to internally validate models in multiply imputed data (see these Vignettes): three methods use cross-validation and two use bootstrapping, in combination with MI.
We will use cross-validation, more specifically the method cv_MI_RR. With this method it is possible to integrate backward selection into the cross-validation procedure. The last model that is selected by the psfmi_lr function is internally validated. So, if we want to apply backward selection during cross-validation starting from the full model, we first have to apply the psfmi_lr function without variable selection. That is the example we use here, because we know that variable selection is one of the main reasons that prediction models are over-fitted.
library(psfmi)
pool_val <- psfmi_lr(data=data_comp, formula = Chronic ~ Gender + Age + JobControl + Tampascale +
Pain + Radiation + JobDemands + SocialSupport + Smoking + factor(Satisfaction) +
factor(Carrying) + rcs(Function, 3), p.crit = 1, direction="BW",
nimp=5, impvar=".imp", method="D1")
set.seed(200)
res_cv <- psfmi_perform(pool_val, val_method = "cv_MI_RR", data_orig = lbp_orig, folds = 3,
p.crit=0.05, BW=TRUE, nimp_mice = 10, miceImp = miceImp, printFlag = FALSE)
##
## fold 1
## Removed at Step 1 is - rcs(Function,3)
## Removed at Step 2 is - JobDemands
## Removed at Step 3 is - Smoking
## Removed at Step 4 is - SocialSupport
## Removed at Step 5 is - JobControl
## Removed at Step 6 is - Age
## Removed at Step 7 is - Tampascale
## Removed at Step 8 is - Gender
## Removed at Step 9 is - Radiation
##
## Selection correctly terminated,
## No more variables removed from the model
##
## fold 2
## Removed at Step 1 is - JobDemands
## Removed at Step 2 is - Smoking
## Removed at Step 3 is - JobControl
## Removed at Step 4 is - Age
## Removed at Step 5 is - Radiation
## Removed at Step 6 is - Tampascale
## Removed at Step 7 is - rcs(Function,3)
## Removed at Step 8 is - Gender
## Removed at Step 9 is - factor(Carrying)
## Removed at Step 10 is - SocialSupport
##
## Selection correctly terminated,
## No more variables removed from the model
##
## fold 3
## Removed at Step 1 is - Age
## Removed at Step 2 is - JobDemands
## Removed at Step 3 is - Radiation
## Removed at Step 4 is - Tampascale
## Removed at Step 5 is - Gender
## Removed at Step 6 is - Smoking
## Removed at Step 7 is - rcs(Function,3)
## Removed at Step 8 is - SocialSupport
## Removed at Step 9 is - JobControl
## Removed at Step 10 is - factor(Satisfaction)
## Removed at Step 11 is - factor(Carrying)
##
## Selection correctly terminated,
## No more variables removed from the model
res_cv
## $stats
## Train Test
## AUC 0.8866677 0.8377989
## Brier scaled 0.4771826 0.2751471
## Rsq 0.5484889 0.4137824
##
## $slope
## Intercept Slope
## -0.009177862 0.864415077
Back to Steps
We can use the slope value of 0.864415077 that was estimated in the previous step as a shrinkage factor, to prevent our model from being overfitted in new data. We do this by multiplying the pooled coefficients by the shrinkage factor and by determining a new intercept value that is aligned with the shrunken coefficients.
We use the pool_intadj function for this; it provides the adjusted coefficients and the new intercept value.
library(psfmi)
res <- pool_intadj(pool_lr, shrinkage_factor = 0.864415077)
res$int_adj
## [1] -4.74758
res$coef_shrink_pooled
## Pain Radiation factor(Satisfaction)2
## 0.7837489 0.4135237 -0.4526368
## factor(Satisfaction)3 factor(Carrying)2 factor(Carrying)3
## -2.3355398 1.0749584 1.5361325
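To illustrate what the adjusted model does (a sketch, not part of the psfmi workflow), the new intercept and the shrunken coefficients can be applied to one of the imputed datasets to obtain predicted probabilities. This assumes that the hand-built design matrix has its columns in the same order as res$coef_shrink_pooled above:
# Linear predictor and predicted risk of chronic low back pain in imputed dataset 1
d1   <- subset(data_comp, .imp == 1)
X    <- model.matrix(~ Pain + Radiation + factor(Satisfaction) + factor(Carrying), data = d1)
lp   <- X %*% c(res$int_adj, res$coef_shrink_pooled)
prob <- plogis(lp)
head(prob)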
The last step is to externally validate this adjusted model.
Back to Steps
We validate the model in an external dataset that is imputed five times due to missing data.
library(psfmi)
res_extval <- mivalext_lr(data.val = lbpmi_extval, nimp = 5, impvar = "Impnr",
Outcome = "Chronic", predictors = pool_lr$predictors_final,
lp.orig = c(res$int_adj, res$coef_shrink_pooled),
cal.plot = TRUE)
##
## Pooled performance measures over m = 5 imputed external validation datasets correctly estimated
res_extval$ROC
## $`ROC (logit)`
## 95% Low AUC 95% Up
## AUC (logit) 0.7365 0.8477 0.9173
##
## $`ROC (median)`
## 1st Qu. Median 3rd Qu.
## 0.84678 0.84772 0.85366
res_extval$R2_fixed
## $`Fisher Z (fixed)`
## [1] 0.45526
##
## $`Median (fixed)`
## 1st Qu. Median 3rd Qu.
## 0.45742 0.45831 0.47240
res_extval$HLtest
## D p df df2
## 0.89517 0.52245 8.00000 138.64820
res_extval$LP_pooled_ext
## intercept slope
## 0.00589 0.87747
In the external dataset the AUC is 0.84772 and the R-squared is 0.45. The Hosmer and Lemeshow test has a p-value of 0.52245, which means that the fit is satisfactory. This is also confirmed by the calibration plot, where the calibration curves in each imputed dataset are near the optimal (dashed) line. The slope (regression coefficients) has a value of 0.87747 and deviates slightly from the value of 1, which means that the coefficient values differ between the development and external datasets.
Back to Steps