--- title: "Choice prediction" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Choice prediction} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: ../inst/REFERENCES.bib link-citations: yes --- ```{r, label = setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "img/", fig.align = "center", fig.dim = c(8, 6), out.width = "75%" ) library("RprobitB") options("RprobitB_progress" = FALSE) ``` This vignette^[This vignette is built using R `r paste(R.Version()$major, R.Version()$minor, sep = ".")` with the `{RprobitB}` `r utils::packageVersion("RprobitB")` package.] discusses in-sample and out-of-sample prediction within `{RprobitB}`. The former case refers to reproducing the observed choices on the basis of the covariates and the fitted model and subsequently using the deviations between prediction and reality as an indicator for the model performance. The latter means forecasting choice behavior for changes in the choice attributes. For illustration, we revisit the probit model of travelers deciding between two fictional train route alternatives from [the vignette on model fitting](https://loelschlaeger.de/RprobitB/articles/v03_model_fitting.html): ```{r, message = FALSE} form <- choice ~ price + time + change + comfort | 0 data <- prepare_data(form = form, choice_data = train_choice, id = "deciderID", idc = "occasionID") model_train <- fit_model( data = data, scale = "price := -1" ) ``` ## Reproducing the observed choices `{RprobitB}` provides a `predict()` method for `RprobitB_fit` objects. Per default, the method returns a confusion matrix, which gives an overview of the in-sample prediction performance: ```{r, predict-model-train} predict(model_train) ``` By setting the argument `overview = FALSE`, the method instead returns predictions on the level of individual choice occasions: ```{r, predict-model-train-indlevel} pred <- predict(model_train, overview = FALSE) head(pred, n = 10) ``` Among the three incorrect predictions shown here, the one for decider `id = 1` in choice occasion `idc = 8` is the most outstanding. Alternative `B` was chosen although the model predicts a probability of 76\% for alternative `A`. We can use the convenience function `get_cov()` to extract the characteristics of this particular choice situation: ```{r, model-train-covs} get_cov(model_train, id = 1, idc = 8) ``` The trip option `A` was 20€ cheaper and 30 minutes faster, which by our model outweighs the better comfort class for alternative `B`, recall the estimated effects: ```{r, model-train-coeffs} coef(model_train) ``` The misclassification can be explained by preferences that differ from the average decider (choice behaviour heterogeneity), or by unobserved factors that influenced the choice. Indeed, the variance of the error term was estimated high: ```{r, model-train-Sigma} point_estimates(model_train)$Sigma ``` Apart from the prediction accuracy, the model performance can be evaluated more nuanced in terms of sensitivity and specificity. The following snippet exemplary shows how to visualize these measures by means of a receiver operating characteristic (ROC) curve [@Fawcett2006], comparing the prediction performance to a model that only uses the price as explanatory variable: ```{r, roc-example, warning = FALSE, message = FALSE, out.width = "50%", fig.dim = c(6,6)} model_train_basic <- update(model_train, form = choice ~ price | 0) plot_roc(model_train, model_train_basic) ``` The prediction performance of `model_train` is better, because its ROC curve lies above. ## Forecasting choice behavior The `predict()` method has an additional `data` argument. Per default, `data = NULL`, which results into the in-sample case outlined above. Alternatively, `data` can be either - an `RprobitB_data` object (for example the test subsample derived from the `train_test()` function, see [the vignette on choice data](https://loelschlaeger.de/RprobitB/articles/v02_choice_data.html)), - or a data frame of custom choice characteristics. We demonstrate the second case in the following. Assume that a train company wants to anticipate the effect of a price increase on their market share. By our model, increasing the ticket price from 100€ to 110€ (ceteris paribus) draws 15\% of the customers to the competitor who does not increase their prices. ```{r, predict-model-train-given-covs-1} predict( model_train, data = data.frame( "price_A" = c(100, 110), "price_B" = c(100, 100) ), overview = FALSE ) ``` However, offering a better comfort class compensates for the higher price and even results in a gain of 7\% market share: ```{r, predict-model-train-given-covs-2} predict( model_train, data = data.frame( "price_A" = c(100, 110), "comfort_A" = c(1, 0), "price_B" = c(100, 100), "comfort_B" = c(1, 1) ), overview = FALSE ) ``` ## References