orf: ordered random forests

orf: Introduction

The R package orf implements the Ordered Forest estimator of Lechner and Okasa (2019). The Ordered Forest flexibly estimates the conditional probabilities of models with ordered categorical outcomes (so-called ordered choice models). In addition to the functionality of common machine learning algorithms, the orf package provides functions for estimating marginal effects and for conducting statistical inference on them, and thus delivers output similar to that of standard econometric models for ordered choice. The core forest algorithm relies on the fast C++ forest implementation from the ranger package (Wright and Ziegler 2017).

orf: Installation

To install the latest version released on CRAN, use:

install.packages("orf", dependencies = c("Imports", "Suggests"))

This makes sure that all required packages are installed as well. Note that installing the package directly from source requires a C++ compiler. Windows users additionally need the Rtools collection.

orf: Algorithm

The main function of the package is orf, which implements the Ordered Forest estimator as developed in Lechner and Okasa (2019). The main idea is to provide a flexible alternative to standard econometric ordered choice models (models for a categorical dependent variable with an inherent ordering) such as ordered logit or ordered probit, while still recovering essentially the same output as the standard parametric models. As such, the Ordered Forest estimator provides not only estimates of the conditional ordered choice probabilities, i.e. $P[Y=m|X=x]$, but also estimates of marginal effects, i.e. how the conditional probabilities vary with changes in $X$. Further, the orf estimator also provides inference for the marginal effects as well as for the conditional probabilities.

More formally, consider an ordered categorical outcome variable $Y_i \in \{1,...,M \}$. Then, the algorithm can be described as follows:

Algorithm: Ordered Forest

input: Data ($X,Y$)
output: Class Probabilities $\hat{P}[Y=m \mid X=x]$

1 procedure Ordered Forest
- 2 subprocedure Cumulative Probabilities
  - 3 for $m=1,...,M-1$:
    - 4 create binary indicator variables according to: $Y_{m,i}=\mathbf{1}(Y_i \leq m)$
    - 5 estimate regression random forest for: $P[Y_{m,i}=1 \mid X_i=x]$
    - 6 predict conditional probabilities as: $\hat{Y}_{m,i}=\hat{P}[Y_{m,i}=1 \mid X_i=x]$
  - 7 endfor
- 8 end subprocedure
- 9 subprocedure Class Probabilities
  - 10 for $m=1,...,M$:
    - 11 compute probabilities for each class as: $\hat{P}_{m,i}=\hat{Y}_{m,i}-\hat{Y}_{m-1,i}$, where $\hat{Y}_{0,i}=0$ and $\hat{Y}_{M,i}=1$ by definition
    - 12 if $\hat{P}_{m,i}<0$
      - 13 set $\hat{P}_{m,i}=0$ and renormalize: $\hat{P}_{m,i}=\frac{\hat{P}_{m,i}}{\sum^{M}_{k=1}\hat{P}_{k,i}}$
    - 14 endif
  - 15 endfor
- 16 end subprocedure
17 end procedure

Hence, the main idea of the Ordered Forest is, first, to transform the ordered model into multiple overlapping binary models, which are estimated by regression forests and thus yield predictions for the cumulative probabilities. Second, the estimated cumulative probabilities are differenced to isolate the actual class probabilities. As such, the prediction for the conditional probability of a particular ordered class $m$ is given by subtracting two adjacent cumulative probabilities. Notice that this procedure uses the fact that the cumulative probability of the highest class equals one by definition, so that the class probabilities sum up to unity.
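The two steps described above can be sketched in a few lines of R. This is only an illustrative sketch, not the package's implementation: the helper name ordered_forest_sketch and the simulated data are hypothetical, the forests are fitted with ranger directly, and the predictions are plain in-sample predictions without honesty or weight-based inference.

```r
library(ranger)

# Hypothetical helper (not part of orf): class probabilities for an
# outcome y in 1..M given a data frame of covariates X.
ordered_forest_sketch <- function(X, y, num.trees = 200) {
  M <- max(y)
  n <- nrow(X)
  # cumulative probabilities; columns correspond to m = 0, 1, ..., M
  cum <- matrix(0, nrow = n, ncol = M + 1)
  cum[, M + 1] <- 1  # P[Y <= M | X] = 1 by definition
  for (m in seq_len(M - 1)) {
    # binary indicator 1(Y <= m) used as a regression outcome
    d <- data.frame(ind = as.numeric(y <= m), X)
    fit <- ranger(ind ~ ., data = d, num.trees = num.trees)
    cum[, m + 1] <- predict(fit, data = X)$predictions
  }
  # difference adjacent cumulative probabilities
  probs <- cum[, -1, drop = FALSE] - cum[, -(M + 1), drop = FALSE]
  # truncate negative values at zero and renormalize each row
  probs[probs < 0] <- 0
  probs / rowSums(probs)
}

# simulated data (assumption): three ordered classes from a latent index
set.seed(1)
n <- 300
X <- data.frame(x1 = rnorm(n), x2 = rbinom(n, 1, 0.5))
latent <- X$x1 + X$x2 + rlogis(n)
y <- as.integer(cut(latent, breaks = c(-Inf, 0, 1, Inf)))

p_hat <- ordered_forest_sketch(X, y)
head(round(p_hat, 3))
```

Each row of p_hat is non-negative and sums to one, which is what the truncation and normalization step of the algorithm is meant to guarantee.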

orf: Ordered Forest

The Ordered Forest provided in the orf function estimates the conditional ordered choice probabilities as described by the above algorithm. Additionally, weight-based inference for the probability predictions can be conducted. If inference is desired, the Ordered Forest must be estimated with honesty and subsampling. Honesty is defined as in Lechner (2019) and thus refers to the honest forest, instead of the honest tree as is the case in Wager and Athey (2018). This means that the honest sample split takes place once, before the forest is estimated, rather than separately before each tree is estimated. This might somewhat reduce the efficiency of the estimator. However, if only prediction is desired, estimation without honesty and with bootstrapping, as in the classical random forests of Breiman (2001), is recommended for optimal prediction performance.

To estimate the Ordered Forest, the user must supply the data in the form of a matrix of covariates ($X$) and a vector of outcomes ($Y$) to the orf function. These data inputs are the only arguments without defaults and must therefore always be specified. Optional arguments include the classical forest hyperparameters: the number of trees, num.trees, the number of randomly selected features, mtry, and the minimum leaf size, min.node.size. The forest building scheme is regulated by the replace argument, meaning bootstrapping if replace = TRUE or subsampling if replace = FALSE. In the case of subsampling, the sample.fraction argument regulates the subsampling rate. Further, an honest forest is estimated if the honesty argument is set to TRUE, which is also the default. The fraction of the sample used for the honest estimation is regulated by the honesty.fraction argument. The default setting conducts a 50:50 sample split, which is generally advisable for optimal performance.

The inference procedure of the Ordered Forest is based on the forest weights as suggested in Lechner and Okasa (2019) and is controlled by the inference argument. Note that such weight-based inference is a computationally demanding exercise due to the estimation of the forest weights, and as such a longer computation time is to be expected. Lastly, the importance argument turns the permutation-based variable importance on and off. The variable importance for the Ordered Forest is a simple class-weighted importance of the underlying forests.
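As a brief illustration of the two use cases distinguished above, the following sketch contrasts a prediction-oriented and an inference-oriented call. It only uses the arguments named in the text; the object names orf_pred and orf_inf are arbitrary, and the second call can take noticeably longer because of the forest weights.

```r
library(orf)

# example data shipped with the package
data(odata)
Y <- as.numeric(odata[, 1])
X <- as.matrix(odata[, -1])

# prediction only: no honesty, bootstrapping as in classical random forests
orf_pred <- orf(X, Y, honesty = FALSE, replace = TRUE, inference = FALSE)

# inference desired: honesty and subsampling are required
orf_inf <- orf(X, Y, honesty = TRUE, replace = FALSE, inference = TRUE)

print(orf_pred)
print(orf_inf)
```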

Additionally, standard R functions such as summary, predict, or plot are provided as well to facilitate the classical R user experience. Below you will find a few examples of how to use the orf function to estimate the Ordered Forest.

orf: `odata`

First, load the example data included in the orf package. The data contain an ordered categorical outcome variable with 3 distinct ordered classes, $Y\in\{1,2,3\}$, and a set of four covariates, $X1$, $X2$, $X3$ and $X4$, of different types. The first and the last covariate, i.e. $X1$ and $X4$, are continuous, the second one, $X2$, is ordered categorical and the third one, $X3$, is binary. Furthermore, within the data generating process (DGP), covariates $X1$, $X2$ and $X3$ enter in a linear form with a positive effect on the outcome, while $X4$ has no effect and thus serves as a noise variable in the dataset. For the exact DGP, see ?orf::odata.

# load example data
data(odata)

# specify response and covariates
Y <- as.numeric(odata[, 1])
X <- as.matrix(odata[, -1])
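
Before estimation, it can be useful to check that the data match the description above. The following lines are a small optional sketch using only base R functions; no particular output is assumed here.

```r
library(orf)
data(odata)

# column types of the outcome and the four covariates
str(odata)

# frequencies of the three ordered outcome classes (first column)
table(odata[, 1])
```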

orf: `orf`, `print.orf`, `summary.orf`, `plot.orf`

Now, estimate the Ordered Forest using the orf function with the default settings and supplying only the required data inputs. Print the output of the estimation procedure with the S3 method print.orf.

# estimate Ordered Forest with default settings
orf_model <- orf(X, Y)

# print output of the orf estimation
print(orf_model)
#> Ordered Forest object of class orf 
#> 
#> Number of Categories:             3 
#> Sample Size:                      1000 
#> Number of Trees:                  1000 
#> Build:                            Subsampling 
#> Mtry:                             2 
#> Minimum Node Size:                5 
#> Honest Forest:                    TRUE 
#> Weight-Based Inference:           FALSE

Repeat the orf estimation with custom settings for the hyperparameters and summarize the estimation output with the S3 method summary.orf.

# estimate Ordered Forest with custom settings
orf_model <- orf(X, Y,
                 num.trees = 1000, mtry = 2, min.node.size = 5,
                 replace = FALSE, sample.fraction = 0.5,
                 honesty = TRUE, honesty.fraction = 0.5,
                 inference = FALSE, importance = FALSE)

# show summary of the orf estimation
summary(orf_model)
#> Summary of the Ordered Forest Estimation 
#>                                
#> type             Ordered Forest
#> categories       3             
#> build            Subsampling   
#> num.trees        1000          
#> mtry             2             
#> min.node.size    5             
#> replace          FALSE         
#> sample.fraction  0.5           
#> honesty          TRUE          
#> honesty.fraction 0.5           
#> inference        FALSE         
#> importance       FALSE         
#> trainsize        500           
#> honestsize       500           
#> features         4             
#> mse              0.50931       
#> rps              0.15642

The summary of the estimated Ordered Forest provides basic information about the estimation and its inputs, as well as the out-of-bag prediction accuracy measured in terms of the classical mean squared error (MSE) and the ranked probability score (RPS). Furthermore, the summary.orf command provides a latex argument which generates a LaTeX-coded table, so that the results can be transferred directly into the research documentation. In addition, the orf object contains further elements that can be accessed with the $\$$ operator.
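To make the two accuracy measures concrete, the sketch below computes them for a small hand-made matrix of predicted probabilities. The helper names are hypothetical and the definitions are assumptions: the MSE is taken to be the multi-class Brier score (squared differences between predicted probabilities and class indicators), and the RPS is the same comparison on cumulative probabilities, scaled by $M-1$. The package's internal computation may differ in details.

```r
# Hypothetical helpers, not the package's internal code.
class_indicators <- function(y, M) {
  # n x M matrix with a 1 in the column of the observed class
  outer(y, seq_len(M), FUN = "==") * 1
}

brier_mse <- function(probs, y) {
  obs <- class_indicators(y, ncol(probs))
  mean(rowSums((probs - obs)^2))
}

rps <- function(probs, y) {
  M <- ncol(probs)
  obs <- class_indicators(y, M)
  # cumulative predicted and observed distributions, row by row
  cum_p <- t(apply(probs, 1, cumsum))
  cum_o <- t(apply(obs, 1, cumsum))
  mean(rowSums((cum_p - cum_o)^2) / (M - 1))
}

# toy predictions for three observations with observed classes 1, 2, 3
probs <- rbind(c(0.7, 0.2, 0.1),
               c(0.2, 0.5, 0.3),
               c(0.1, 0.3, 0.6))
y <- c(1, 2, 3)

brier_mse(probs, y)
rps(probs, y)
```

Unlike the MSE, the RPS works on cumulative probabilities and therefore penalizes probability mass placed on classes far from the observed one more heavily, which is why it is a natural companion measure for ordered outcomes.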

For a graphical representation of the estimated probabilities, the plot.orf command plots the probability distributions estimated by the Ordered Forest. The plots visualize the estimated probability density of each outcome class, i.e. $\hat{P}[Y=1\mid X=x]$, $\hat{P}[Y=2\mid X=x]$, and $\hat{P}[Y=3\mid X=x]$, in contrast to the actual observed outcome class, and as such provide a visual inspection of the underlying probability predictions for the outcome classes. The dashed lines within the density plots locate the means of the respective probability distributions.

The example below demonstrates the usage of the plot.orf command.

# plot the estimated probability distributions
plot(orf_model)

orf: `predict.orf`, `print.predict.orf`, `summary.predict.orf`

The command predict.orf predicts the conditional choice probabilities for new data points based on the estimated Ordered Forest object. If no new data is supplied to the newdata argument, the in-sample fitted values are returned. The user can additionally specify the type of the predictions. If probability predictions are desired, type = "p" or type = "probs" should be specified (this is also the default). For class predictions, set type = "c" or type = "class". In this case, the predicted class is the class with the highest predicted probability. Furthermore, weight-based inference can be conducted for the probability predictions. If inference is desired, the supplied Ordered Forest must be estimated with honesty and subsampling. If only prediction is desired, estimation without honesty and with bootstrapping is recommended for optimal prediction performance.

The example below illustrates the predict.orf command for in-sample predictions and the subsequent information about the predictions printed to the console.

# get fitted values with the estimated orf
orf_fitted <- predict(orf_model)

# print orf fitted values
print(orf_fitted)
#> Ordered Forest Prediction object of class orf.prediction 
#> 
#> Prediction Type:                  probability 
#> Number of Categories:             3 
#> Sample Size:                      1000 
#> Number of Trees:                  1000 
#> Build:                            Subsampling 
#> Mtry:                             2 
#> Minimum Node Size:                5 
#> Honest Forest:                    TRUE 
#> Weight-Based Inference:           FALSE
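
The rule behind class predictions, i.e. choosing the class with the highest predicted probability, can be sketched with base R on a hand-made probability matrix. The matrix below is purely illustrative and is not output of the package.

```r
# illustrative predicted probabilities for three observations
probs <- rbind(c(0.6, 0.3, 0.1),
               c(0.2, 0.5, 0.3),
               c(0.1, 0.2, 0.7))
colnames(probs) <- c("Category 1", "Category 2", "Category 3")

# index of the column with the highest probability in each row
pred_class <- max.col(probs, ties.method = "first")
pred_class
```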

Now, divide the data into a train and a test set for an out-of-sample prediction exercise and summarize the prediction results. As above, a LaTeX table of the prediction summary can be generated directly with the latex argument of the summary.predict.orf command.

# randomly assign 80% of the observations to the train set
idx <- sample(nrow(odata), size = 0.8 * nrow(odata))

# train set (numeric outcome vector and covariate matrix, as above)
Y_train <- as.numeric(odata[idx, 1])
X_train <- as.matrix(odata[idx, -1])

# test set
Y_test <- as.numeric(odata[-idx, 1])
X_test <- as.matrix(odata[-idx, -1])

# estimate Ordered Forest
orf_train <- orf(X_train, Y_train)

# predict the probabilities with the estimated orf
orf_test <- predict(orf_train, newdata = X_test, type = "probs", inference = FALSE)

# summary of the orf predictions
summary(orf_test)
#> Summary of the Ordered Forest Prediction 
#>                                           
#> type             Ordered Forest Prediction
#> prediction.type  probability              
#> categories       3                        
#> build            Subsampling              
#> num.trees        1000                     
#> mtry             2                        
#> min.node.size    5                        
#> replace          FALSE                    
#> sample.fraction  0.5                      
#> honesty          TRUE                     
#> honesty.fraction 0.5                      
#> inference        FALSE                    
#> sample.size      200

orf: `margins.orf`, `print.margins.orf`, `summary.margins.orf`

Besides the estimation and prediction of the conditional choice probabilities, the Ordered Forest also enables the estimation of marginal effects, i.e. how these probabilities vary with changes in the covariates. margins.orf estimates marginal effects at the mean, at the median, or the mean marginal effects, depending on the eval argument. The evaluation window for the marginal effects can be regulated through the window argument, which is defined as a share of the standard deviation of the particular covariate $X$, with the default set to window = 0.1. Furthermore, new data for which marginal effects should be estimated can be supplied using the newdata argument, as long as the new data lie within the support of $X$. In addition to the estimation of the marginal effects, weight-based inference for the effects is supported as well, controlled by the inference argument. Note again that the inference procedure is a computationally demanding exercise due to the estimation of the forest weights.

Furthermore, the marginal effect estimation procedure depends on the type of the particular covariate $X$. On the one hand, for continuous covariates such as $X1$ and $X4$ in this example, the marginal effects are estimated as a derivative using a two-sided numeric approximation. On the other hand, for discrete covariates such as $X2$ and $X3$ in this example, the marginal effects are estimated as a discrete change. In the case of binary variables such as $X3$, the marginal effect is estimated as the difference in the conditional probabilities evaluated at $X=1$ and $X=0$, respectively. In the case of categorical variables such as $X2$, the conditional probabilities in the difference are evaluated at the mean of $X$ rounded up and down, respectively. For a detailed discussion of these quantities see Lechner and Okasa (2019).
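The two types of marginal effects can be illustrated with a toy example. In the sketch below, the function toy_probs is a hypothetical stand-in for the fitted forest (an ordered logit with made-up coefficients and cut points), and the covariates are simulated; only the evaluation logic, a two-sided difference with a window of 0.1 standard deviations and a discrete change from 0 to 1, follows the description above.

```r
# Toy probability function standing in for a fitted forest (assumption):
# ordered logit with two cut points; returns P[Y = m | x] for m = 1, 2, 3.
toy_probs <- function(x1, x3) {
  index <- 0.8 * x1 + 0.5 * x3
  cum <- c(plogis(-0.5 - index), plogis(0.5 - index), 1)
  diff(c(0, cum))
}

set.seed(1)
x1 <- rnorm(500)            # continuous covariate
x3 <- rbinom(500, 1, 0.5)   # binary covariate
window <- 0.1               # share of the standard deviation

# continuous covariate: two-sided difference around the mean of x1
h <- window * sd(x1)
x1_up <- mean(x1) + h
x1_down <- mean(x1) - h
me_x1 <- (toy_probs(x1_up, mean(x3)) - toy_probs(x1_down, mean(x3))) /
  (x1_up - x1_down)

# binary covariate: discrete change in probabilities from x3 = 0 to x3 = 1
me_x3 <- toy_probs(mean(x1), 1) - toy_probs(mean(x1), 0)

round(rbind(x1 = me_x1, x3 = me_x3), 4)
```

Because the class probabilities sum to one at every evaluation point, the effects across the three classes sum to zero for each covariate.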

The example below shows the usage of the margins.orf command with default settings and prints the basic estimation information together with the estimated effects for each covariate and each outcome class.

# estimate marginal effects of the orf
orf_margins <- margins(orf_model)

# print the results of the marginal effects estimation
print(orf_margins)
#> Ordered Forest Margins object of class margins.orf 
#> 
#> Evaluation Type:                  mean 
#> Evaluation Window:                0.1 
#> Number of Categories:             3 
#> New Data:                         FALSE 
#> Number of Trees:                  1000 
#> Build:                            Subsampling 
#> Mtry:                             2 
#> Minimum Node Size:                5 
#> Honest Forest:                    TRUE 
#> Weight-Based Inference:           FALSE 
#> 
#> ORF Marginal Effects: 
#> 
#>    Category 1 Category 2 Category 3
#> X1    -0.1212    -0.0098     0.1310
#> X2    -0.1137    -0.0228     0.1365
#> X3    -0.1220    -0.0164     0.1385
#> X4     0.0007     0.0003    -0.0010

Now, estimate the mean marginal effects with weight-based inference and summarize the estimation output as well as the estimated effects together with the inference results. Additionally, summary.margins.orf also supports the LaTeX summary table with the latex argument.

# estimate marginal effects of the orf with inference
orf_margins <- margins(orf_model, eval = "mean", window = 0.1,
                       inference = TRUE, newdata = NULL)

# summarize the results of the marginal effects estimation
summary(orf_margins)
#> Summary of the Ordered Forest Margins 
#> 
#>                                         
#> type              Ordered Forest Margins
#> evaluation.type   mean                  
#> evaluation.window 0.1                   
#> new.data          FALSE                 
#> categories        3                     
#> build             Subsampling           
#> num.trees         1000                  
#> mtry              2                     
#> min.node.size     5                     
#> replace           FALSE                 
#> sample.fraction   0.5                   
#> honesty           TRUE                  
#> honesty.fraction  0.5                   
#> inference         TRUE                  
#> 
#> ORF Marginal Effects: 
#> 
#> --------------------------------------------------------------------------- 
#> X1 
#>                    Class      Effect     StdErr     tValue     pValue       
#>                      1       -0.1212     0.0221    -5.4740     0.0000   ***      
#>                      2       -0.0098     0.0211    -0.4634     0.6431            
#>                      3        0.1310     0.0283     4.6334     0.0000   ***      
#> X2 
#>                    Class      Effect     StdErr     tValue     pValue       
#>                      1       -0.1137     0.0251    -4.5208     0.0000   ***      
#>                      2       -0.0228     0.0354    -0.6443     0.5194            
#>                      3        0.1365     0.0456     2.9953     0.0027   ***      
#> X3 
#>                    Class      Effect     StdErr     tValue     pValue       
#>                      1       -0.1220     0.0457    -2.6682     0.0076   ***      
#>                      2       -0.0164     0.0447    -0.3676     0.7132            
#>                      3        0.1385     0.0605     2.2903     0.0220   **       
#> X4 
#>                    Class      Effect     StdErr     tValue     pValue       
#>                      1        0.0007     0.0014     0.4761     0.6340            
#>                      2        0.0003     0.0017     0.1764     0.8600            
#>                      3       -0.0010     0.0021    -0.4668     0.6406            
#> --------------------------------------------------------------------------- 
#> Significance levels correspond to: *** .< 0.01, ** .< 0.05, * .< 0.1 
#> ---------------------------------------------------------------------------
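
The relation between the reported columns can be checked by hand. The printed numbers appear consistent with a t-value equal to the effect divided by its standard error and a two-sided p-value from the standard normal distribution; this is an inference from the table above, not a statement about the package internals. The sketch uses the rounded values for X2, class 3, so small rounding differences relative to the table are to be expected.

```r
# X2, class 3, rounded values taken from the summary above
effect <- 0.1365
std_err <- 0.0456

t_value <- effect / std_err
# two-sided p-value under a standard normal approximation (assumption)
p_value <- 2 * (1 - pnorm(abs(t_value)))

c(t = t_value, p = p_value)
```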

orf: Applications

The Ordered Forest estimator is currently used by the Swiss Institute for Empirical Economic Research (SEW-HSG) of the University of St.Gallen, Switzerland in the Soccer Analytics project for the probability predictions of win, draw and loss in soccer matches in the German Bundesliga and the Swiss Super League. More details about the soccer predictions can be found in Goller et al. (2021) and the most recent predictions are listed online at SEW Soccer Analytics (GER), SEW Soccer Analytics (SUI) and on Twitter.

orf: References

Breiman, L. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.

Goller, Daniel, Michael C Knaus, Michael Lechner, and Gabriel Okasa. 2021. “Predicting Match Outcomes in Football by an Ordered Forest Estimator.” In A Modern Guide to Sports Economics, 335–55. Edward Elgar Publishing.

Lechner, Michael. 2019. “Modified Causal Forests for Estimating Heterogeneous Causal Effects.” arXiv Preprint arXiv:1812.09487. https://arxiv.org/abs/1812.09487.

Lechner, Michael, and Gabriel Okasa. 2019. “Random Forest Estimation of the Ordered Choice Model.” arXiv Preprint arXiv:1907.02436. https://arxiv.org/abs/1907.02436.

Wager, Stefan, and Susan Athey. 2018. “Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests.” Journal of the American Statistical Association 113 (523): 1228–42. https://doi.org/10.1080/01621459.2017.1319839.

Wright, Marvin N., and Andreas Ziegler. 2017. “ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” Journal of Statistical Software 77 (1): 1–17. https://doi.org/10.18637/jss.v077.i01.


Gabriel Okasa and Michael Lechner