In this vignette, I will introduce you to the ldt
package and its main features for dealing with Logistic and Probit
Regression models. You will learn how to estimate a binary model, make
predictions, and assess model uncertainty. Additionally, we will explore
the use of Principal Component Analysis as an alternative approach for
handling a large number of potential explanatory variables.
One of the key ideas behind ldt
is to minimize user
discretion by using a rule-based approach to select data. This approach
not only avoids discretion but also automates the process of searching
for the best models within a defined model set.
To demonstrate these features, I will create an artificial dataset with a dependent variable and both relevant and irrelevant explanatory variables. The dependent and relevant explanatory variables are sampled from a known binary model. While we can evaluate how well the estimation process finds the true parameters, our main focus will be on how to estimate, search, predict, and report results.
Let’s get started!
We start by assuming that we know the structure of the model. We can do this by simulating data from two known binary models. The following command generates the required samples:
num_obs <- 100
num_exo <- 4L
p_positive <- 0.4
max_weight <- 2
sample_l <- sim.bin(num_exo, num_obs, probit = FALSE, pPos = p_positive, maxWeight = max_weight)
sample_p <- sim.bin(sample_l$coef, num_obs, probit = TRUE, pPos = p_positive, maxWeight = max_weight)
print(sample_l$coef)
> [1] -0.56047565 -0.23017749 1.55870831 0.07050839
We know the parameters of the systems because they are included in
the output of the sim.bin
function. Note that there is a
logit and a probit model. Each model has one equation or dependent
variable. This equation has an intercept and 3 exogenous variables. The
sample size is 100. The coefficient vector of the logit model is
generated randomly and is listed in the output as sample_l$coef. The same coefficient vector is used to simulate the probit sample, so the parameters of the two models are identical. In these samples, about 40 percent of the observations are labeled as positive (because of the pPos argument). Finally, since max_weight is larger than 1, the observations are weighted, and the returned lists contain a w element that we should use in the estimation process.
The LaTeX code for the equation of the two models is included in the
eqLatex
elements of the output. It results in the following
representations:
Remember that these are the parameters of the system. We can use the
glm
function to estimate them. The following code shows how
to do this. In the first line we prepare the model formula, and then we fit two models, one for each sample:
eq_str <- paste0("Y ~ ", paste0(colnames(sample_l$x[,-1]), collapse = " + "))
fit_l <- glm(eq_str, data = data.frame(sample_l$y, sample_l$x),
family = binomial(link = "logit"), weights = sample_l$w)
fit_p <- glm(eq_str, data = data.frame(sample_p$y, sample_p$x),
             family = binomial(link = "probit"), weights = sample_p$w)
fit_sf_l <- sim.bin(fit_l$coefficients, probit = FALSE )
fit_sf_p <- sim.bin(fit_p$coefficients, probit = TRUE)
The last two lines are used for reporting the LaTeX formulas. We use the eqLatex element of the output, and this is the result:
Logit: \[P(Y = 1 | X_2, X_3, X_4) = \frac{1}{1 + e^{-(-0.46 - 0.36 X_2 + 1.82 X_3 + 0.26 X_4)}}\]
Probit: \[P(Y = 1 | X_2, X_3, X_4) = \Phi(-0.76 - 0.59 X_2 + 2.39 X_3 + 0.37 X_4)\]
Note that these representations are not very appropriate for an estimated model because they do not report the estimated standard errors, and there are better ways to do so. However, they suit our purpose here. Also, note that you can get better results (in terms of a smaller difference between the actual and estimated coefficients) by increasing the sample size.
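For example, the estimated coefficients together with their standard errors, z-statistics, and p-values can be inspected with the standard summary method for glm objects:
summary(fit_l)$coefficients  # logit: estimates, standard errors, z values, p-values
summary(fit_p)$coefficients  # probit: same layout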
The following code does the same by using the estim.bin
function in the ldt
package:
fit_l <- estim.bin(sample_l$y, sample_l$x[,-1], sample_l$w)
fit_p <- estim.bin(sample_p$y, sample_p$x[,-1], sample_p$w, linkFunc = "probit")
res_table <- coefs.table(list(Logit = fit_l, Probit = fit_p),
regInfo = c("obs"), formatLatex = FALSE)
The last line converts the estimation results into a table for presentation. Additional arguments can be used to control the format and the level of detail displayed in the table. The kable function can be used to render the table. This is the result:
|           | Logit   | Probit   |
|-----------|---------|----------|
| Intercept | -0.46** | -0.82*** |
| X1        | -0.36   | -0.47*** |
| X2        | 1.82*** | 2.17***  |
| X3        | 0.26    | 0.32**   |
| obs       | 100     | 100      |
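For reference, a minimal way to render such a table in an R Markdown document (assuming res_table is a matrix or data frame, which kable accepts) is:
knitr::kable(res_table)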
Differences between the results of the glm and estim.bin functions may be due to variations in initialization or optimization procedures.
The coefs.plot function was discussed in the vignette on SUR models. Since this vignette focuses on prediction rather than parameter estimation, it is not covered here.
A binary regression model can be used for binary classification. It involves using the model to predict the likelihood that a new case falls into one of two classes based on the values of the explanatory variables. The class is either positive (\(Y = 1\)) or negative (\(Y = 0\)). The regression model estimates the likelihood given the explanatory variables, i.e., \(P(Y=1 \mid X)\). The other probability follows as \(P(Y=0 \mid X) = 1 - P(Y=1 \mid X)\). Deciding the class of an observation given this probability needs a decision rule.
To continue our experiment, let’s focus on the logit model and
generate a sample of size 10 from the true model using the
sim.bin
function. We can predict using the
estim.bin
function in ldt
by setting the
newX
argument. This is the code:
sample_l_new <- sim.bin(sample_l$coef, 10, probit = FALSE)
fit_l <- estim.bin(sample_l$y, sample_l$x[,-1], sample_l$w, newX = sample_l_new$x[,-1])
We have three types of information: the actual class of each new observation (sample_l_new$y), the probabilities used in the simulation process (sample_l_new$p1), and the predicted probabilities (fit_l$projection[,2]). Let’s plot them:
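A minimal base-R sketch of such a plot (the colors and symbols follow the description below; y, p1, and the second column of projection are assumed to be numeric, as elsewhere in this vignette):
plot(as.numeric(sample_l_new$y), pch = 1, col = "black", ylim = c(0, 1),
     xlab = "Observation", ylab = "Class / probability")     # actual classes
points(as.numeric(sample_l_new$p1), pch = 1, col = "green")  # probabilities used in the simulation
points(fit_l$projection[, 2], pch = 3, col = "red")          # predicted probabilities
legend("topright", c("Actual class", "Actual probability", "Predicted probability"),
       col = c("black", "green", "red"), pch = c(1, 1, 3))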
As you can see, it’s possible to have positive observations with low probabilities and negative observations with high probabilities due to the random nature of the binomial distribution.
In practice, we do not see the green circles, i.e., \(P(Y=1 \mid X)\). We only see the actual class of the observations. The binary regression model predicts the green circles. This means that even if we knew the exact coefficient vector, the best we could do is predict the exact probability, not the actual class of the observation. A decision-making process is needed to convert the predicted probability into a specific class.
When using a binary regression model to make classification decisions, a common approach is to define a threshold value for the predicted probability. If the predicted probability for a case is greater than or equal to the threshold value, the case is classified as belonging to the positive class. If the predicted probability is less than the threshold value, the case is classified as belonging to the negative class.
However, the threshold is not a parameter of the model; it is just a decision rule. The choice of threshold affects the balance between sensitivity and specificity but cannot completely eliminate classification error. One might also design a more complex decision-making process than a single fixed threshold.

There are several ways to evaluate the predictive power of a binary regression model. One common approach is to use a confusion matrix to calculate accuracy, sensitivity, specificity, and precision. Other metrics include the AUC-ROC and the Brier score. The AUC-ROC represents the probability that a randomly chosen positive case will have a higher predicted probability than a randomly chosen negative case. A lower Brier score indicates better predictive performance.
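As a minimal illustration of such a decision rule (using base R, an arbitrary threshold of 0.5, and the simulated new sample from above), one could compute a simple confusion matrix as follows:
threshold <- 0.5                                            # an arbitrary but common choice
predicted_prob <- fit_l$projection[, 2]                     # predicted P(Y = 1 | X) for the new sample
predicted_class <- as.integer(predicted_prob >= threshold)  # apply the decision rule
table(actual = as.numeric(sample_l_new$y), predicted = predicted_class)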
Similar to the previous subsection, we can use the coefs.table function and report the results in a table:
res_table <- coefs.table(list(Logit = fit_l),
regInfo = c("obs", "aic", "sic", "aucIn", "brierIn"),
formatLatex = FALSE)
|           | Y       |
|-----------|---------|
| Intercept | -0.46** |
| X1        | -0.36   |
| X2        | 1.82*** |
| X3        | 0.26    |
| obs       | 100     |
| aic       | 161.12  |
| sic       | 171.55  |
| aucIn     | 0.83    |
| brierIn   | 0.16    |
One metric (specific to our experiment) is not present in the table. Its formula is \(a = \frac{\sum_{i=1}^{n}(p_i-\hat{p}_i)^2}{n}\) where \(p_i\) represents the actual probabilities (the green circles in the plot) and \(\hat{p}_i\) represents the predicted probabilities (the red plus signs in the plot). This metric is similar to the Brier score but uses actual probabilities. A value of zero indicates a perfect fit and larger values indicate lower explanatory power. The value for this experiment is 0.0035284.
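Assuming the p1 element of the simulated new sample holds the actual probabilities, this metric can be computed along these lines (the exact value depends on the seed and other options):
mean((as.numeric(sample_l_new$p1) - fit_l$projection[, 2])^2)  # mean squared difference between actual and predicted probabilities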
Let’s consider a more realistic situation where model uncertainty
exists. That’s where ldt
can help. In the previous
subsection, we knew all relevant explanatory variables. Here, we
consider a situation where there are some irrelevant variables too. We
limit the level of uncertainty and other practical issues by restricting
the number of these variables. The following code reflects our
assumptions:
sample_l$x <- cbind(sample_l$x, matrix(rnorm(num_obs * 50), ncol = 50,
dimnames = list(NULL,paste0("z",1:50))))
sample_l_new$x <- cbind(sample_l_new$x, matrix(rnorm(nrow(sample_l_new$x) * 50), ncol = 50,
dimnames = list(NULL,paste0("z",1:50))))
In our experiment, there are 50 irrelevant and 3 relevant variables. The number of irrelevant variables is relatively large, and their names start with the character z. The second line of code extends the out-of-sample data in the same way, for use in prediction.
The following code uses the search.bin
function to find
the actual model:
search_res <- search.bin(sample_l$y, sample_l$x[,-1], sample_l$w,
xSizes = c(1:4),
metricOptions = get.options.metric(typesIn = c("sic")))
The xSizes = c(1:4) part assumes that we know the number of relevant explanatory variables is less than 5. The metricOptions part shows that we use the SIC metric to evaluate and compare models.
This code is time-consuming and is not evaluated here. However, on my
system, the elapsed time is 77 seconds (the number of searched models is
317682). Note that if we change our previous guess and assume that the
maximum number of relevant variables is larger, for example 5, the size
of the practical model set becomes 3187367 (10 times larger) and this is
estimated in 977 seconds (12 times larger) on my system. Many factors
affect this time, including the optimization options. Also, check the parallel option in the get.options.search() function.
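For example, assuming the option is indeed named parallel and that search.bin accepts a searchOptions argument (as search.bin.stepwise does below), the search could be run along these lines:
search_options <- get.options.search(parallel = TRUE)
search_res <- search.bin(sample_l$y, sample_l$x[,-1], sample_l$w,
                         xSizes = c(1:4),
                         metricOptions = get.options.metric(typesIn = c("sic")),
                         searchOptions = search_options)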
One might reduce the number of potential explanatory variables using theory or statistical testing. Since ldt discourages user discretion, it provides a more systematic approach. The idea behind it is simple: estimate smaller models first, select promising variables, and then estimate larger models from the reduced set of potential explanatory variables. Here is the code:
x_size_steps <- list(c(1, 2), c(3), c(4), c(5))
count_steps <- c(NA, 20, 10, 9)
search_step_res <-
search.bin.stepwise(y = sample_l$y, x = sample_l$x, w = sample_l$w,
xSizeSteps = x_size_steps, countSteps = count_steps,
metricOptions = get.options.metric(typesIn = c("aic","sic", "auc", "brier")),
searchItems = get.items.search(bestK = 10))
> Warning in .SearchDc(y, x, w, xSizes, xPartitions, costMatrices, searchLogit, :
> Error occurred in the search process. See 'result$counts'.
search_step_res
> method: bin
> expected: 2,763, searched: 2,763 (100%), failed: 54 (2%)
> elapsed time: 0.06699518 minutes
> --------
> Failures:
> 1. matrix singularity: 54 (100%)
> --------
> 1. aic:
> Y (best=131.239)
> 2. sic:
> Y (best=146.87)
> 3. aucIn:
> Y (best=0.896)
> 4. brierIn:
> Y (best=0.122)
The first two lines define the steps. We use all variables (NA in count_steps means all) to estimate models with sizes given by the first element of x_size_steps. Then we select a number of variables based on the information provided by the best models and estimate models with sizes given by the second element of x_size_steps, and so on.
The size of the model set and the running time are greatly reduced. However, let’s see how this approach performs.
To study or report results, we should use the summary
function. The output of a search project in ldt
does not
contain estimation results but only the minimum level of information to
replicate them. The summary
function does the job and
estimates the models. Here is the code:
ssum <- summary(search_step_res,
y = sample_l$y, x = sample_l$x, w = sample_l$w,
newX = sample_l_new$x,
printMsg = FALSE)
Usually, there is more than one model in the summary
output. This is because the output is first “target-variable-specific”
and second “evaluation-specific”. In this application, there is just one
target but we requested four different types of evaluations in the
get.options.metric function. We can report the results by creating a
list of estimated models and using the coefs.table function:
mod_list <- list("SIC" = ssum$sic$target1$model$bests$best1,
"AIC" = ssum$aic$target1$model$bests$best1,
"AUC" = ssum$aucIn$target1$model$bests$best1,
"Brier" = ssum$brierIn$target1$model$bests$best1)
res_table <- coefs.table(mod_list,
regInfo = c("obs", "aic", "sic", "aucIn", "brierIn"),
formatLatex = FALSE)
Since we set the newX
argument in the
summary
function, we can plot the predictions:
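For illustration, assuming each re-estimated best model carries a projection element similar to the estim.bin output, the predictions of the model selected by SIC could be compared with the actual probabilities along these lines:
best_sic <- ssum$sic$target1$model$bests$best1            # best model according to SIC
plot(as.numeric(sample_l_new$p1), col = "green", pch = 1, ylim = c(0, 1),
     xlab = "Observation", ylab = "Probability")          # actual probabilities
points(best_sic$projection[, 2], col = "red", pch = 3)    # predicted probabilities (assumed element)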
With the current seed and other options, all metrics point to a common model. In this model, a relevant variable (X3) is missing, and three irrelevant variables are present with non-significant coefficients.
Let’s see if we can do better by using out-of-sample evaluation. The following code is similar to the code in the previous section, but we define an out-of-sample evaluation process:
metric_options <- get.options.metric(typesOut = c("auc", "brier"),
seed = -seed, simFixSize = 5, trainRatio = 0.75)
search_step_res <-
search.bin.stepwise(y = sample_l$y, x = sample_l$x, w = sample_l$w,
xSizeSteps = x_size_steps, countSteps = count_steps,
metricOptions = metric_options,
searchItems = get.items.search(bestK = 10),
searchOptions = get.options.search(printMsg = FALSE))
> Warning in .SearchDc(y, x, w, xSizes, xPartitions, costMatrices, searchLogit, :
> Error occurred in the search process. See 'result$counts'.
search_step_res
> method: bin
> expected: 2,276, searched: 2,276 (100%), failed: 54 (2.4%)
> elapsed time: 0.06703867 minutes
> --------
> Failures:
> 1. matrix singularity: 54 (100%)
> --------
> 1. aucOut:
> Y (best=0.931)
> 2. brierOut:
> Y (best=0.094)
We use 75 percent of the observations (determined by trainRatio) for estimation and the rest for testing. We repeat this experiment 5 times (determined by simFixSize). We can report the results in the same way as before:
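The following is only a sketch; it assumes that the best models are now stored under the out-of-sample metric names aucOut and brierOut in the summary output, mirroring the in-sample names used earlier:
ssum_out <- summary(search_step_res,
                    y = sample_l$y, x = sample_l$x, w = sample_l$w,
                    newX = sample_l_new$x, printMsg = FALSE)
mod_list_out <- list("AUC (out)" = ssum_out$aucOut$target1$model$bests$best1,
                     "Brier (out)" = ssum_out$brierOut$target1$model$bests$best1)
res_table_out <- coefs.table(mod_list_out, regInfo = c("obs"), formatLatex = FALSE)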
You can get better results by increasing the number of observations. Also, you can change optimization algorithm options or other search options to improve performance.