Introduction to the Ebrahim-Farrington Goodness-of-Fit Test

Ebrahim Khaled Ebrahim

2025-08-27

Introduction

The ebrahim.gof package implements the Ebrahim-Farrington goodness-of-fit test for logistic regression models. This test is particularly effective for binary data and sparse datasets, providing an improved alternative to the traditional Hosmer-Lemeshow test.

Background and Motivation

Goodness-of-fit testing is crucial in logistic regression to assess whether the fitted model adequately describes the data. The most commonly used test is the Hosmer-Lemeshow test, but it has several limitations:

  1. Limited power for detecting certain types of model misspecification
  2. Dependence on the grouping strategy, which can affect results
  3. Poor performance with sparse data or continuous covariates

The Ebrahim-Farrington test addresses these limitations by using a modified Pearson chi-square statistic based on Farrington’s (1996) theoretical framework, but simplified for practical implementation with binary data.

Installation and Loading

# Install from GitHub
devtools::install_github("ebrahimkhaled/ebrahim.gof")

# Load the package
library(ebrahim.gof)

Basic Usage

The main function ef.gof() performs the goodness-of-fit test:

# Simulate binary data
set.seed(123)
n <- 500
x <- rnorm(n)
linpred <- 0.5 + 1.2 * x
prob <- plogis(linpred)  # Convert to probabilities
y <- rbinom(n, 1, prob)

# Fit logistic regression
model <- glm(y ~ x, family = binomial())
predicted_probs <- fitted(model)

# Perform Ebrahim-Farrington test
result <- ef.gof(y, predicted_probs, G = 10)
print(result)
#>                 Test Test_Statistic   p_value
#> 1 Ebrahim-Farrington      -1.250567 0.8944537

Understanding the Test Statistic

For binary data with automatic grouping, the Ebrahim-Farrington test statistic is:

\[Z_{EF} = \frac{T_{EF} - (G - 2)}{\sqrt{2(G-2)}}\]

Where:

  • \(T_{EF}\) is the modified Pearson chi-square statistic
  • \(G\) is the number of groups
  • under \(H_0\), \(Z_{EF}\) follows a standard normal distribution

The null hypothesis is that the model fits the data adequately.
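
As a quick illustration, the sketch below plugs a hypothetical value of \(T_{EF}\) into the formula by hand. The value T_EF <- 3.0 is not produced by the package; it was chosen so the result roughly matches the basic-usage example above, and the upper-tail p-value convention is inferred from that example's output rather than taken from package documentation.

# Minimal, illustrative sketch: the standardization formula evaluated by hand
# (not the package's internal code)
T_EF <- 3.0   # hypothetical modified Pearson statistic, chosen to match the example
G <- 10       # number of groups

Z_EF <- (T_EF - (G - 2)) / sqrt(2 * (G - 2))   # standardized statistic
p_value <- pnorm(Z_EF, lower.tail = FALSE)     # upper-tail p-value under N(0, 1)

# Z_EF is -1.25 and p_value is about 0.894, in line with print(result) above.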

Comparing with Different Group Numbers

The number of groups \(G\) can affect the test’s performance:

# Test with different numbers of groups
group_sizes <- c(4, 8, 10, 15, 20)
results <- data.frame(
  Groups = group_sizes,
  P_value = sapply(group_sizes, function(g) {
    ef.gof(y, predicted_probs, G = g)$p_value
  })
)
print(results)
#>   Groups   P_value
#> 1      4 0.7597797
#> 2      8 0.3993666
#> 3     10 0.8944537
#> 4     15 0.7542151
#> 5     20 0.3920783

Comparison with Hosmer-Lemeshow Test

Let’s compare the Ebrahim-Farrington test with the traditional Hosmer-Lemeshow test:

# Hosmer-Lemeshow test (requires ResourceSelection package)
if (requireNamespace("ResourceSelection", quietly = TRUE)) {
  library(ResourceSelection)
  
  # Perform both tests
  ef_result <- ef.gof(y, predicted_probs, G = 10)
  hl_result <- hoslem.test(y, predicted_probs, g = 10)
  
  # Compare results
  comparison <- data.frame(
    Test = c("Ebrahim-Farrington", "Hosmer-Lemeshow"),
    P_value = c(ef_result$p_value, hl_result$p.value),
    Test_Statistic = c(ef_result$Test_Statistic, hl_result$statistic)
  )
  print(comparison)
} else {
  cat("ResourceSelection package not available for comparison\n")
}
#> ResourceSelection 0.3-6   2023-06-27
#>                         Test   P_value Test_Statistic
#>           Ebrahim-Farrington 0.8944537      -1.250567
#> X-squared    Hosmer-Lemeshow 0.9431075       2.855296

Power Analysis Example

Let’s examine the power of the test to detect model misspecification:

# Function to simulate power under model misspecification
simulate_power <- function(n, beta_quad = 0.1, n_sims = 100, G = 10) {
  rejections_ef <- 0
  rejections_hl <- 0
  
  for (i in 1:n_sims) {
    # Generate data with quadratic term (true model)
    x <- runif(n, -2, 2)
    linpred_true <- 0 + x + beta_quad * x^2
    prob_true <- plogis(linpred_true)
    y <- rbinom(n, 1, prob_true)
    
    # Fit misspecified linear model (omitting quadratic term)
    model_mis <- glm(y ~ x, family = binomial())
    pred_probs <- fitted(model_mis)
    
    # Ebrahim-Farrington test
    ef_test <- ef.gof(y, pred_probs, G = G)
    if (ef_test$p_value < 0.05) rejections_ef <- rejections_ef + 1
    
    # Hosmer-Lemeshow test (if available)
    if (requireNamespace("ResourceSelection", quietly = TRUE)) {
      hl_test <- ResourceSelection::hoslem.test(y, pred_probs, g = G)
      if (hl_test$p.value < 0.05) rejections_hl <- rejections_hl + 1
    }
  }
  
  power_ef <- rejections_ef / n_sims
  power_hl <- if (requireNamespace("ResourceSelection", quietly = TRUE)) {
    rejections_hl / n_sims
  } else {
    NA
  }
  
  return(list(power_ef = power_ef, power_hl = power_hl))
}

# Calculate power for different sample sizes
sample_sizes <- c(200, 500, 1000)
power_results <- data.frame(
  n = sample_sizes,
  EbrahimFarrington_Power = sapply(sample_sizes, function(n) {
    simulate_power(n, beta_quad = 0.15, n_sims = 50)$power_ef
  })
)

if (requireNamespace("ResourceSelection", quietly = TRUE)) {
  power_results$HosmerLemeshow_Power <- sapply(sample_sizes, function(n) {
    simulate_power(n, beta_quad = 0.15, n_sims = 50)$power_hl
  })
}

print(power_results)
#>      n EbrahimFarrington_Power HosmerLemeshow_Power
#> 1  200                    0.08                 0.12
#> 2  500                    0.20                 0.14
#> 3 1000                    0.26                 0.22

Handling Grouped Data (Original Farrington Test)

For datasets with grouped observations (multiple trials per covariate pattern), you can use the original Farrington test:

# Simulate grouped data
set.seed(456)
n_groups <- 30
m_trials <- sample(5:20, n_groups, replace = TRUE)
x_grouped <- rnorm(n_groups)
prob_grouped <- plogis(0.2 + 0.8 * x_grouped)
y_grouped <- rbinom(n_groups, m_trials, prob_grouped)

# Create data frame and fit model
data_grouped <- data.frame(
  successes = y_grouped,
  trials = m_trials,
  x = x_grouped
)

model_grouped <- glm(
  cbind(successes, trials - successes) ~ x,
  data = data_grouped,
  family = binomial()
)

predicted_probs_grouped <- fitted(model_grouped)

# Original Farrington test for grouped data
result_grouped <- ef.gof(
  y_grouped,
  predicted_probs_grouped,
  model = model_grouped,
  m = m_trials,
  G = NULL  # No automatic grouping for original test
)

print(result_grouped)
#>                  Test Test_Statistic   p_value
#> 1 Farrington-Original      -1.476122 0.9300444

Practical Guidelines

When to Use Each Test Mode

  1. Ebrahim-Farrington mode (G specified):
    • Binary response data (0/1)
    • Want automatic grouping
    • Computationally efficient
    • Recommended for most applications
  2. Original Farrington mode (m provided, G = NULL):
    • Grouped binomial data
    • Multiple trials per covariate pattern
    • Requires a fitted model object; both calling conventions are sketched below

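For quick reference, the sketch below places the two calling conventions side by side. It simply reuses the calls and objects from the examples above (y, predicted_probs, y_grouped, predicted_probs_grouped, model_grouped, m_trials).

# Ebrahim-Farrington mode: binary 0/1 responses, automatic grouping into G groups
ef.gof(y, predicted_probs, G = 10)

# Original Farrington mode: grouped binomial data, fitted model object and
# per-pattern trial counts supplied, no automatic grouping
ef.gof(
  y_grouped,
  predicted_probs_grouped,
  model = model_grouped,
  m = m_trials,
  G = NULL
)
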
Choosing the Number of Groups

The examples above use G = 10, mirroring the decile-style grouping familiar from the Hosmer-Lemeshow test. Because the p-value can change with the number of groups, it is worth checking that the conclusion is stable across a few reasonable choices of G and that each group contains enough observations for the group-level comparison to be meaningful.

Interpreting Results

The null hypothesis is that the model fits the data adequately. A small p-value (for example, below 0.05) is evidence of lack of fit, while a large p-value means the test found no evidence against the fitted model; it does not prove that the model is correct.

Advantages over Hosmer-Lemeshow Test

  1. Better Power: More sensitive to model misspecification
  2. Theoretical Foundation: Based on rigorous asymptotic theory
  3. Sparse Data Handling: Specifically designed for fully sparse data
  4. Computational Efficiency: Simplified calculations for binary data

Limitations and Considerations

  1. Group Selection: Results can vary with different numbers of groups
  2. Sample Size: More reliable with larger sample sizes (n ≥ 100)
  3. Model Complexity: Performance with highly complex models needs further study

References

  1. Farrington, C. P. (1996). On Assessing Goodness of Fit of Generalized Linear Models to Sparse Data. Journal of the Royal Statistical Society. Series B (Methodological), 58(2), 349-360.

  2. Ebrahim, Khaled Ebrahim (2024). Goodness-of-Fits Tests and Calibration Machine Learning Algorithms for Logistic Regression Model with Sparse Data. Master’s Thesis, Alexandria University.

  3. Hosmer, D. W., & Lemeshow, S. (2000). Applied Logistic Regression, Second Edition. New York: Wiley.

  4. Hosmer, D. W., & Lemeshow, S. (1980). A goodness-of-fit test for the multiple logistic regression model. Communications in Statistics - Theory and Methods, 9(10), 1043–1069. https://doi.org/10.1080/03610928008827941

The Ebrahim-Farrington test provides a powerful and practical tool for assessing goodness-of-fit in logistic regression, particularly for binary data and sparse datasets. Its simplified implementation makes it accessible for routine use while maintaining strong theoretical foundations.