The package discretefit implements fast Monte Carlo simulations for goodness-of-fit (GOF) tests for discrete distributions. This includes tests based on the root-mean-square statistic, the Chi-squared statistic, the log-likelihood-ratio (\(G^2\)) statistic, and the Kolmogorov-Smirnov statistic.
Simulations are written in C++ (utilizing Rcpp) and are much faster than the simulated Chi-squared GOF test in the R stats package and the simulated Kolmogorov-Smirnov GOF test in the dgof package.
The GOF tests in discretefit operate on a vector of counts, x, and a vector of probabilities, p. In the example below, x represents a vector of counts for five categories, and p represents a vector of probabilities for each corresponding category. The GOF tests provide p-values for the null hypothesis that x is a random sample from the discrete distribution defined by p.
library(discretefit)
library(bench)
x <- c(42, 0, 13, 2, 109)
p <- c(0.2, 0.05, 0.1, 0.05, 0.6)

pp <- c(rep(1, 4),
        rep(2, 1),
        rep(3, 2),
        rep(4, 1),
        rep(5, 12))
chisq_gof(x, p)
#> [1] 0.00259974
rms_gof(x, p)
#> [1] 0.03749625
g_gof(x, p)
#> [1] 0.00019998
ks_gof(x, p)
#> [1] 0.2459754
The simulated Chi-squared GOF test in discretefit produces identical answers to the simulated Chi-squared GOF test in the stats package that is part of base R.
set.seed(499)
chisq_gof(x, p, reps = 2000)
#> [1] 0.002998501
set.seed(499)
chisq.test(x, p = p, simulate.p.value = TRUE)$p.value
#> [1] 0.002998501
However, because Monte Carlo simulations in discretefit are implemented in C++, chisq_gof is much faster than chisq.test, especially when a large number of simulations are required.
bench::system_time(
  chisq_gof(x, p, reps = 20000)
)
#> process real
#> 203ms 198ms
bench::system_time(
  chisq.test(x, p = p, simulate.p.value = TRUE, B = 20000)
)
#> process real
#> 2.94s 3.01s
The ks_gof function in discretefit is also faster than the simulated Kolmogorov-Smirnov test in the dgof package. (The ks.test function in the stats package in base R does not include a simulated test for discrete distributions.)
The p-values produced by ks_gof, however, are not exactly identical to those produced by ks.test in the dgof package because of slight variations in the algorithms. (One variation relates to the equation for calculating p-values discussed below. Another variation relates to how the two algorithms access R’s random number generator.)
x <- c(114, 118, 112, 158)
y <- c(1, 2, 3, 4, 5, 5)
p <- c(0.2, 0.2, 0.2, 0.4)

bench::system_time(
  ks_gof(x, p, reps = 20000)
)
#> process real
#> 609ms 626ms
bench::system_time(
  dgof::ks.test(x, ecdf(y), simulate.p.value = TRUE, B = 20000)
)
#> process real
#> 5.31s 5.36s
Additionally, the simulated GOF tests in base R and the dgof package are vectorized, so for large vectors a large number of simulations may not be possible because of memory constraints. Since the functions in discretefit are not vectorized, memory use is minimized.
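To give a rough sense of scale, the sketch below estimates the memory needed if a vectorized implementation holds every simulated observation in a single integer matrix, with one row per observation and one column per simulation. That storage layout, the 4-byte integer size, and the helper name `sim_matrix_gib` are assumptions made for illustration; they are not a description of the internals of the stats or dgof packages.

```r
# Hypothetical helper: approximate size, in GiB, of an n x reps matrix
# of simulated observations stored as 4-byte integers (an assumption).
sim_matrix_gib <- function(n, reps, bytes_per_value = 4) {
  n * reps * bytes_per_value / 2^30
}

# One million observations and 20,000 simulations
sim_matrix_gib(n = 1e6, reps = 20000)
```

Under these assumptions the matrix alone would need roughly 75 GiB, whereas a loop that draws and discards one simulated sample at a time only ever holds a single sample in memory.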
In a surprising number of cases, a simulated GOF test based on the root-mean-square statistic outperforms the Chi-squared test and other tests in the Cressie-Read power divergence family. This has been demonstrated by Perkins, Tygert, and Ward (2011). They provide the following toy example.
Take a discrete distribution with 50 bins (or categories). The probability for the first and second bin is 0.25. The probability for each of the remaining 48 bins is 0.5 / 48 (~0.0104).
Now take the observed counts of 15 for the first bin, 5 for the second bin, and zero for each of the remaining 48 bins. It’s obvious that these observations are very unlikely to occur for a random sample from the above distribution. However, the Chi-squared test and the \(G^2\) test fail to reject the null hypothesis.
x <- c(15, 5, rep(0, 48))
p <- c(0.25, 0.25, rep(1/(2 * 50 - 4), 48))
chisq_gof(x, p)
#> [1] 0.9716028
g_gof(x, p)
#> [1] 0.6643336
By contrast, the root-mean-square test convincingly rejects the null hypothesis.
rms_gof(x, p)
#> [1] 9.999e-05
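The contrast between the tests can be reproduced with a slow but transparent sketch in base R. The helpers `sim_gof`, `rms_stat`, and `chisq_stat` are hypothetical names introduced here and are not part of discretefit. The scaling chosen for the root-mean-square statistic is also an assumption; any monotone rescaling gives the same simulated p-value. Because the random number stream is used differently, the results will not match the package output exactly.

```r
# Hypothetical helper: simulated GOF p-value for a given test statistic.
# Draws one multinomial sample at a time under the null hypothesis.
sim_gof <- function(x, p, stat, reps = 2000) {
  n <- sum(x)
  observed <- stat(x, n, p)
  simulated <- vapply(
    seq_len(reps),
    function(i) stat(rmultinom(1, size = n, prob = p)[, 1], n, p),
    numeric(1)
  )
  b <- sum(simulated >= observed)
  (b + 1) / (reps + 1)  # p-value formula of Dwass (1957)
}

# Root-mean-square distance between observed and expected proportions
rms_stat <- function(x, n, p) sqrt(mean((x / n - p)^2))

# Pearson Chi-squared statistic
chisq_stat <- function(x, n, p) sum((x - n * p)^2 / (n * p))

x <- c(15, 5, rep(0, 48))
p <- c(0.25, 0.25, rep(1 / (2 * 50 - 4), 48))

set.seed(1)
sim_gof(x, p, rms_stat)    # very small
sim_gof(x, p, chisq_stat)  # large
```

The Chi-squared statistic divides each squared deviation by a tiny expected count, so a simulated sample that puts even a single observation in one of the 48 sparse bins gets a large contribution from that bin. Most simulated samples therefore exceed the observed statistic, which hides the large deviation in the first bin. The root-mean-square statistic weights all bins equally and is dominated by that deviation.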
For additional examples, also see Perkins, Tygert, and Ward (2011) and Ward and Carroll (2014).
All p-values calculated by discretefit follow the formula for “exact” p-values proposed by Dwass (1957). In the equation below, let m represent the number of simulations, let B represent the number of simulations in which the test statistic for the simulated data is greater than or equal to the test statistic for the observed data, and let b represent the observed value of B.
\[{p}_{u} = P(B \leq b) = \frac{b + 1}{m + 1}\]
This is the equation used to calculate simulated p-values in the stats package, but the dgof package uses the unbiased estimator, \(\frac{B}{m}\). For an explanation of why the biased estimator yields a test of the correct size and the unbiased estimator does not, see Phipson and Smyth (2010).
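The practical difference between the two estimators is clearest when no simulated statistic reaches the observed one. The sketch below is a minimal illustration in base R; the helper names `p_biased` and `p_unbiased` are hypothetical and are not functions from discretefit, stats, or dgof.

```r
# b: number of simulated statistics >= the observed statistic
# m: number of simulations
p_biased   <- function(b, m) (b + 1) / (m + 1)  # Dwass (1957)
p_unbiased <- function(b, m) b / m

m <- 10000
b <- 0  # no simulated statistic reached the observed one

p_biased(b, m)    # 1 / 10001, never exactly zero
p_unbiased(b, m)  # exactly zero
```

A p-value of exactly zero would reject the null hypothesis at any significance level, even though only a finite number of simulations were run. The biased estimator is bounded below by 1 / (m + 1), so the smallest reportable p-value shrinks only as more simulations are performed.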
As noted above, the stats package in base R implements a simulated Chi-squared GOF test, and the dgof package implements a simulated Kolmogorov-Smirnov GOF test.
I’m not aware of an R package that implements a simulated \(G^2\) GOF test, but the packages RVAideMemoire and DescTools implement GOF tests that utilize approximations based on the Chi-squared distribution.
I’m not aware of another R package that implements a root-mean-square GOF test.
Dwass, Meyer. “Modified randomization tests for nonparametric hypotheses.” Annals of Mathematical Statistics, 1957. https://doi.org/10.1214/aoms/1177707045
Eddelbuettel, Dirk and Romain Francois. “Rcpp: Seamless R and C++ Integration.” Journal of Statistical Software, 2011. https://www.jstatsoft.org/article/view/v040i08
Perkins, William, Mark Tygert, and Rachel Ward. “Computing the confidence levels for a root-mean-square test of goodness-of-fit.” Applied Mathematics and Computation, 2011. https://doi.org/10.1016/j.amc.2011.03.124
Phipson, Belinda, and Gordon K. Smyth. “Permutation p-values should never be zero: calculating exact p-values when permutations are randomly drawn.” Statistical Applications in Genetics and Molecular Biology, 2010. https://dx.doi.org/10.2202/1544-6115.1585
Ward, Rachel and Raymond J. Carroll. “Testing Hardy–Weinberg equilibrium with a simple root-mean-square statistic.” Biostatistics, 2014. https://doi.org/10.1093/biostatistics/kxt028