Introduction
The summclust package computes leverage statistics for clustered errors and fast CRV3(J) variance-covariance matrices, as described in MacKinnon, J.G., Nielsen, M.Ø., Webb, M.D., 2022. Leverage, influence, and the jackknife in clustered regression models: Reliable inference using summclust.
It is a post-estimation command and currently supports objects of type lm (from stats) and fixest (from the fixest package).
CRV 1-3 Cluster Robust Variance Estimators and Jackknife formulations
summclust handles cluster robust variance estimation of linear regression models of the form
\[\begin{equation} y = \begin{bmatrix} y_{1} \\ y_{2} \\ ...\\ y_{G} \end{bmatrix} = X\beta + u = \begin{bmatrix} X_{1} \\ X_{2} \\ ...\\ X_{G} \end{bmatrix} \beta + \begin{bmatrix} u_{1} \\ u_{2} \\ ...\\ u_{G} \end{bmatrix}, \end{equation}\]
where group \(g\) contains \(N_{g}\) observations, so that \(N = \sum_{g = 1}^{G} N_{g}\), and \(E(u|X) = 0\). The error terms \(u\) are allowed to be correlated within clusters, but are assumed to be uncorrelated across clusters. In consequence, the model's covariance matrix is block diagonal. For each cluster, we denote \(E(u_{g} u_{g}') = \Omega_{g}\).
The literature on cluster robust inference has proposed three different estimators, which all follow the same ‘sandwich’ structure
\[\begin{equation} (X'X)^{-1} (\sum_{g=1}^{G} \Sigma_{g} ) (X'X)^{-1}. \end{equation}\]
The three different types of CRV estimators depend on how \(\Sigma_{g}\) is estimated.
The most common cluster robust estimator, the CRV1 estimator, is defined as
\[\begin{equation} CRV1: \hat{V}_{1}(\hat{\beta}) = m (X'X)^{-1} (\sum_{g=1}^{G} s_{g} s_{g}') (X'X)^{-1}. \end{equation}\]
where \(s_g = X_{g}'\hat{u}_{g}\).
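The CRV1 formula can be traced step by step in base R. The following is a minimal sketch on simulated data, not the package's implementation: all variable names are illustrative, and the small-sample factor \(m\) used here is an assumption (a commonly used choice), since the text above does not spell it out for CRV1.

```r
# Simulated clustered data; every name below is illustrative.
set.seed(1)
G   <- 10                                  # number of clusters
N_g <- 20                                  # observations per cluster
cl  <- rep(seq_len(G), each = N_g)         # cluster id of each observation
x   <- rnorm(G * N_g)
u   <- rnorm(G)[cl] + rnorm(G * N_g)       # errors correlated within clusters
y   <- 1 + 0.5 * x + u

X <- cbind(intercept = 1, x = x)
N <- nrow(X)
k <- ncol(X)

bread    <- solve(crossprod(X))            # (X'X)^{-1}
beta_hat <- bread %*% crossprod(X, y)      # OLS estimate
u_hat    <- as.vector(y - X %*% beta_hat)  # OLS residuals

# Row g of S is the transposed cluster score s_g' = (X_g' u_hat_g)'
S    <- rowsum(X * u_hat, group = cl)
meat <- crossprod(S)                       # sum_g s_g s_g'

# ASSUMPTION: m = G/(G-1) * (N-1)/(N-k), a common small-sample factor
m  <- G / (G - 1) * (N - 1) / (N - k)
V1 <- m * bread %*% meat %*% bread

sqrt(diag(V1))                             # CRV1 standard errors
```

Using rowsum() avoids an explicit loop over clusters: multiplying each row of X by its residual and summing within clusters yields all \(s_g\) at once.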
The CRV2 estimator is computed as
\[\begin{equation} CRV2: \hat{V}_{2}(\hat{\beta}) = (X'X)^{-1} (\sum_{g=1}^{G} s^{2}_{g} s^{2}_{g}') (X'X)^{-1}. \end{equation}\]
where \(s^{2}_g = X_{g}' M_{gg}^{-1/2} \hat{u}_{g}\).
\(M_{gg}\) is defined as …
Last, the CRV3 estimator is defined as
\[\begin{equation} CRV3: \hat{V}_{3}(\hat{\beta}) = m (X'X)^{-1} (\sum_{g=1}^{G} s^{3}_{g} s^{3}_{g}') (X'X)^{-1}. \end{equation}\]
where \(s^{3}_{g} = X_{g}' M_{gg}^{-1} \hat{u}_{g}\) and \(m = G/(G-1)\).
Building on work by Niccodemi and …, MacKinnon, Nielsen and Webb (MNW) show that the CRV3 estimator can be computed as a jackknife estimator.
First, let’s define \(\hat{\beta}^{(g)}\), the OLS estimate of (1) when cluster g is omitted:
\[\begin{equation} \hat{\beta}^{(g)} = (X'X - X_{g}'X_{g})^{-1} (X'y - X_{g}'y_{g}), \quad g = 1, \ldots, G. \end{equation}\]
MNW show that the CRV3 estimator is equivalent to computing
\[\begin{equation} \hat{V}_{3}(\hat{\beta}) = \frac{G}{G-1} \sum_{g = 1}^{G} (\hat{\beta}^{(g)} - \hat{\beta}) (\hat{\beta}^{(g)} - \hat{\beta})' \end{equation}\]
They further propose the following jackknife estimator, CRV3J:
\[\begin{equation} \hat{V}_{3J}(\hat{\beta}) = \frac{G}{G-1} \sum_{g = 1}^{G} (\hat{\beta}^{(g)} - \bar{\beta}) (\hat{\beta}^{(g)} - \bar{\beta})' \end{equation}\]
with \(\bar{\beta} = G^{-1} \sum_{g=1}^{G} \hat{\beta}^{(g)}\).
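The jackknife formulation above can be sketched in a few lines of base R. This is an illustration on simulated data under stated assumptions, not the code summclust runs internally: the variable names and the helper jack_var() are hypothetical, the scaling factor is the \(G/(G-1)\) shown in the formulas above, and each \(\hat{\beta}^{(g)}\) is obtained from the cluster-level cross-products \(X_{g}'X_{g}\) and \(X_{g}'y_{g}\) rather than by refitting the model.

```r
# Simulated clustered data; every name below is illustrative.
set.seed(42)
G   <- 8
N_g <- 15
cl  <- rep(seq_len(G), each = N_g)
x   <- rnorm(G * N_g)
y   <- 1 + 0.5 * x + rnorm(G)[cl] + rnorm(G * N_g)
X   <- cbind(intercept = 1, x = x)

XtX      <- crossprod(X)                   # X'X
Xty      <- crossprod(X, y)                # X'y
beta_hat <- drop(solve(XtX, Xty))          # full-sample OLS estimate

# Column g holds beta^(g), the estimate with cluster g left out.
# Only the cross-products of cluster g are subtracted; no refit is needed.
beta_loo <- sapply(seq_len(G), function(g) {
  X_g <- X[cl == g, , drop = FALSE]
  y_g <- y[cl == g]
  drop(solve(XtX - crossprod(X_g), Xty - crossprod(X_g, y_g)))
})

# Sanity check for one cluster against a brute-force refit
refit_1 <- lm.fit(X[cl != 1, , drop = FALSE], y[cl != 1])$coefficients
all.equal(unname(refit_1), unname(beta_loo[, 1]))

# Hypothetical helper: scaled sum of outer products around `center`
jack_var <- function(B, center) {
  D <- B - center                          # column g is beta^(g) - center
  (G / (G - 1)) * tcrossprod(D)
}

V3  <- jack_var(beta_loo, beta_hat)            # centered at beta_hat
V3J <- jack_var(beta_loo, rowMeans(beta_loo))  # centered at beta_bar

cbind(se_CRV3 = sqrt(diag(V3)), se_CRV3J = sqrt(diag(V3J)))
```

Because the sum of squared deviations is smallest around the mean, the CRV3J variances on the diagonal can never exceed the corresponding CRV3 variances; the two coincide when \(\bar{\beta} = \hat{\beta}\).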
Both estimators can be computed very quickly (as long as the number of clusters does not get too large), and both are implemented in summclust.
The summclust function
library(summclust)
library(lmtest)
library(haven)
nlswork <- read_dta("http://www.stata-press.com/data/r9/nlswork.dta")
# drop NAs at the moment
nlswork <- nlswork[, c("ln_wage", "grade", "age", "birth_yr", "union", "race", "msp", "ind_code")]
nlswork <- na.omit(nlswork)
lm_fit <- lm(
  ln_wage ~ as.factor(grade) + as.factor(age) + as.factor(birth_yr) + union + race + msp,
  data = nlswork
)
summclust_res <- summclust(
  obj = lm_fit,
  cluster = ~ind_code,
  type = "CRV3"
)
# CRV3-based inference - exactly matches output of summclust-stata
coeftable(summclust_res, param = c("msp", "union"))
#>             coef     tstat         se      p_val  conf_int_l  conf_int_u
#> union  0.2039597  2.440122 0.08358587 0.03281561  0.01998847 0.387930980
#> msp   -0.0275151 -1.956404 0.01406412 0.07628064 -0.05847002 0.003439815
summary(summclust_res, param = c("msp","union"))
#>             coef     tstat         se      p_val  conf_int_l  conf_int_u
#> union  0.2039597  2.440122 0.08358587 0.03281561  0.01998847 0.387930980
#> msp   -0.0275151 -1.956404 0.01406412 0.07628064 -0.05847002 0.003439815
#>
#>            leverage partial-leverage-msp partial-leverage-union    beta-msp
#> Min.     0.09332052          0.001622359           0.0006662968 -0.03320040
#> 1st Qu.  0.70440923          0.009133996           0.0048899422 -0.02893131
#> Median   3.51549151          0.056682344           0.0379535242 -0.02776470
#> Mean     5.41666667          0.083333333           0.0833333333 -0.02691999
#> 3rd Qu.  6.41132962          0.106083114           0.1004277711 -0.02610221
#> Max.    20.28918187          0.312994532           0.3597669210 -0.01583453
#>         beta-union
#> Min.     0.1624754
#> 1st Qu.  0.1994694
#> Median   0.2045197
#> Mean     0.2053997
#> 3rd Qu.  0.2056569
#> Max.     0.2754228
To visually inspect the leverage statistics, use the plot method:
[Plots not shown: panels $coef_leverage and $coef_beta]
Using summclust with lmtest and fixest
Note that you can also use CRV3 and CRV3J covariance matrices computed via summclust with the lmtest and fixest packages.
library(lmtest)
library(fixest)
df <- length(summclust_res$cluster) - 1
# with lmtest
CRV1 <- coeftest(lm_fit, sandwich::vcovCL(lm_fit, ~ind_code), df = df)
CRV3 <- coeftest(lm_fit, summclust_res$vcov, df = df)
CRV1[c("union", "race", "msp"),]
#>          Estimate  Std. Error   t value     Pr(>|t|)
#> union  0.20395972 0.061167499  3.334446 0.0066585766
#> race  -0.08619813 0.016150418 -5.337207 0.0002384275
#> msp   -0.02751510 0.009293046 -2.960827 0.0129561148
CRV3[c("union", "race", "msp"),]
#>          Estimate Std. Error   t value    Pr(>|t|)
#> union  0.20395972 0.08358587  2.440122 0.032815614
#> race  -0.08619813 0.01904684 -4.525586 0.000864074
#> msp   -0.02751510 0.01406412 -1.956404 0.076280639
confint(CRV1)[c("union", "race", "msp"),]
#>             2.5 %       97.5 %
#> union  0.06933097  0.338588481
#> race  -0.12174496 -0.050651302
#> msp   -0.04796896 -0.007061245
confint(CRV3)[c("union", "race", "msp"),]
#>             2.5 %       97.5 %
#> union  0.01998847  0.387930980
#> race  -0.12811995 -0.044276312
#> msp   -0.05847002  0.003439815
# with fixest
feols_fit <- feols(
  ln_wage ~ as.factor(grade) + as.factor(age) + as.factor(birth_yr) + union + race + msp,
  data = nlswork
)
fixest::coeftable(
  feols_fit,
  vcov = summclust_res$vcov,
  ssc = ssc(adj = FALSE, cluster.adj = FALSE)
)[c("msp", "union", "race"),]
#>          Estimate Std. Error   t value     Pr(>|t|)
#> msp   -0.02751510 0.01406412 -1.956404 5.043213e-02
#> union  0.20395972 0.08358587  2.440122 1.469134e-02
#> race  -0.08619813 0.01904684 -4.525586 6.059226e-06
The p-values and confidence intervals from fixest::coeftable() differ from those of lmtest::coeftest() and summclust::coeftable(). This is because fixest::coeftable() uses different degrees of freedom for the t-distribution in these calculations (I believe it uses t(N-1)).
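The effect of the degrees of freedom can be checked by hand from the reported t-statistics. The sketch below rests on two assumptions: that there are 12 clusters (inferred from the mean partial leverage of 0.0833 = 1/12 in the summary output above), and that a t-distribution with roughly N - 1 degrees of freedom is, for a sample of this size, practically indistinguishable from the standard normal, which is used here as a stand-in.

```r
# t-statistics copied from the CRV3 output above
t_stat <- c(msp = -1.956404, union = 2.440122)

# ASSUMPTION: 12 clusters, inferred from mean partial leverage = 1/12
G <- 12

# Two-sided p-values under t(G - 1), as used with lmtest::coeftest() above
p_cluster_df <- 2 * pt(-abs(t_stat), df = G - 1)

# ASSUMPTION: normal limit as a stand-in for a t-distribution with very many
# degrees of freedom (such as t(N - 1) in a large sample)
p_large_df <- 2 * pnorm(-abs(t_stat))

cbind(p_cluster_df, p_large_df)
```

With only G - 1 degrees of freedom the t-distribution has much heavier tails, so the same t-statistic yields a noticeably larger p-value and a wider confidence interval.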