Introduction
The summclust package computes leverage
statistics for clustered errors and fast CRV3(J)
variance-covariance matrices, as described in MacKinnon, J.G., Nielsen, M.Ø.,
Webb, M.D., 2022. Leverage, influence, and the jackknife in clustered
regression models: Reliable inference using summclust.
It is a post-estimation command and currently supports methods for
objects of type lm (from stats) and
fixest (from the fixest package).
CRV 1-3 Cluster Robust Variance Estimators and Jackknife formulations
summclust handles cluster robust variance estimation of
linear regression models of the form
\[\begin{equation} y = \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{G} \end{bmatrix} = X\beta + u = \begin{bmatrix} X_{1} \\ X_{2} \\ \vdots \\ X_{G} \end{bmatrix} \beta + \begin{bmatrix} u_{1} \\ u_{2} \\ \vdots \\ u_{G} \end{bmatrix}, \end{equation}\]
where group \(g\) contains \(N_{g}\) observations so that \(N = \sum_{g = 1}^{G} N_{g}\), and \(E(u|X) = 0\). The error terms \(u\) are allowed to be correlated within clusters, but are assumed to be uncorrelated across clusters. In consequence, the model's covariance matrix is block diagonal. For each cluster, we denote \(E(u_{g} u_{g}') =\Omega_{g}\).
The literature on cluster robust inference has proposed three different estimators, which all follow the same ‘sandwich’ structure
\[\begin{equation} (X'X)^{-1} (\sum_{g=1}^{G} \Sigma_{g} ) (X'X)^{-1}. \end{equation}\]
The three different types of CRV estimators depend on how \(\Sigma_{g}\) is estimated.
The most common cluster robust estimator, the CRV1 estimator, is defined as
\[\begin{equation} CRV1: \hat{V}_{1}(\hat{\beta}) = m (X'X)^{-1} (\sum_{g=1}^{G} s_{g} s_{g}') (X'X)^{-1}, \end{equation}\]
where \(s_g = X_{g}'\hat{u}_{g}\) and \(m = \frac{G(N-1)}{(G-1)(N-k)}\) is a small-sample correction, with \(k\) the number of regressors.
The CRV2 estimator is computed as
\[\begin{equation} CRV2: \hat{V}_{2}(\hat{\beta}) = (X'X)^{-1} (\sum_{g=1}^{G} s^{2}_{g} s^{2}_{g}') (X'X)^{-1}, \end{equation}\]
where \(s^{2}_g = X_{g}' M_{gg}^{-1/2} \hat{u}_{g}\).
\(M_{gg}\) is defined as …
Last, the CRV3 estimator is defined as
\[\begin{equation} CRV3: \hat{V}_{3}(\hat{\beta}) = \frac{G-1}{G} (X'X)^{-1} (\sum_{g=1}^{G} s^{3}_{g} s^{3}_{g}') (X'X)^{-1}, \end{equation}\]
with \(s^{3}_{g} = X_{g}' M_{gg}^{-1} \hat{u}_{g}\).
Building on work by Niccodemi and … MacKinnon, Nielsen and Webb (MNW) show that the CRV3 estimator can be computed as a jackknife estimator.
First, define \(\hat{\beta}^{(g)}\), the OLS estimate of (1) when cluster \(g\) is omitted:
\[\begin{equation} \hat{\beta}^{(g)} = (X'X - X_{g}'X_{g})^{-1}(X'y - X_{g}'y_{g}), \quad g = 1, \ldots, G. \end{equation}\]
MNW show that the CRV3 estimator is equivalent to
computing
\[\begin{equation} \hat{V}_{3}(\hat{\beta}) = \frac{G-1}{G} \sum_{g = 1}^{G} (\hat{\beta}^{(g)} - \hat{\beta}) (\hat{\beta}^{(g)} - \hat{\beta})'. \end{equation}\]
They further propose the following jackknife estimator, CRV3J:
\[\begin{equation} \hat{V}_{3J}(\hat{\beta}) = \frac{G-1}{G} \sum_{g = 1}^{G} (\hat{\beta}^{(g)} - \bar{\beta}) (\hat{\beta}^{(g)} - \bar{\beta})', \end{equation}\]
with \(\bar{\beta} = G^{-1} \sum_{g=1}^{G} \hat{\beta}^{(g)}\).
Both estimators can be computed very quickly (as long as the number
of clusters does not get too large), and both estimators are implemented
in summclust.
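The jackknife formulation can be sketched in a few lines of base R. This is a minimal illustration on simulated data, not the implementation used inside summclust: all variable names and sample sizes are made up for the example, and the common scaling factor is set to the usual jackknife value \((G-1)/G\) as an assumption. The sketch computes each leave-one-cluster-out estimate from the cross-products \(X'X - X_{g}'X_{g}\) and \(X'y - X_{g}'y_{g}\), checks it against a brute-force refit, and then forms the two jackknife variance matrices.

```r
# Simulated clustered data; every name and size below is illustrative.
set.seed(1)
G   <- 8                                   # number of clusters
N_g <- 25                                  # observations per cluster
cl  <- rep(seq_len(G), each = N_g)         # cluster id for each observation
x   <- rnorm(G * N_g)
y   <- 1 + 0.5 * x + rnorm(G)[cl] + rnorm(G * N_g)  # adds a cluster-level shock
X   <- cbind(1, x)                         # design matrix with intercept

XtX      <- crossprod(X)                   # X'X
Xty      <- crossprod(X, y)                # X'y
beta_hat <- as.vector(solve(XtX, Xty))     # full-sample OLS estimate

# Leave-one-cluster-out estimates from the cross-products:
# beta^(g) = (X'X - X_g'X_g)^{-1} (X'y - X_g'y_g)
beta_loo <- sapply(seq_len(G), function(g) {
  Xg <- X[cl == g, , drop = FALSE]
  yg <- y[cl == g]
  as.vector(solve(XtX - crossprod(Xg), Xty - crossprod(Xg, yg)))
})                                         # k x G matrix, column g omits cluster g

# Brute-force check: refit the regression without cluster g
beta_refit <- sapply(seq_len(G), function(g) coef(lm(y ~ x, subset = cl != g)))
max_diff   <- max(abs(beta_loo - beta_refit))  # numerically zero

m3      <- (G - 1) / G                     # assumed jackknife scaling factor
dev_hat <- beta_loo - beta_hat             # centred at the full-sample estimate
dev_bar <- beta_loo - rowMeans(beta_loo)   # centred at the jackknife mean
V3  <- m3 * tcrossprod(dev_hat)            # CRV3
V3J <- m3 * tcrossprod(dev_bar)            # CRV3J

sqrt(diag(V3))                             # CRV3 standard errors
sqrt(diag(V3J))                            # CRV3J standard errors
```

Because the deviations from \(\hat{\beta}\) equal the deviations from \(\bar{\beta}\) plus the constant shift \(\bar{\beta} - \hat{\beta}\), the diagonal of `V3` is never smaller than the diagonal of `V3J` in this sketch. The cost is one small \(k \times k\) solve per cluster, which is why the computation stays cheap as long as the number of clusters is moderate.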
The summclust function
library(summclust)
library(lmtest)
library(haven)
nlswork <- read_dta("http://www.stata-press.com/data/r9/nlswork.dta")
# keep only the model variables and drop rows with missing values (for now)
nlswork <- nlswork[, c("ln_wage", "grade", "age", "birth_yr", "union", "race", "msp", "ind_code")]
nlswork <- na.omit(nlswork)
lm_fit <- lm(
  ln_wage ~ as.factor(grade) + as.factor(age) + as.factor(birth_yr) + union + race + msp,
  data = nlswork)
summclust_res <- summclust(
  obj = lm_fit,
  cluster = ~ind_code,
  type = "CRV3")
# CRV3-based inference - exactly matches output of summclust-stata
coeftable(summclust_res, param = c("msp", "union"))
#> coef tstat se p_val conf_int_l conf_int_u
#> union 0.2039597 2.440122 0.08358587 0.03281561 0.01998847 0.387930980
#> msp -0.0275151 -1.956404 0.01406412 0.07628064 -0.05847002 0.003439815
summary(summclust_res, param = c("msp","union"))
#> coef tstat se p_val conf_int_l conf_int_u
#> union 0.2039597 2.440122 0.08358587 0.03281561 0.01998847 0.387930980
#> msp -0.0275151 -1.956404 0.01406412 0.07628064 -0.05847002 0.003439815
#>
#> leverage partial-leverage-msp partial-leverage-union beta-msp
#> Min. 0.09332052 0.001622359 0.0006662968 -0.03320040
#> 1st Qu. 0.70440923 0.009133996 0.0048899422 -0.02893131
#> Median 3.51549151 0.056682344 0.0379535242 -0.02776470
#> Mean 5.41666667 0.083333333 0.0833333333 -0.02691999
#> 3rd Qu. 6.41132962 0.106083114 0.1004277711 -0.02610221
#> Max. 20.28918187 0.312994532 0.3597669210 -0.01583453
#> beta-union
#> Min. 0.1624754
#> 1st Qu. 0.1994694
#> Median 0.2045197
#> Mean 0.2053997
#> 3rd Qu. 0.2056569
#> Max. 0.2754228
To visually inspect the leverage statistics, use the
plot method:
plot(summclust_res)
#> $coef_leverage
[Figure: coef_leverage plot]
#> $coef_beta
[Figure: coef_beta plot]
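To help read the leverage column of the summary output, here is a small base R sketch on simulated data. It assumes the cluster-level leverage definition from MNW, \(L_{g} = \mathrm{tr}(X_{g}'X_{g}(X'X)^{-1})\), which is not spelled out in this document; the data and all variable names are hypothetical, and this is not the code path summclust itself uses.

```r
# Simulated design with unbalanced clusters; all names are illustrative.
set.seed(2)
G  <- 6
cl <- rep(seq_len(G), times = c(5, 10, 15, 20, 30, 60))  # cluster ids
N  <- length(cl)
X  <- cbind(1, rnorm(N), rnorm(N))        # intercept plus two regressors
k  <- ncol(X)
XtX_inv <- solve(crossprod(X))            # (X'X)^{-1}

# Assumed definition (MNW): L_g = tr(X_g'X_g (X'X)^{-1})
leverage <- sapply(seq_len(G), function(g) {
  Xg <- X[cl == g, , drop = FALSE]
  sum(diag(crossprod(Xg) %*% XtX_inv))
})

summary(leverage)   # same kind of five-number summary as in the output above
sum(leverage)       # the leverages add up to k
mean(leverage)      # so their mean is k / G
```

Under this definition the leverages sum to the number of coefficients, so the mean is \(k/G\). The mean leverage of 5.41666667 reported above is consistent with that reading, for example with 65 coefficients and 12 clusters, and clusters with leverage far above the mean are the ones that deserve a closer look.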

Using summclust with lmtest and fixest
Note that you can also use CRV3 and CRV3J covariance matrices
computed via summclust with the lmtest and
fixest packages.
library(lmtest)
library(fixest)
df <- length(summclust_res$cluster) - 1
# with lmtest
CRV1 <- coeftest(lm_fit, sandwich::vcovCL(lm_fit, ~ind_code), df = df)
CRV3 <- coeftest(lm_fit, summclust_res$vcov, df = df)
CRV1[c("union", "race", "msp"),]
#> Estimate Std. Error t value Pr(>|t|)
#> union 0.20395972 0.061167499 3.334446 0.0066585766
#> race -0.08619813 0.016150418 -5.337207 0.0002384275
#> msp -0.02751510 0.009293046 -2.960827 0.0129561148
CRV3[c("union", "race", "msp"),]
#> Estimate Std. Error t value Pr(>|t|)
#> union 0.20395972 0.08358587 2.440122 0.032815614
#> race -0.08619813 0.01904684 -4.525586 0.000864074
#> msp -0.02751510 0.01406412 -1.956404 0.076280639
confint(CRV1)[c("union", "race", "msp"),]
#> 2.5 % 97.5 %
#> union 0.06933097 0.338588481
#> race -0.12174496 -0.050651302
#> msp -0.04796896 -0.007061245
confint(CRV3)[c("union", "race", "msp"),]
#> 2.5 % 97.5 %
#> union 0.01998847 0.387930980
#> race -0.12811995 -0.044276312
#> msp -0.05847002 0.003439815
# with fixest
feols_fit <- feols(
  ln_wage ~ as.factor(grade) + as.factor(age) + as.factor(birth_yr) + union + race + msp,
  data = nlswork)
fixest::coeftable(
  feols_fit,
  vcov = summclust_res$vcov,
  ssc = ssc(adj = FALSE, cluster.adj = FALSE)
)[c("msp", "union", "race"),]
#> Estimate Std. Error t value Pr(>|t|)
#> msp -0.02751510 0.01406412 -1.956404 5.043213e-02
#> union 0.20395972 0.08358587 2.440122 1.469134e-02
#> race -0.08619813 0.01904684 -4.525586 6.059226e-06
The p-values and confidence intervals from
fixest::coeftable() differ from those of
lmtest::coeftest() and summclust::coeftable().
This is because fixest::coeftable() uses a
different degrees of freedom for the t-distribution in these
calculations (I believe it uses t(N-1)), rather than the number of clusters minus one that is passed to coeftest() above.
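The effect of the degrees of freedom can be checked by hand. The sketch below copies the CRV3 coefficient, standard error and t-statistic for msp from the tables above and recomputes the p-value and confidence interval with base R. Two values are assumptions of the sketch: `G <- 12` is inferred from the output rather than stated in the text, and `df_large` is a hypothetical stand-in for a degrees of freedom of the order of the sample size, not the value fixest actually uses.

```r
# Values copied from the CRV3 tables above (msp row)
b_msp  <- -0.0275151
se_msp <- 0.01406412
t_msp  <- -1.956404

G <- 12                                   # assumption: 12 clusters, so df = 11

# Cluster-based degrees of freedom, as passed to lmtest::coeftest() above
p_small  <- 2 * pt(-abs(t_msp), df = G - 1)
ci_small <- b_msp + c(-1, 1) * qt(0.975, df = G - 1) * se_msp

# Hypothetical large degrees of freedom, standing in for something like N - 1
df_large <- 1e4
p_large  <- 2 * pt(-abs(t_msp), df = df_large)

p_small    # about 0.076, matching the lmtest and summclust tables
ci_small   # about (-0.0585, 0.0034), matching confint(CRV3)
p_large    # close to 0.05, in line with the fixest table
```

The t-statistic is identical in all three tables; only the reference distribution changes. With few clusters the t(G-1) distribution has noticeably heavier tails, so the same statistic yields a larger p-value and a wider interval.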