
Just Another R Package to Calculate Robust Standard Errors for Linear Models

As the title suggests, this is Just Another Robust Standard Errors R package for linear models. The main goal of writing this package was convenience: as robust standard error calculations have become the norm in the social sciences, it seemed good to have a simple way to implement them.

The package overwrites the summary.lm S3 method to add two optional arguments, robust and cluster. Hence, when the package is loaded, a warning message should appear that the summary.lm method was "masked." All robust covariance matrix calculations are done in C++ using the Armadillo linear algebra library, sourced via the RcppArmadillo package. The code should therefore be faster than alternative packages.

Right now, the package implements only a small number of robust covariance matrix estimators. Also, the functions provided will work only on linear regression models (i.e., R objects of class lm). The package might be extended in the future to other models. But for now, this is all I've got.

Installation and Basic Usage

The general usage of the package is quite simple. First, the package needs to be installed and attached to the search path:

devtools::install_github("baruuum/jars")
library("jars")

Thereafter, heteroskedasticity and cluster-robust standard errors can be calculated by specifying the options robust and cluster, respectively.

Example: Heteroskedasticity-robust Covariance Matrix

To show how the package might be put to work, let us first simulate some data.

set.seed(15232)

# obs
n = 5000

# predictors
x1 = runif(n)
x2 = rnorm(n)

# clusters
z = sample(1:6, n, replace = TRUE)

# outcome
y = 2 + runif(1, -1, 1) * x1 + runif(1, -1, 1) * x2 + rnorm(n)

# fit linear model
m = lm(y ~ x1 + x2)

After this, we can calculate heteroskedasticity-robust standard errors by using the summary function:

summary(m, robust = TRUE)

To see significance stars, we might change the default stars = FALSE option to TRUE in the print method:

mysum = summary(m, robust = TRUE)
print(mysum, stars = TRUE)

By default, the robust = TRUE option calculates HC1-type robust standard errors. Right now, the package supports HC0 to HC4. These estimators differ only in the degrees-of-freedom correction they apply to reduce finite-sample bias. To estimate a different type of robust standard errors, the type option can be used. For example, to obtain HC3-type robust standard errors, we just add the type = "HC3" option:

summary(m, robust = TRUE, type = "HC3")
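As a cross-check on what the robust option computes, the HC1 sandwich estimator can be reproduced in a few lines of base R. The following is a self-contained sketch of the standard formula, not the package's internal C++ code; all variable names here are local to the example:

```r
# simulate heteroskedastic data and fit a model
set.seed(123)
n  <- 1000
x1 <- runif(n)
x2 <- rnorm(n)
y  <- 2 + x1 - x2 + rnorm(n) * (1 + x1)   # error variance depends on x1
fit <- lm(y ~ x1 + x2)

X <- model.matrix(fit)
e <- residuals(fit)
k <- ncol(X)

bread <- solve(crossprod(X))   # (X'X)^{-1}
meat  <- crossprod(X * e)      # X' diag(e^2) X
V_hc1 <- bread %*% meat %*% bread * n / (n - k)   # HC1 df correction
sqrt(diag(V_hc1))              # heteroskedasticity-robust standard errors
```

The HC0 estimator is the same expression without the n / (n - k) factor; the higher HC types rescale the squared residuals by functions of the hat values instead.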

Example: Clustering-robust Covariance Matrix

The code to compute cluster-robust standard errors looks quite similar. The only difference is that we have to pass to the summary function an integer vector indicating each observation's cluster membership. We simulated such a vector above (z), which we can use here:

summary(m, cluster = z, type = "HC1")

Again, the HC1-type degrees-of-freedom correction is used by default. HC1 is also the default in Stata when the robust or vce(cluster) option is used. For clustered standard errors, only the HC0 and HC1 types are currently available.
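The cluster-robust (CR-HC1) calculation can likewise be sketched in base R. This self-contained example sums the score contributions within each cluster before forming the "meat" matrix; again, the variable names are local to the sketch:

```r
# simulate clustered data
set.seed(321)
n <- 1200
g <- sample(1:6, n, replace = TRUE)       # cluster ids
x <- rnorm(n)
y <- 1 + x + rnorm(6)[g] + rnorm(n)       # cluster-level noise
fit <- lm(y ~ x)

X <- model.matrix(fit)
e <- residuals(fit)
k <- ncol(X)
m_cl <- length(unique(g))                 # number of clusters

S     <- rowsum(X * e, g)                 # X_g' e_g, one row per cluster
meat  <- crossprod(S)                     # sum_g X_g' e_g e_g' X_g
bread <- solve(crossprod(X))
V_cr1 <- bread %*% meat %*% bread *
  (m_cl / (m_cl - 1)) * ((n - 1) / (n - k))
sqrt(diag(V_cr1))                         # cluster-robust standard errors
```

Dropping the two scalar correction factors gives the HC0-type cluster-robust estimator.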

A note of caution: the clustering variable should contain no missing values. In the current implementation, cluster-robust standard errors are calculated after dropping the observations for which the clustering variable is missing. As a result, the sample from which the coefficients of the linear model are estimated and the sample from which the standard errors are estimated will differ. This behavior contrasts with that of the sandwich package, where observations with a missing clustering variable are treated as their own cluster. Both the sandwich package and this package throw a warning if any missing values on the clustering variable are found in the data.
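One way to avoid the sample mismatch described above is to drop rows with a missing cluster id before fitting, so that the coefficients and the standard errors come from the same observations. A minimal sketch, where the data frame and its columns are hypothetical names:

```r
# hypothetical data with one missing cluster id
set.seed(7)
df <- data.frame(
  y = rnorm(12),
  x = rnorm(12),
  z = c(NA, sample(1:3, 11, replace = TRUE))
)

# drop incomplete rows *before* fitting, so the estimation sample
# and the standard-error sample coincide
df_cc <- df[!is.na(df$z), ]
fit <- lm(y ~ x, data = df_cc)
# summary(fit, cluster = df_cc$z)   # jars call, as described above
```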

Different Degrees-of-freedom Corrections

Under heteroskedasticity, the covariance matrix of the OLS estimator $\hat\beta$ is

$$\text{Cov}[\hat\beta\,\vert\,X] = (X'X)^{-1}X'\Omega X(X'X)^{-1},$$

where $\Omega$ is the covariance matrix of the error terms. The essence of robust covariance matrix estimation boils down to estimating the matrix $X'\Omega X$, which is sometimes called the "meat" matrix.

For the cluster-robust estimator, two corrections instead of one are made. If we have $m$ clusters in the data, the first correction multiplies the expression $(X'X)^{-1}X'\Omega X (X'X)^{-1}$ by $m / (m - 1)$ to account for the finite number of clusters. The second degrees-of-freedom correction has the same form as that of the heteroskedasticity-robust estimator and is multiplied onto the expression as well. Thus, for the HC1 type of cluster-robust covariance matrix estimator, we would estimate $X'\Omega X$ by

$$\left( \frac{n - 1}{n - k} \right)\left(\frac{m}{m - 1}\right)X'\hat\Omega X,$$

where $\hat\Omega$ is a block-diagonal matrix with blocks $e_g e_g'$, with $e_g$ collecting the residuals of cluster $g$. (Notice the slight difference: we use $n - 1$ instead of $n$ in the first correction. Under this formulation, the cluster-robust covariance matrix reduces to the heteroskedasticity-robust covariance matrix if we treat each individual observation as its own cluster.)
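The equivalence claimed in the parenthesis can be checked numerically: with every observation as its own cluster ($m = n$), the combined factor $\frac{n-1}{n-k}\cdot\frac{n}{n-1}$ collapses to $\frac{n}{n-k}$, the HC1 factor. A small self-contained base-R sketch:

```r
set.seed(99)
n <- 60
x <- rnorm(n)
y <- 1 + x + rnorm(n)
fit <- lm(y ~ x)

X <- model.matrix(fit); e <- residuals(fit); k <- ncol(X)
bread <- solve(crossprod(X))

# HC1: n/(n - k) times the basic sandwich
V_hc1 <- bread %*% crossprod(X * e) %*% bread * n / (n - k)

# CR1 with each observation as its own cluster (m = n)
S <- rowsum(X * e, seq_len(n))
V_cr1 <- bread %*% crossprod(S) %*% bread *
  (n / (n - 1)) * ((n - 1) / (n - k))

all.equal(V_hc1, V_cr1)
```

The two matrices agree up to numerical tolerance, as the algebra above predicts.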



baruuum/jars documentation built on Nov. 3, 2019, 2:06 p.m.