In sondreus/QuickReg: Quick and Easy OLS Interface

Easy OLS with options in R

options(width = 100)
library(stargazer)
library(zoo)
library(sandwich)
library(lmtest)

The QuickReg package and associated function provides an easy interface for linear regression in R. This includes the option to request robust and clustered standard errors (equivalent to STATA's ", robust" option), automatic labeling, an easy way to specify multiple regression specifications simultaneously, and a compact html or latex output (relying on the widely used "stargazer" package).

QuickReg also includes a new method to speed up OLS computation. In particular, it offers the option to implement a fixed effect demeaning procedure which demeans a set of covariates and then shares this across multiple regression specifications. In tests (reported below), this reduces calculation time by more than 60 percent for analysis with a large number of fixed effects compared to base R. This relative performance gain is increasing in the number of specifications passed to the function simultaneously.

Written by Sondre U. Solstad, Princeton University (ssolstad@princeton.edu). Send me an email if you find this package useful or want to suggest an improvement or feature.

Installation instructions:

library(devtools)
install_github("sondreus/QuickReg")

Example:

library(QuickReg)

# Loading data
mydata <- readRDS("3d_example.RDS")

# Use the QuickReg to produce a regression table     
QuickReg(data = mydata, 
         iv.vars = c("upop", "log_gdppc_mad", "SDI", "SDT", "war", "polity2"), 
         iv.vars.names = c("Urban Population", "Log(GDPPC)", "Spatial distance to income", 
          "Spatial distance to technology", "At War", "Polity2 score"), 
         dv.vars = c("log_adoption_lvl_pc", "distance_to_frontier"), 
         dv.vars.names = c("Technology Adoption Level", "Distance to Frontier"), 
         specifications = list( c(1, 4, 5, 6),
                        c(1, 2, 3, 4),
                        c(1, 3, 4)), 
         fixed.effects = c("ccode", "technology", "year"),
         fixed.effects.names = c("Country FE", "Year FE", "Tech. FE"), 
         robust.se = TRUE,
         type = "html",
         silent = TRUE
         )

See also the resultant html file: QuickReg.html

Arguments:

data - Data frame in which all model variables are located.
iv.vars - Vector of independent variable names in dataset (e.g. c("gdppc", "pop"))
iv.vars.names - (Optional) Vector of desired independent variable names in table output (e.g. c("GDP per capita", "Population")). Defaults to values in "iv.vars" if none provided.
dv.vars - Vector of dependent variable in dataset (e.g. c("democracy", "war"))
dv.vars.names - (Optional) Vector of desired dependent variable names in table output (e.g. c("Democracy (Boix-Rosato-Miller 2012)", "War (with at least 1000 battle deaths)")). Defaults to values in "dv.vars" if none provided.
specifications - (Optional) List of desired regression specifications (selections of independent variables). The list of regression specifications are applied to all dependent variables. E.g. list(c(1), c(1,2), c(2))).
fixed.effects - (Optional) Vector of desired fixed effect variable names in dataset (e.g. c("region", "year"))
fixed.effects.names - (Optional) Vector of desired fixed effects labels in table output (e.g. c("Region FE", "Year FE")). Defaults to values in "fixed.effects" if none provided.
fixed.effects.specifications - (Optional) List of desired fixed effect specifications (selections of independent variables). These specifications are applied in sequence from the first to last model. If the number of specifications is less than the number of models, all fixed effects are applied in the remaining columns by default. If none provided, defaults to all fixed effects in all models.
robust.se - (Optional) If TRUE, returns robust standard errors calculated using a sandwich estimator from the "sandwich" package. Defaults to FALSE (i.e. normal standard errors).
cluster - (Optional) Name of variable in dataset by which cluster-robust standard errors should be computed using the cluster.vcov command of the multiwayvcov package.
cluster.names - (Optional) Desired name or label of clustering variable to be reported in table output (e.g. "Country" yields a note on the bottom of the table reading "Country-Clustered Standard Errors in Parenthesis"). If cluster specified but no "cluster.names" provided, "Cluster-Robust Standard Errors in Parenthesis" is reported.
table.title - (Optional) Specifies the title of the table with regression output. Defaults to "QuickReg" plus the date and time of creation in parenthesis.
out.name - (Optional) Specifies the output file name. Defaults to "QuickReg.html".
dynamic.out.name - (Optional) If TRUE, adds date and time of creation in brackets between the out.name and the file extension (e.g. QuickReg (2017-04-05-14-01-27).html)
html.only - (Optional) If TRUE, no latex output produced (only HTML table). Defaults to FALSE.
type - (Optional) Specifies the type of table output that will be requested from Stargazer. Possible values are: "latex", "html", and "text". Defaults to "latex".
silent - (Optional) If TRUE, no messages are returned by the function. Defaults to FALSE.
save.fits - (Optional) If TRUE, saves fitted lm objects in a list by the name "QuickReg.fits" adding an integer if an object by this name already exists. Defaults to FALSE.
demeaning.acceleration - (Optional) If TRUE, attempts to speed up regression by the method of alternating projections. In particular, it utilizes the "demeanlist" function of the "lfe" package to create a matrix of all covariates demeaned by all fixed effects, and then fits the different regression specifications on this demeaned matrix. Time saved is increasing in the number of fixed effects, specifications and observations, and this method is slower when all these are low. If there are thousands of fixed effects and many specifications, time saved is potentially quite large. Note: Overrides fixed.effects.specifications, always including all variables specified in fixed.effects, and does not supply R-squared or other model statistics. Defaults to FALSE.
... - Various options passed to the stargazer function. In particular: stargazer.digits = integer of number of digits to be displayed, stargazer.font.size = font size (e.g. "tiny") if output is latex (no font size is imposed by default), stargazer.style = table style (see "?stargazer_style_list"), stargazer.omit.stat = character vector of model statistics to be omitted from table output.

Explanation and detail

The QuickReg function is meant to provide a comprehensive and convenient linear regression interface in R. It has been designed with the objective of being intuitive and easy to use at default settings, but with enough options for advanced users. Most importantly, the function is meant to facilitate a smooth, quick and productive workflow.

QuickReg is designed to work seamlessly with knitr and Rmarkdown, and allows output to be requested from stargazer in "latex", "html", or "text" format.

To illustrate the use of QuickReg, consider a researcher considering the linear relationships between a few variables.

N <- 1000
mydata <- cbind.data.frame(rnorm(N), rnorm(N), rnorm(N), rnorm(N), rnorm(N), 
                           rep(seq(1:10), N/10), sample(1:10, N, replace = TRUE))
colnames(mydata) <- c("y", "alternative.y", "x1", "x2", "x3", "group1", "group2")

Let's fit a simple regression in base R:

# Standard R
summary(lm(y ~ x1, data = mydata))

And then in QuickReg:

# QuickReg
QuickReg(data = mydata, dv.vars = "y", iv.vars = "x1", type = "html")

Suppose we also are interested in the effects of "x2" and "x3".

Base R:

# Standard R
summary(lm(y ~ x1 + x2 + x3, data = mydata))

QuickReg:

# QuickReg
QuickReg(data = mydata, dv.vars = "y", iv.vars = c("x1", "x2", "x3"), type = "html")

But what are the unconditional effects of x2 and x3, and how do they compare with x1?

Base R:

# Standard R
summary(lm(y ~ x1, data = mydata))
summary(lm(y ~ x2, data = mydata))
summary(lm(y ~ x3, data = mydata))

QuickReg:

# QuickReg
QuickReg(data = mydata, dv.vars = "y", iv.vars = c("x1", "x2", "x3"), specifications = list(1, 2, 3), 
         type = "html")

Changing or adding specifications is faster in QuickReg, and the results are collected and presented in an easy to read table format instead of being offered one-after-the-other.

We might also want robust standard errors or standard errors clustered at "group1":

Base R: See this guide by Drew Dimmery (it involves specifying a custom function, and then passing fitted models throught the function one at a time).

QuickReg: simply select "robust.se = TRUE" or "cluster ='clustering variable'":

# QuickReg
QuickReg(data = mydata, dv.vars = "y", iv.vars = c("x1", "x2", "x3"), specifications = list(1, 2, 3), 
         type = "html", robust.se = TRUE)

QuickReg(data = mydata, dv.vars = "y", iv.vars = c("x1", "x2", "x3"), specifications = list(1, 2, 3), 
         type = "html", cluster = "group1")

Let us also try a few more combinations using QuickReg:

# QuickReg
QuickReg(data = mydata, dv.vars = "y", iv.vars = c("x1", "x2", "x3"), 
         specifications = list(1, 2, 3, c(1, 3), c(1,2), c(2, 3),  c(1,2,3)), 
         type = "html", robust.se = TRUE)

Or try adding another DV:

# QuickReg
QuickReg(data = mydata, dv.vars = c("y", "alternative.y"), iv.vars = c("x1", "x2", "x3"), specifications = list(1, 2, 3),
         type = "html", robust.se = TRUE)

Or fixed effects:

# QuickReg
QuickReg(data = mydata, dv.vars = c("y", "alternative.y"), iv.vars = c("x1", "x2", "x3"), fixed.effects = c("group1", "group2"), specifications = list(1, 2, 3), 
         type = "html", robust.se = TRUE)

Lastly, we can make it look better by adding labels and titles:

# QuickReg
QuickReg(data = mydata,
         table.title = "My Regression Results", 
         dv.vars = c("y", "alternative.y"), 
         dv.vars.names = c("Outcome", "Alternative Outcome"),
         iv.vars = c("x1", "x2", "x3"),
         iv.vars.names = c("Variable 1", "Variable 2", "Variable 3"),
         fixed.effects = c("group1", "group2"),
         fixed.effects.names = c("Group 1 FE", "Group 2 FE"),
         specifications = list(1, 2, 3), 
         cluster = "group1", 
         cluster.names = "Group 1",
         type = "html")

It is worth noting that despite QuickReg's number of options and different syntax than base R, the function's setup cost is low: tests suggest about 1/5th of a second.

Demeaning Acceleration:

With a large number of fixed effects, regression analysis can take a very long time. QuickReg offers a solution to this problem. First, QuickReg implements the method of alternating projections, which takes advantage of the fact that fixed effects are equivalent to "demeaning" covariates by the levels of the fixed effects. If the number of fixed effects are large, it can be faster to demean than to invert matricies. This procedure is implemented through the demean.list function in the lfe package.

Secondly, and more importantly for speed purposes, one often wants to calculate results for a number of different specifications of IVs and DVs with the same fixed effects and sample of observations. QuickReg is suitable for such cases for several reasons, the first being that it provides a convenient interface for listing specifications, and second because it summarizes model results in the familiar and concise table format with columns corresponding to different models. QuickReg is also able to speed up the calculations of such tables significantly by applying the demeaning procedure to a single covariate matrix shared by all specifications. While standard implementations calculate fixed effects repeatedly for different specifications, it is here only done once, and then shared across all specifications. Two words of caution are in order: (1) this limits calculations to a common sample, and (2) it makes fitted objects' model statistics (e.g. R-squared) meaningless (these are removed from the resultant table automatically). Coefficient confidence intervals can and are however still calculated correctly, and all QuickReg options (including to calculate robust SEs) are available.

library("microbenchmark")

N <- 1000
mydata <- cbind.data.frame(rnorm(N), rnorm(N), rnorm(N), rnorm(N), rnorm(N), 
                           rep(seq(1:100), N/100), sample(1:100, N, replace = TRUE))
colnames(mydata) <- c("y", "alternative.y", "x1", "x2", "x3", "group1", "group2")

# Testing performance gain:
speed.test <- suppressWarnings(microbenchmark(QuickReg(data = mydata, iv.vars = c("x1", "x2", "x3"),
         dv.vars = c("y", "alternative.y"),
         specifications = list( c(1),
                        c(2, 3), 
                        c(1, 2, 3)), 
         fixed.effects = c("group1", "group2"), 
         html.only = TRUE,
         silent = TRUE,
         out.name = "QuickReg.normal",

         # Demeaning acceleration is set to FALSE (default)
         demeaning.acceleration = FALSE
         ),
         QuickReg(data = mydata, iv.vars = c("x1", "x2", "x3"),
         dv.vars = c("y", "alternative.y"),
         specifications = list( c(1),
                        c(2, 3), 
                        c(1, 2, 3)), 
         fixed.effects = c("group1", "group2"), 
         html.only = TRUE,
         silent = TRUE,
         out.name = "QuickReg.fast",

         # Demeaning acceleration is set to TRUE (not default)
         demeaning.acceleration = TRUE
         ), 

         # Specifying number of trails. 
          times = 20))
speed.test

print( paste("Total number of observations:", N))
print( paste( "Total number of fixed effects:", length(unique(mydata[, "group1"])) + length(unique(mydata[, "group2"]))))

In the above example, QuickReg's acceleration reduced the time spent by more than 60 percent relative to the standard R (the "lm()"-function which QuickReg relies on by default). Gains can be even larger when we increase the number of fixed effects, as in the below example:

library("microbenchmark")

N <- 10000
mydata <- cbind.data.frame(rnorm(N), rnorm(N), rnorm(N), rnorm(N), rnorm(N), 
                           rep(seq(1:1000), N/1000), sample(1:100, N, replace = TRUE))
colnames(mydata) <- c("y", "alternative.y", "x1", "x2", "x3", "group1", "group2")

# Testing performance gain:
speed.test <- suppressWarnings(microbenchmark(QuickReg(data = mydata, iv.vars = c("x1", "x2", "x3"),
         dv.vars = c("y", "alternative.y"),
         specifications = list( c(1),
                        c(2, 3), 
                        c(1, 2, 3)), 
         fixed.effects = c("group1", "group2"), 
         html.only = TRUE,
         silent = TRUE,
         out.name = "QuickReg.normal.html",

         # Demeaning acceleration is set to FALSE (default)
         demeaning.acceleration = FALSE
         ),
         QuickReg(data = mydata, iv.vars = c("x1", "x2", "x3"),
         dv.vars = c("y", "alternative.y"),
         specifications = list( c(1),
                        c(2, 3), 
                        c(1, 2, 3)), 
         fixed.effects = c("group1", "group2"), 
         html.only = TRUE,
         silent = TRUE,
         out.name = "QuickReg.fast.html",

         # Demeaning acceleration is set to TRUE (not default)
         demeaning.acceleration = TRUE
         ), 

         # Specifying number of trails. 
          times = 10))
speed.test

print( paste("Total number of observations:", N))
print( paste( "Total number of fixed effects:", length(unique(mydata[, "group1"])) + length(unique(mydata[, "group2"]))))

Acknowledgements

This package relies on the stargazer package by Marek Hlavac, the sandwich package by Thomas Lumley and Achim Zeileis, the lfe package by Simen Gaure, the multivcov package by Nathaniel Graham and Mahmood Arai and Björn Hagströmer, and the lmtest package by Torsten Hothorn, Achim Zeileis, Richard W. Farebrother, Clint Cummins, Giovanni Millo, and David Mitchell.