bigKRLS

Kernel Regularized Least Squares (KRLS) is a kernel-based, complexity-penalized method developed by Hainmueller and Hazlett (2013) and designed to minimize parametric assumptions while maintaining interpretive clarity. Here, we introduce bigKRLS, an updated version of the original KRLS R package with algorithmic and implementation improvements designed to optimize speed and memory usage. These improvements allow users to straightforwardly estimate pairwise regression models with KRLS even when N > 2,500. bigKRLS has been available on CRAN since April 15, 2017. You may also be interested in our working paper, which has been accepted by Political Analysis and which demonstrates the utility of bigKRLS by analyzing the 2016 US presidential election. Our replication materials can be found on Dataverse, and our GitHub repo contains examples too.

Major Updates found in bigKRLS

C++ integration. We re-implement most major computations in the model in C++ via Rcpp and RcppArmadillo. These changes produce up to a 50% runtime decrease compared to the original R implementation.
Leaner algorithm. Because of the Tikhonov regularization and parameter tuning strategies used in KRLS, the method of estimation is inherently memory-heavy (O(N^2)), making memory savings important even in small- and medium-sized applications. We develop and implement a new marginal effects algorithm, which reduces peak memory usage by approximately an order of magnitude, and we cut the number of computations needed to find the regularization parameter in half.
Improved memory management. Most data objects in R perform poorly in memory-intensive applications. We use a series of packages in the bigmemory environment to ease this constraint, allowing our implementation to handle larger datasets more smoothly.
Parallel processing. In addition to the single-core algorithmic improvements, parallel processing obtains the pointwise marginal effects substantially faster.
Interactive data visualization. We've designed an R Shiny app that allows bigKRLS users to easily share results with collaborators or more general audiences. Simply call shiny.bigKRLS().
Honest p values. bigKRLS now computes p values that reflect both the regularization process and the number of predictors. For details on how the effective sample size is calculated as well as other options, see help(summary.bigKRLS).

out <- bigKRLS(y, X)
out$Neffective
summary(out)

Cross-validation, including K-fold cross-validation. crossvalidate.bigKRLS performs CV and stores a number of in-sample and out-of-sample statistics, as well as metadata documenting how the data were split and the bigmemory file structure (if applicable).

cv <- crossvalidate.bigKRLS(y, X, seed = 2017, ptesting = 20)
kcv <- crossvalidate.bigKRLS(y, X, seed = 2017, Kfolds = 5)

See vignette("bigKRLS_basics") for details.
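
To make the two modes concrete, here is a minimal base R sketch of the row assignments behind a single train/test split and a K-fold split. It illustrates the general technique only and is not bigKRLS's internal code: the helper names `holdout_split` and `kfold_ids` are hypothetical, and the sketch assumes that `ptesting = 20` means 20% of observations are held out for testing.

```r
# Hypothetical helpers illustrating the two kinds of splits; not part of bigKRLS.

# Single split: hold out `ptesting` percent of the rows for testing.
holdout_split <- function(N, ptesting = 20, seed = 2017) {
  set.seed(seed)
  test <- sort(sample(N, size = round(N * ptesting / 100)))
  list(train = setdiff(seq_len(N), test), test = test)
}

# K folds: give every row a fold label 1..K, with fold sizes as equal as possible.
kfold_ids <- function(N, Kfolds = 5, seed = 2017) {
  set.seed(seed)
  sample(rep(seq_len(Kfolds), length.out = N))
}

sp    <- holdout_split(N = 100, ptesting = 20)
folds <- kfold_ids(N = 100, Kfolds = 5)

length(sp$test)   # number of held-out rows
table(folds)      # rows per fold
```

In the K-fold case, each fold serves once as the test set while the model is fit on the remaining rows.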

Eigentruncation. bigKRLS now supports two types of eigentruncation to decrease runtime.

out <- bigKRLS(y, X, eigtrunc = 0.001)     # defaults to 0.001 if N > 3000 and 0 otherwise
out <- bigKRLS(y, X, Neig = 100)           # compute only 100 eigenvectors and eigenvalues (defaults to Neig = nrow(X))
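
For intuition about the first option, the following base R sketch shows what truncating by relative eigenvalue size can look like. It is a simplified illustration under stated assumptions, not the package's C++ implementation: it assumes that `eigtrunc` keeps only eigenvalues at least `eigtrunc` times the largest one, uses a Gaussian kernel with bandwidth equal to the number of columns, and fixes `lambda` rather than tuning it. The helper `solve_coefs` and all variable names are hypothetical.

```r
# Illustrative sketch only: base R, fixed lambda, hypothetical variable names.
set.seed(1)
N <- 200
X <- scale(matrix(rnorm(N * 3), N, 3))
y <- as.numeric(scale(sin(X[, 1]) + X[, 2]^2 + rnorm(N, sd = 0.1)))

# Gaussian kernel; bandwidth set to the number of columns (an assumption here).
K <- exp(-as.matrix(dist(X))^2 / ncol(X))
eig <- eigen(K, symmetric = TRUE)   # eigenvalues come back in decreasing order

lambda   <- 0.5     # regularization parameter, fixed for illustration (not tuned)
eigtrunc <- 0.001

# Ridge-type coefficients from an eigendecomposition: V diag(1 / (e + lambda)) V'y
solve_coefs <- function(vectors, values, y, lambda) {
  vectors %*% (crossprod(vectors, y) / (values + lambda))
}

# Assumed rule: keep eigenvalues that are at least eigtrunc * the largest one.
keep <- eig$values >= eigtrunc * eig$values[1]

coefs_full  <- solve_coefs(eig$vectors, eig$values, y, lambda)
coefs_trunc <- solve_coefs(eig$vectors[, keep, drop = FALSE],
                           eig$values[keep], y, lambda)

sum(keep)                                        # eigenpairs retained out of N
max(abs(K %*% coefs_full - K %*% coefs_trunc))   # change in fitted values
```

Because the discarded eigenvalues are small relative to `lambda`, dropping them changes the fitted values only slightly while shrinking the matrices involved in solving for the coefficients.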

Installation

bigKRLS requires a series of packages--notably bigmemory, Rcpp, and RcppArmadillo--current versions of which require up-to-date versions of R and its compilers (RStudio, if used, must be current as well). To install the latest stable version from CRAN:

install.packages("bigKRLS")

To install the GitHub version, use standard devtools syntax:

install.packages("devtools")
library(devtools)
install_github('rdrr1990/bigKRLS')

New users may wish to see our installation notes for specifics.

Getting Going...

For details on syntax, load the library and then open our vignette:

library(bigKRLS)
vignette("bigKRLS_basics")

Because of the quadratic memory requirement, users working on a typical laptop (8-16 gigabytes of RAM) may wish to start at N = 2,500 or 5,000, particularly if the number of x variables is large. When you have a sense of how bigKRLS runs on your system, you may wish to estimate only a subset of the marginal effects at N = 10,000-15,000 by setting bigKRLS(..., which.derivatives = c(1, 3, 5)) for the marginal effects of the first, third, and fifth x variables.
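
As a rough guide to why these sample sizes matter, the sketch below estimates the size of a single dense N x N matrix of doubles, such as the kernel. The helper `kernel_gb` is hypothetical and not part of bigKRLS, and the figures are back-of-the-envelope arithmetic rather than measured memory usage.

```r
# Back-of-the-envelope estimate: one dense N x N matrix of doubles (8 bytes each).
# `kernel_gb` is a hypothetical helper, not part of bigKRLS.
kernel_gb <- function(N) N^2 * 8 / 1e9

N <- c(2500, 5000, 10000, 15000)
data.frame(N = N, GB_per_matrix = kernel_gb(N))
```

Doubling N quadruples the footprint of each such matrix. Actual peak usage will be some multiple of these figures, since more than one matrix of this size is typically in memory at a time; the exact multiple depends on the options chosen.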