bigKRLS basics

bigKRLS

Complex models are of increasing interest to social scientists. Researchers focused on prediction generally favor flexible, robust approaches, while those focused on causation often need to model nuanced treatment structures and confounding relationships. Unfortunately, estimators of complex models often scale poorly, especially if they seek to maintain interpretability.

Kernel Regularized Least Squares (KRLS), a kernel-based, complexity-penalized method developed by Hainmueller and Hazlett (2013), exemplifies this tradeoff. KRLS offers a desirable balance of interpretability, flexibility, and theoretical guarantees, primarily through the pointwise marginal derivative estimates produced by the estimation routine and their corresponding averages. However, these pointwise marginal derivatives are costly to estimate in both time and memory.

Here, we introduce bigKRLS, an updated version of the original KRLS R package with algorithmic and implementation improvements designed to optimize speed and memory usage. These improvements allow users to straightforwardly fit KRLS models to medium and large data sets (N > ~2,500).

Regression with bigKRLS

The bigKRLS() function is the workhorse of this package. There are only two basic inputs: a vector of N observations on the dependent variable, y, and an N x P matrix X, where P is the number of independent variables (i.e., ncol(X)).^[X and y should contain only numeric data (no missing data, factors, or vectors of constants) and may be base R matrices or "big" matrices (from bigmemory).]

Begin by loading the mtcars data:

library(bigKRLS)
mtcars[1:5,]
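
Since bigKRLS() expects clean numeric inputs, a quick sanity check along the lines below can save a failed run. This is a minimal sketch using only base R; the object names are illustrative:

X <- as.matrix(mtcars[,-1])  # all-numeric predictor matrix
y <- as.matrix(mtcars$mpg)   # N x 1 outcome
stopifnot(is.numeric(X), !anyNA(X), !anyNA(y),  # numeric, no missing values
          all(apply(X, 2, var) > 0))            # no constant columns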

Suppose we want to regress fuel efficiency on the other observables. Unlike classical regression, the KRLS algorithm does not directly estimate slope coefficients for particular variables. However, we can recover a similar set of quantities after estimation. In particular, the KRLS functional form allows users to calculate the marginal derivative dy/dx~p~ at each observation, which can then be inspected or averaged to characterize the effect of each variable.
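
Concretely, KRLS fits the outcome as a regularized, weighted sum of Gaussian kernels centered at the observed data points. A brief sketch of the standard KRLS form (following Hainmueller and Hazlett 2013):

$$\hat{y}(x) = \sum_{i=1}^{N} \hat{c}_i \, e^{-\lVert x - x_i \rVert^2 / \sigma^2}, \qquad \hat{c} = (K + \lambda I)^{-1} y,$$

so the pointwise derivative with respect to x~p~ follows from the chain rule:

$$\frac{\partial \hat{y}(x)}{\partial x_p} = -\frac{2}{\sigma^2} \sum_{i=1}^{N} \hat{c}_i \left(x_p - x_{ip}\right) e^{-\lVert x - x_i \rVert^2 / \sigma^2}.$$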

To estimate a model, use the bigKRLS() function:

reg.out <- bigKRLS(y = as.matrix(mtcars$mpg), 
                   X = as.matrix(mtcars[,-1]), Ncores = 1)

Inspect the results using summary():

summary(reg.out)

The model's average marginal effect (AME) estimates can be interpreted similarly to the regression coefficients of a linear model, though estimates from the two methods may differ substantially depending on the degree of effect heterogeneity. The "Percentiles of the Marginal Effects" speak to whether y is a monotonic function of x~p~ and to the extent, if any, to which the effect of x~p~ on y is homogeneous. In this toy data set, the number of cylinders is not a statistically significant predictor of fuel efficiency; consistent with that, the marginal effect of cylinders is negative for only about half of the cars. By contrast, horsepower has a more uniformly negative effect on fuel efficiency.
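
These summary quantities can also be recovered by hand from the matrix of pointwise derivatives, which the fitted object stores as reg.out$derivatives (one column per predictor). For example, the column means should reproduce the AMEs reported by summary():

colMeans(reg.out$derivatives)            # average marginal effects
apply(reg.out$derivatives, 2, quantile)  # percentiles of the marginal effects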

Suppose a user wanted to plot how similar a Toyota Corolla is to the other four-cylinder cars:

# kernel similarities between the Corolla and each four-cylinder car
s <- reg.out$K[which(mtcars$cyl == 4), grep("Corolla", rownames(mtcars))]
barplot(s, main = "Similarity to a Toyota Corolla",
        ylab = "Kernel", sub = "Toy Data from mtcars", cex.names = .7,
        col = colorRampPalette(c("red", "blue"))(length(s))[rank(s)],
        names.arg = lapply(strsplit(rownames(mtcars), split = " "),   # label by model name only
                           function(x) x[2])[which(mtcars$cyl == 4)])

Unsurprisingly, a Corolla is more similar to a Civic than a Porsche 914.

Marginal effects

As noted above, fuel efficiency appears to be negatively related to horsepower. However, the effect of horsepower on fuel efficiency may not be monotonic:

scatter.smooth(mtcars$hp, reg.out$derivatives[,3],
               xlab = "Horsepower", ylab = "HP's Effect",
               main = "Horsepower's Marginal Effect on Fuel Efficiency",
               pch = 19, bty = "n",
               col = colorRampPalette(c("blue", "red"))(nrow(mtcars))[rank(reg.out$coeffs^2)],
               ylim = c(-0.042, 0.015), xlim = c(50, 400))
abline(h = 0, lty = "dashed")

The above graph suggests that, though lower-horsepower cars are generally more fuel-efficient, beyond a certain threshold that relationship weakens.
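
To put a rough number on that threshold, one can check which observations have non-negative estimated derivatives (a small sketch reusing the derivatives matrix from above):

sort(mtcars$hp[reg.out$derivatives[, 3] >= 0])  # hp values where the estimated effect is non-negative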

Cross-Validation

For users interested in out-of-sample predictive accuracy, the package also provides a convenient cross-validation function, crossvalidate.bigKRLS():

CV.out <- crossvalidate.bigKRLS(y = as.matrix(mtcars$mpg),
                                X = as.matrix(mtcars[,-1]),
                                seed = 123, Kfolds = 4, Ncores = 1)
cor(CV.out$fold_3$tested$predicted, CV.out$fold_3$tested$ytest)

For fold 3, the predicted and actual values correlate at r = `r round(cor(CV.out$fold_3$tested$predicted, CV.out$fold_3$tested$ytest), 2)`. To see a summary of the results, use summary():

summary(CV.out$fold_1$trained) # not run

CV.out contains several test statistics including:

CV.out$MSE_is
CV.out$MSE_oos
CV.out$R2_oos
CV.out$R2AME_oos

The first two statistics are the in-sample and out-of-sample mean squared error. The second two are out-of-sample R^2 statistics: R2_oos reflects the performance of the full model (all N coefficients), while R2AME_oos reflects only the portion of the model that is linear and additive in X (the average marginal effects). To do a simple train-and-test split instead of K-fold cross-validation, leave Kfolds blank and set ptesting, as sketched below.
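
For instance, a 75/25 train-test split might look like the following (this sketch assumes ptesting is expressed as the percentage of the data to hold out for testing):

TT.out <- crossvalidate.bigKRLS(y = as.matrix(mtcars$mpg),
                                X = as.matrix(mtcars[,-1]),
                                seed = 123, ptesting = 25, Ncores = 1)  # not run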

Shiny

To interact with your results in a pop-up window or your browser, simply call:

shiny.bigKRLS(reg.out)         # not run

To strip out the big square matrices (for example, so that the results can be posted to a server), use the export option:

shiny.bigKRLS(reg.out, export = TRUE)         # not run

The output will describe the new, more compact object that has been created.

Predicting with Out-of-Sample Data

Suppose a user wanted to know what percentage of cars would have lower gas mileage if they had 200 horsepower.

Xnew <- mtcars[,-1]
Xnew$hp <- 200
forecast <- predict(reg.out, as.matrix(Xnew))
mean(forecast$predicted < mtcars$mpg)

Approximately `r round(100*mean(forecast$predicted < mtcars$mpg), 1)`% of cars would be less efficient in this scenario.

"Big" File Management

When working with large data sets (N > 2,500), bigKRLS uses bigmemory objects for data management. As a result, calling save() and load() on a bigKRLS object will crash your R session. Instead, save the results in one of two ways:

out <- bigKRLS(y, X, model_subfolder_name = "my_results") # not run
save.bigKRLS(out, "my_results") # not run

Either will save the model estimates to a new subfolder called "my_results" in your current working directory. To re-load:

load.bigKRLS("my_results") # not run

When N > 2,500, the bigKRLS() function will return most model outputs as bigmemory objects, which are, in effect, pointers to data stored outside R's usual memory management; printing one displays an address rather than its contents.

Z <- big.matrix(nrow=5, ncol=5, init=1)
Z

To recover the contents of a bigmatrix, use the square brackets operator:

Z[]
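
The same bracket idiom applies to the outputs of a large-N model; for example, to pull a base R copy of the derivatives into memory (only advisable when the copy fits comfortably in RAM):

derivs <- reg.out$derivatives[]  # ordinary R matrix copy of the big matrix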

