The function implements Kernel-Based Regularized Least Squares (KRLS), a machine learning method described in Hainmueller and Hazlett (2014) that allows users to solve regression and classification problems without manual specification search or strong functional form assumptions. KRLS finds the best-fitting function by minimizing a Tikhonov regularization problem with a squared loss, using Gaussian kernels as radial basis functions. KRLS reduces misspecification bias since it learns the functional form from the data, yet it still allows for interpretability and inference in ways similar to ordinary regression models. In particular, KRLS provides closed-form estimates for the predicted values, variances, and the pointwise partial derivatives that characterize the marginal effects of each independent variable at each data point in the covariate space. The distribution of pointwise marginal effects can be used to examine effect heterogeneity and/or interactions.
X 
N by D data numeric matrix that contains the values of D predictor variables for i=1,…,N observations. The matrix may not contain missing values or constants. Note that no intercept is required since the function operates on demeaned data and subtracting the mean of y is equivalent to including an (unpenalized) intercept into the model. 
y 
N by 1 data numeric matrix or vector that contains the values of the response variable for all observations. This vector may not contain missing values. 
whichkernel 
String vector that specifies which kernel should be used. Must be one of 
lambda 
A positive scalar that specifies the lambda parameter for the regularizer (see details). It governs the trade-off between model fit and complexity. By default, this parameter is chosen by minimizing the sum of the squared leave-one-out errors. 
sigma 
A positive scalar that specifies the bandwidth of the Gaussian kernel (see 
derivative 
Logical that specifies whether pointwise partial derivatives should be computed. Currently, derivatives are only implemented for the Gaussian Kernel. 
binary 
Logical that specifies whether first differences instead of pointwise partial derivatives should be computed for binary predictors. Ignored unless 
vcov 
Logical that specifies whether the variance-covariance matrix for the choice coefficients c and fitted values should be computed. Note that 
print.level 
Positive integer that determines the level of printing. Set to 0 for no printing and 2 for more printing. 
L 
Non-negative scalar that determines the lower bound of the search window for the leave-one-out optimization to find lambda. Default is 
U 
Positive scalar that determines the upper bound of the search window for the leave-one-out optimization to find lambda. Default is 
tol 
Positive scalar that determines the tolerance used in the optimization routine used to find lambda. Default is 
eigtrunc 
Positive scalar that determines how much eigenvalues should be truncated for finding the upper bound of the search window in the algorithm outlined in 
krls implements the Kernel-based Regularized Least Squares (KRLS) estimator as described in Hainmueller and Hazlett (2014). Please consult this reference for further details.
Kernel-based Regularized Least Squares (KRLS) arises as a Tikhonov minimization problem with a squared loss. Assume we have data of the form y_i, x_i, where i indexes observations, y_i in R is the outcome, and x_i in R^D is a D-dimensional vector of predictor values. Then KRLS searches over a space of functions H and chooses the best-fitting function f according to the rule:
argmin_{f in H} sum_i^N (y_i - f(x_i))^2 + lambda ||f||_H^2
where (y_i - f(x_i))^2 is a loss function that computes how ‘wrong’ the function is at each observation i and ||f||_H^2 is the regularizer that measures the complexity of the function according to the L_2 norm ||f||^2 = int f(x)^2 dx. lambda is the scalar regularization parameter that governs the trade-off between model fit and complexity. By default, lambda is chosen by minimizing the sum of the squared leave-one-out errors, but it can also be specified by the user in the lambda argument to implement other approaches.
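As a rough illustration of the leave-one-out criterion, the base R sketch below evaluates the sum of squared leave-one-out errors for a given lambda and checks a closed-form shortcut against brute-force refitting. It relies on a standard identity for regularized least squares (the leave-one-out error for observation i equals c_i divided by the i-th diagonal element of the inverse of K + lambda I) and on the Gaussian kernel and closed-form solution described further below. The function names, simulated data, and search window are hypothetical choices for this example, and the code is not taken from the package.

```r
# Illustrative sketch only; names here are hypothetical, not part of the KRLS package.
set.seed(1)
N <- 40
X <- matrix(rnorm(N), ncol = 1)
y <- sin(X[, 1]) + rnorm(N, sd = 0.2)

# Gaussian kernel matrix with sigma^2 = D (here D = 1)
K <- exp(-as.matrix(dist(X))^2 / ncol(X))

# Sum of squared leave-one-out errors via the shortcut
# e_i = c_i / [(K + lambda I)^{-1}]_ii
loo_loss <- function(lambda, K, y) {
  Ginv <- solve(K + lambda * diag(nrow(K)))
  coeffs <- Ginv %*% y
  sum((coeffs / diag(Ginv))^2)
}

# Brute-force version: refit without observation i, then predict it
loo_brute <- function(lambda, K, y) {
  errs <- sapply(seq_along(y), function(i) {
    c_minus_i <- solve(K[-i, -i] + lambda * diag(length(y) - 1), y[-i])
    y[i] - sum(K[i, -i] * c_minus_i)
  })
  sum(errs^2)
}

# One-dimensional search over a bounded window (bounds chosen arbitrarily here)
opt <- optimize(loo_loss, interval = c(1e-4, 10), K = K, y = y)
lambda_hat <- opt$minimum
```

The shortcut needs only one matrix inversion per candidate lambda, whereas the brute-force version needs N of them, which is why a criterion of this kind is cheap enough to minimize numerically.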
Under fairly general conditions, the function that minimizes the regularized loss within the hypothesis space established by the choice of a (positive semidefinite) kernel function k(x_i,x_j) is of the form
f(x_j)= sum_i^N c_i k(x_i,x_j)
where the kernel function k(x_i,x_j) measures the distance between two observations x_i and x_j, and c_i is the choice coefficient for each observation i. Let K be the N by N kernel matrix with all pairwise distances K_ij=k(x_i,x_j) and let c be the N by 1 vector of choice coefficients for all observations. Then, in matrix notation, the space is y=Kc.
Accordingly, the krls function solves the following minimization problem
argmin_{c in R^N} (y - Kc)'(y - Kc) + lambda c'Kc
which is convex in c and solved by c=(K + lambda I)^{-1} y, where I is the identity matrix. Note that this linear solution provides a flexible fitted response surface that typically reduces misspecification bias because it can learn a wide range of nonlinear and/or nonadditive functions of the predictors.
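The closed-form solution can be sketched in a few lines of base R. The example below is a minimal illustration under stated assumptions (simulated data, an arbitrary lambda of 0.1, a Gaussian kernel with sigma^2 = D, and no preprocessing other than demeaning y); the variable names are hypothetical and the code is not the package's implementation.

```r
# Illustrative sketch only; not the KRLS package's internal code.
set.seed(2)
N <- 50
X <- cbind(rnorm(N), rnorm(N))
y <- X[, 1]^2 + 0.5 * X[, 2] + rnorm(N, sd = 0.1)
y <- y - mean(y)                           # demean the outcome

K <- exp(-as.matrix(dist(X))^2 / ncol(X))  # Gaussian kernel, sigma^2 = D
lambda <- 0.1                              # arbitrary value for illustration

# Choice coefficients: c = (K + lambda I)^{-1} y
c_hat <- solve(K + lambda * diag(N), y)

# Fitted values: y_hat = K c
fitted_vals <- as.vector(K %*% c_hat)

# Regularized objective (y - Kc)'(y - Kc) + lambda c'Kc
objective <- function(cc) {
  sum((y - K %*% cc)^2) + lambda * as.numeric(t(cc) %*% K %*% cc)
}
objective(c_hat)
```

Because the objective is convex in c, any other coefficient vector should give an objective value at least as large as the one at c_hat.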
If vcov=TRUE is specified, krls also computes the variance-covariance matrix for the choice coefficients c and fitted values y=Kc based on a variance estimator developed in Hainmueller and Hazlett (2014). Note that both matrices are N by N, so this option increases memory use and computing time.
By default, krls uses the Gaussian kernel (whichkernel = "gaussian") given by
k(x_i,x_j)=exp(-||x_i - x_j||^2 / sigma^2)
where ||x_i - x_j|| is the Euclidean distance. The kernel bandwidth sigma^2 is set to D, the number of dimensions, by default, but the user can also specify other values using the sigma argument to implement other approaches.
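A minimal base R sketch of the Gaussian kernel matrix with the default bandwidth sigma^2 = D follows. The helper name gauss_kernel is hypothetical, and the sketch ignores any preprocessing of the predictors that the package may perform before computing the kernel.

```r
# Illustrative helper (hypothetical name); not a function exported by KRLS.
gauss_kernel <- function(X, sigma2 = ncol(X)) {
  X <- as.matrix(X)
  sq_dist <- as.matrix(dist(X))^2   # squared Euclidean distances between rows
  exp(-sq_dist / sigma2)
}

# Three points in D = 2 dimensions, so sigma^2 defaults to 2
X <- rbind(c(0, 0), c(1, 0), c(1, 1))
K <- gauss_kernel(X)
K
# K[1, 2] is exp(-1/2) (squared distance 1), K[1, 3] is exp(-1) (squared distance 2),
# and every diagonal entry is 1 because each point has distance 0 to itself.
```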
If derivative=TRUE is specified, krls also computes the pointwise partial derivatives of the fitted function with respect to each predictor using the estimators developed in Hainmueller and Hazlett (2014). These can be used to examine the marginal effects of each predictor and how the marginal effects vary across the covariate space. Average derivatives are also computed with variances. Note that the derivative=TRUE option results in increased computing time and is only supported for the Gaussian kernel, i.e. when whichkernel = "gaussian". Also, derivative=TRUE requires that vcov=TRUE.
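To make the idea of pointwise partial derivatives concrete, the sketch below differentiates a fitted function of the form f(x) = sum_i c_i k(x_i, x) analytically for the Gaussian kernel, which gives df/dx^(d) = (2 / sigma^2) sum_i c_i k(x_i, x) (x_i^(d) - x^(d)). This follows directly from the kernel formula above; whether the package applies additional rescaling is not covered here. All names, the simulated data, and the value of lambda are assumptions for illustration.

```r
# Illustrative sketch only; not the KRLS package's internal code.
set.seed(3)
N <- 60
X <- cbind(runif(N, -2, 2), runif(N, -2, 2))
y <- sin(X[, 1]) + 0.5 * X[, 2] + rnorm(N, sd = 0.1)

sigma2 <- ncol(X)   # default bandwidth sigma^2 = D
lambda <- 0.1       # arbitrary value for illustration
K <- exp(-as.matrix(dist(X))^2 / sigma2)
c_hat <- solve(K + lambda * diag(N), y)

# Fitted function evaluated at an arbitrary point x0 (vector of length D)
f_hat <- function(x0) {
  k0 <- exp(-colSums((t(X) - x0)^2) / sigma2)
  sum(k0 * c_hat)
}

# N by D matrix of analytic partial derivatives at the observed points
pointwise_derivs <- sapply(seq_len(ncol(X)), function(d) {
  diffs <- outer(X[, d], X[, d], "-")   # diffs[i, j] = x_i^(d) - x_j^(d)
  (2 / sigma2) * colSums(c_hat * K * diffs)
})

# Average derivative of each predictor
avg_derivs <- colMeans(pointwise_derivs)
```

A quick sanity check is to compare one entry of pointwise_derivs with a central finite difference of f_hat at the same data point.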
If binary=TRUE is also specified, the function will identify binary predictors and return first differences for these predictors instead of partial derivatives. First differences are computed going from the minimum to the maximum value of each binary predictor. Note that first differences are more appropriate to summarize the effects for binary predictors (see Hainmueller and Hazlett (2014) for details).
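The first-difference idea can be sketched as follows: predict every observation once with the binary predictor set to its maximum and once with it set to its minimum, then take the difference. The code is a simplified base R illustration on unstandardized data with hypothetical helper names (kern, predict_at) and an arbitrary lambda; the package's own procedure may differ in its details.

```r
# Illustrative sketch only; helper names are hypothetical.
set.seed(4)
N <- 80
x1 <- rnorm(N)
x2 <- rbinom(N, size = 1, prob = 0.3)
X <- cbind(x1, x2)
y <- x1 + 0.5 * x2 + rnorm(N, sd = 0.1)

sigma2 <- ncol(X)
lambda <- 0.1

# Gaussian kernel between the rows of A and the rows of B
kern <- function(A, B) {
  sq <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * tcrossprod(A, B)
  exp(-pmax(sq, 0) / sigma2)   # pmax guards against tiny negative round-off
}

c_hat <- solve(kern(X, X) + lambda * diag(N), y)
predict_at <- function(Xnew) as.vector(kern(Xnew, X) %*% c_hat)

# Flag predictors that take exactly two distinct values
is_binary <- apply(X, 2, function(col) length(unique(col)) == 2)

# First difference for x2: move it from its minimum to its maximum
X_hi <- X_lo <- X
X_hi[, "x2"] <- max(X[, "x2"])
X_lo[, "x2"] <- min(X[, "x2"])
first_diff <- predict_at(X_hi) - predict_at(X_lo)
mean(first_diff)
```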
A few other kernels are also implemented, but derivatives are currently not supported for these: "linear": k(x_i,x_j)=x_i'x_j, "poly1", "poly2", "poly3", "poly4" are polynomial kernels based on k(x_i,x_j)=(x_i'x_j +1)^p where p is the order.
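The linear and polynomial kernel formulas above translate directly into base R. The helper names below are hypothetical and the snippet only illustrates the formulas on a tiny matrix; it does not reproduce any preprocessing done by the package.

```r
# Illustrative helpers (hypothetical names); not functions exported by KRLS.
linear_kernel <- function(X) tcrossprod(X)                # k(x_i, x_j) = x_i'x_j
poly_kernel <- function(X, p) (tcrossprod(X) + 1)^p       # k(x_i, x_j) = (x_i'x_j + 1)^p

X <- rbind(c(1, 2), c(0, 1))
K_lin <- linear_kernel(X)       # entries: 5, 2, 2, 1
K_poly2 <- poly_kernel(X, 2)    # entries: 36, 9, 9, 4
```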
A list object of class krls with the following elements:
K 
N by N matrix of pairwise kernel distances between observations. 
coeffs 
N by 1 vector of choice coefficients c. 
Le 
scalar with sum of squared leave-one-out errors. 
fitted 
N by 1 vector of fitted values. 
X 
original N by D predictor data matrix. 
y 
original N by 1 matrix of values of the outcome variable. 
sigma 
scalar with value of bandwidth, sigma^2, used for the Gaussian kernel. 
lambda 
scalar with value of regularization parameter, lambda, used (user specified or based on leave-one-out cross-validation). 
R2 
scalar with value of R-squared. 
vcov.c 
N by N variance covariance matrix for choice coefficients ( 
vcov.fitted 
N by N variance covariance matrix for fitted values ( 
derivatives 
N by D matrix of pointwise partial derivatives based on the Gaussian kernel ( 
avgderivatives 
1 by D matrix of average derivative based on the Gaussian kernel ( 
var.avgderivatives 
1 by D matrix of variances for average derivative based on gaussian kernel ( 
binaryindicator 
1 by D matrix that indicates for each predictor if it is treated as binary or not (evaluates to FALSE unless 
The function requires the storage of an N by N kernel matrix and can therefore exceed the memory limits for very large datasets.
Setting derivative=FALSE and vcov=FALSE is useful to reduce computing time if pointwise partial derivatives and/or variance-covariance matrices are not needed.
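A back-of-the-envelope estimate of the memory needed for one N by N matrix of double-precision numbers (8 bytes per entry) can help judge feasibility before fitting. The helper below is a hypothetical convenience function, not part of the package, and it ignores any working copies made during estimation.

```r
# Approximate size in GiB of one N by N numeric matrix (8 bytes per entry)
kernel_matrix_gib <- function(N) 8 * N^2 / 1024^3

kernel_matrix_gib(10000)   # roughly 0.75 GiB
kernel_matrix_gib(50000)   # roughly 18.6 GiB

# Cross-check against an actual small matrix
as.numeric(object.size(matrix(0, 1000, 1000)))   # about 8e6 bytes plus a small header
```

With vcov=TRUE, the two additional N by N variance-covariance matrices add to this footprint.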
Jens Hainmueller (Stanford) and Chad Hazlett (MIT)
Jeremy Ferwerda, Jens Hainmueller, Chad J. Hazlett (2017). Kernel-Based Regularized Least Squares in R (KRLS) and Stata (krls). Journal of Statistical Software, 79(3), 1-26. doi:10.18637/jss.v079.i03
Hainmueller, J. and Hazlett, C. (2014). Kernel Regularized Least Squares: Reducing Misspecification Bias with a Flexible and Interpretable Machine Learning Approach. Political Analysis, 22(2)
Rifkin, R. 2002. Everything Old is New Again: A fresh look at historical approaches in machine learning. Thesis, MIT. September, 2002.
Evgeniou, T., Pontil, M., and Poggio, T. (2000). Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1-50.
Schoelkopf, B., Herbrich, R. and Smola, A.J. (2001). A generalized representer theorem. In 14th Annual Conference on Computational Learning Theory, pages 416-426.
Kimeldorf, G.S. and Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33:82-95.
predict.krls for fitted values and predictions. summary.krls for summary of the fit. plot.krls for plots of the fit.
# Linear example
# set up data
N <- 200
x1 <- rnorm(N)
x2 <- rbinom(N,size=1,prob=.2)
y <- x1 + .5*x2 + rnorm(N,0,.15)
X <- cbind(x1,x2)
# fit model
krlsout <- krls(X=X,y=y)
# summarize marginal effects and contribution of each variable
summary(krlsout)
# plot marginal effects and conditional expectation plots
plot(krlsout)
# Nonlinear example
# set up data
N <- 200
x1 <- rnorm(N)
x2 <- rbinom(N,size=1,prob=.2)
y <- x1^3 + .5*x2 + rnorm(N,0,.15)
X <- cbind(x1,x2)
# fit model
krlsout <- krls(X=X,y=y)
# summarize marginal effects and contribution of each variable
summary(krlsout)
# plot marginal effects and conditional expectation plots
plot(krlsout)
## 2D example:
# predictor data
X <- matrix(seq(-3,3,.1))
# true function
Ytrue <- sin(X)
# add noise
Y <- sin(X) + rnorm(length(X),sd=.3)
# approximate function using KRLS
out <- krls(y=Y,X=X)
# get fitted values and ses
fit <- predict(out,newdata=X,se.fit=TRUE)
# results
par(mfrow=c(2,1))
plot(y=Ytrue,x=X,type="l",col="red",ylim=c(-1.2,1.2),lwd=2,main="f(x)")
points(y=fit$fit,X,col="blue",pch=19)
arrows(y1=fit$fit+1.96*fit$se.fit,
y0=fit$fit-1.96*fit$se.fit,
x1=X,x0=X,col="blue",length=0)
legend("bottomright",legend=c("true f(x)=sin(x)","KRLS fitted f(x)"),
lty=c(1,NA),pch=c(NA,19),lwd=c(2,NA),col=c("red","blue"),cex=.8)
plot(y=cos(X),x=X,type="l",col="red",ylim=c(-1.2,1.2),lwd=2,main="df(x)/dx")
points(y=out$derivatives,X,col="blue",pch=19)
legend("bottomright",legend=c("true df(x)/dx=cos(x)","KRLS fitted df(x)/dx"),
lty=c(1,NA),pch=c(NA,19),lwd=c(2,NA),col=c("red","blue"),cex=.8)
## 3D example
# plot true function
par(mfrow=c(1,2))
f <- function(x1,x2){ sin(x1)*cos(x2)}
x1 <- x2 <- seq(0,2*pi,.2)
z <- outer(x1,x2,f)
persp(x1, x2, z,theta=30,main="true f(x1,x2)=sin(x1)cos(x2)")
# approximate function with KRLS
# data and outcomes
X <- cbind(sample(x1,200,replace=TRUE),sample(x2,200,replace=TRUE))
y <- f(X[,1],X[,2])+ runif(nrow(X))
# fit surface
krlsout <- krls(X=X,y=y)
# plot fitted surface
ff <- function(x1i,x2i,krlsout){predict(object=krlsout,newdata=cbind(x1i,x2i))$fit}
z <- outer(x1,x2,ff,krlsout=krlsout)
persp(x1, x2, z,theta=30,main="KRLS fitted f(x1,x2)")
