krls: Main function of KRLS


View source: R/krls2.R

Description

This is the primary fitting function. By default it uses squared loss (loss="leastsquares"), as one would for a continuous outcome, but it now also implements logistic regression via the loss="logistic" option. It also allows faster computation and larger training sets than prior versions by optionally replacing the kernel matrix with a lower-dimensional approximation via the truncate argument.

The workflow for using KRLS mimics that of lm and similar functions: a krls object of class KRLS2 is fitted in one step and can later be examined using summary(). The krls object contains all the information that may be needed at summary time, including what is required to estimate pointwise partial derivatives, their averages for each covariate, standard errors, and so on. See summary.krls2().
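For instance, a minimal end-to-end sketch of this workflow (the package name KRLS2 and the toy data are illustrative assumptions, not part of this documentation):

library(KRLS2)  # package name assumed from the KRLS2 class; adjust if needed

# Toy data: 100 observations, 2 predictors (illustrative only)
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
y <- X[, 1] + sin(X[, 2]) + rnorm(100, sd = 0.5)

# Fit with squared loss in one step, then examine, as with lm()
fit <- krls(X = X, y = y, loss = "leastsquares")
summary(fit)

# For a 0/1 outcome, use the logistic loss instead
ybin <- as.numeric(y > 0)
fit_logit <- krls(X = X, y = ybin, loss = "logistic")
summary(fit_logit)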

Usage

krls(X, y, w = NULL, loss = "leastsquares", whichkernel = "gaussian",
  b = NULL, bstart = NULL, binterval = c(10^-8, 500 * ncol(X)),
  lambda = NULL, hyperfolds = 5, lambdastart = 10^(-6)/length(y),
  lambdainterval = c(10^-8, 25), L = NULL, U = NULL, tol = NULL,
  truncate = FALSE, epsilon = NULL, lastkeeper = NULL,
  con = list(maxit = 500), returnopt = FALSE, printlevel = 0, warn = 1,
  sigma = NULL, ...)

Arguments

X

N by D numeric data matrix that contains the values of D predictor variables for observations i=1,…,N. The matrix may not contain missing values or constant columns. Note that no intercept is required for the least squares or logistic loss. In the case of least squares, the function operates on demeaned data, and subtracting the mean of y is equivalent to including an (unpenalized) intercept in the model. In the case of logistic loss, an unpenalized intercept in the linear component of the model is estimated automatically.

y

N by 1 data numeric matrix or vector that contains the values of the response variable for all observations. This vector may not contain missing values, and in the case of logistic loss should be a vector of 0s and 1s.

w

N by 1 numeric data matrix or vector that contains the weights to be applied to each observation. These need not sum to one.

loss

String vector that specifies the loss function. For KRLS, use leastsquares and for KRLogit, use logistic.

whichkernel

String vector that specifies which kernel should be used. Must be one of gaussian, linear, poly1, poly2, poly3, or poly4 (see details). Default is gaussian.

b

A positive scalar (formerly sigma) that specifies the bandwidth of the Gaussian kernel (see gausskernel for details). By default, the bandwidth is set equal to 2D (twice the number of dimensions), which typically yields a reasonable scaling of the distances between observations in the standardized data used for the fitting. You can also pass a numeric vector to do a grid search over possible b values, as sketched below.
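As a sketch (reusing the toy X and y from the example in the Description; the grid values are illustrative assumptions):

# Grid search over candidate bandwidths around the 2*D default
fit_b <- krls(X = X, y = y, b = c(1, 2, 4, 8) * ncol(X))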

bstart

A positive scalar that is the starting value for a numerical estimation of the b parameter using cross validation error, overriding the default. If b is specified as an argument, bstart is ignored.

binterval

A numeric vector of length two that specifies the minimum and maximum b values to search over with optimize when optimizing only the b hyperparameter. Both values must be strictly positive. Only used with logistic loss and when b is NULL. Defaults to c(10^-8, 500*p). This is for use with numerical optimization; to do a grid search instead, pass a numeric vector to the b argument.

lambda

A positive scalar that specifies the lambda parameter for the regularizer (see details). It governs the tradeoff between model fit and complexity. By default, this parameter is chosen by minimizing the sum of the squared leave-one-out errors for KRLS and by minimizing the cross-validated negative log-likelihood for KRLogit, with the number of folds set by hyperfolds. When using logistic loss, lambda can also be a numeric vector of positive scalars, in which case a search over these values is used to choose lambda, as sketched below.
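For example, a hedged sketch of a search over a grid of lambda values with logistic loss (ybin as in the earlier example; the grid is an illustrative assumption):

# Candidate lambdas; krls picks the value minimizing cross-validated loss
fit_l <- krls(X = X, y = ybin, loss = "logistic",
              lambda = 10^seq(-4, 0, by = 1))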

hyperfolds

A positive scalar that sets the number of folds used in selecting lambda or b via cross-validation.

lambdastart

A positive scalar that specifies the starting value for a numerical optimization of lambda. Only used when jointly optimizing over lambda and b.

lambdainterval

A numeric vector of length two that specifies the minimum and maximum lambda values to search over with optimize. Both values must be strictly positive. Only used with logistic loss and when lambda is NULL. Defaults to c(10^-8, 25). This is for use with numerical optimization; to do a grid search instead, pass a numeric vector to the lambda argument.

L

Non-negative scalar that determines the lower bound of the search window for the leave-one-out optimization to find lambda with least squares loss. Default is NULL which means that the lower bound is found by using an algorithm outlined in lambdaline. Ignored with logistic loss.

U

Positive scalar that determines the upper bound of the search window for the leave-one-out optimization to find lambda with least squares loss. Default is NULL which means that the upper bound is found by using an algorithm outlined in lambdaline. Ignored with logistic loss.

tol

Positive scalar that determines the tolerance used in the optimization routine used to find lambda with least squares loss. Default is NULL, which means that convergence is achieved when the difference in the sum of squared leave-one-out errors between iterations i and i+1 is less than N * 10^-3. Ignored with logistic loss.

truncate

A boolean that defaults to FALSE. If TRUE, truncates the kernel matrix, keeping as many eigenvectors as needed so that 1-epsilon of the total variance in the kernel matrix is retained. Alternatively, you can simply specify epsilon and truncation will be enabled automatically.

epsilon

Scalar between 0 and 1 that determines the total variance that can be lost in truncation. If not NULL, truncation is automatically set to TRUE. If truncate == TRUE, default is 0.001.
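A sketch of both ways to request truncation (values illustrative; X and y as in the earlier example):

# The flag alone uses the 0.001 default for epsilon; an explicit
# epsilon switches truncation on automatically
fit_t1 <- krls(X = X, y = y, truncate = TRUE)
fit_t2 <- krls(X = X, y = y, epsilon = 0.01)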

lastkeeper

Number of columns of U to keep when truncate == TRUE. Overrides epsilon.

con

A list of control arguments passed to optim for the numerical optimization of the kernel regularized logistic loss function.

returnopt

A boolean that defaults to FALSE. If TRUE, returns the result of the optim method called to optimize the kernel regularized logistic loss function. Returns NULL with leastsquares loss.

printlevel

A number that is either 0 (default), 1, or 2. 0 prints minimal output, 1 prints most diagnostics, and 2 additionally prints optim diagnostics for each fold in the cross-validation selection of hyperparameters.

warn

A number that sets your warn option. We default to 1 so that warnings print as they occur. You can change this to 2 if you want all warnings to be errors, to 0 if you want all warnings to wait until the top-level call is finished, or to a negative number to ignore them.

sigma

DEPRECATED. Users should now use b; sigma is included only for backwards compatibility.

Details

krls implements the Kernel-based Regularized Least Squares (KRLS) estimator as described in Hainmueller and Hazlett (2014); please consult this reference for details. KRLS arises as a Tikhonov minimization problem with a squared loss. Assume we have data of the form (y_i, x_i), where i indexes observations, y_i in R is the outcome, and x_i in R^D is a D-dimensional vector of predictor values. Then KRLS searches over a space of functions H and chooses the best-fitting function f according to the rule:

argmin_{f in H} sum_i^N (y_i - f(x_i))^2 + lambda || f ||_H^2

where (y_i - f(x_i))^2 is a loss function that computes how ‘wrong’ the function is at each observation i and || f ||_H^2 is the regularizer that measures the complexity of the function according to the L_2 norm ||f||^2 = int f(x)^2 dx. lambda is the scalar regularization parameter that governs the tradeoff between model fit and complexity. By default, lambda is chosen by minimizing the sum of the squared leave-one-out errors, but it can also be specified by the user in the lambda argument to implement other approaches.

Under fairly general conditions, the function that minimizes the regularized loss within the hypothesis space established by the choice of a (positive semidefinite) kernel function k(x_i,x_j) is of the form

f(x_j)= sum_i^N c_i k(x_i,x_j)

where the kernel function k(x_i,x_j) measures the distance between two observations x_i and x_j and c_i is the choice coefficient for each observation i. Let K be the N by N kernel matrix with all pairwise distances K_ij=k(x_i,x_j) and let c be the N by 1 vector of choice coefficients for all observations; then, in matrix notation, the fitted values are y=Kc.

Accordingly, the krls function solves the following minimization problem

argmin_{c} (y - Kc)'(y - Kc) + lambda c'Kc

which is convex in c and solved by c=(K + lambda I)^-1 y, where I is the identity matrix. Note that this solution provides a flexible fitted response surface that typically reduces misspecification bias because it can learn a wide range of nonlinear and/or nonadditive functions of the predictors. In an extension, Hazlett and Sonnet consider a logistic loss function, details of which are forthcoming.
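The default leave-one-out selection of lambda does not require refitting N models: for kernel ridge regression there is a well-known shortcut, e_i = c_i / G^-1_ii with G = K + lambda I. A sketch of this identity for a precomputed kernel matrix K (this illustrates the standard shortcut, not necessarily the package's exact implementation):

# Sum of squared leave-one-out errors for a given lambda
looe <- function(K, y, lambda) {
  Ginv  <- solve(K + lambda * diag(nrow(K)))
  c_hat <- Ginv %*% y           # choice coefficients c
  sum((c_hat / diag(Ginv))^2)   # e_i = c_i / Ginv_ii
}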

If vcov=TRUE is specified, krls also computes the variance-covariance matrix for the choice coefficients c and fitted values y=Kc based on a variance estimator developed in Hainmueller and Hazlett (2014). Note that both matrices are N by N and therefore this results in increased memory and computing time.

By default, krls uses the Gaussian kernel (whichkernel = "gaussian") given by

k(x_i,x_j)=exp(-|| x_i - x_j ||^2 / sigma^2)

where ||x_i - x_j|| is the Euclidean distance. The kernel bandwidth sigma^2 corresponds to the b argument and defaults to 2D, twice the number of dimensions, but the user can specify other values via b (the sigma argument is deprecated) to implement other approaches.
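Putting the pieces together, a self-contained sketch of the closed-form fit with this kernel (bandwidth and lambda fixed for illustration; krls standardizes the data and selects lambda automatically):

# Closed-form KRLS fit by hand, on the toy X and y from above
Xs <- scale(X)                                       # standardize predictors
K  <- exp(-as.matrix(dist(Xs))^2 / (2 * ncol(Xs)))   # Gaussian kernel, b = 2D
lambda <- 0.5                                        # fixed here for illustration
yd     <- y - mean(y)                                # demeaning = unpenalized intercept
c_hat  <- solve(K + lambda * diag(nrow(K)), yd)
fitted <- as.vector(K %*% c_hat) + mean(y)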

If binary=TRUE is also specified, the function will identify binary predictors and return first differences for these predictors instead of partial derivatives. First differences are computed going from the minimum to the maximum value of each binary predictor. Note that first differences are more appropriate to summarize the effects for binary predictors (see Hainmueller and Hazlett (2014) for details).

A few other kernels are also implemented, but derivatives are currently not supported for these: "linear": k(x_i,x_j)=x_i'x_j; "poly1", "poly2", "poly3", and "poly4" are polynomial kernels based on k(x_i,x_j)=(x_i'x_j + 1)^p, where p is the order.
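In matrix form these kernels are simple to write down; for instance (illustrative, using the standardized Xs from the sketch above):

# Linear and second-order polynomial kernels
K_lin   <- tcrossprod(Xs)          # k(x_i, x_j) = x_i'x_j
K_poly2 <- (tcrossprod(Xs) + 1)^2  # "poly2": (x_i'x_j + 1)^2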

