xvalidate: Implementing Cross Validation

Description

This is the internal function called by mlfitppml_int to perform cross-validation when that option is enabled. It is also available on a stand-alone basis, but users will generally be better served by the wrapper mlfitppml.

Usage

xvalidate(
  y,
  x,
  fes,
  IDs,
  testID = NULL,
  tol = 1e-08,
  hdfetol = 1e-04,
  colcheck_x = TRUE,
  colcheck_x_fes = TRUE,
  init_mu = NULL,
  init_x = NULL,
  init_z = NULL,
  verbose = FALSE,
  cluster = NULL,
  penalty = "lasso",
  method = "placeholder",
  standardize = TRUE,
  penweights = rep(1, ncol(x_reg)),
  lambda = 0
)

Arguments

y

Dependent variable (a vector).

x

Regressor matrix.

fes

List of fixed effects.

IDs

A vector of fold IDs for k-fold cross validation. If left unspecified, each observation is assigned to a different fold (warning: this is likely to be very resource-intensive).
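As a minimal sketch of what such a vector can look like, the snippet below builds a balanced k-fold assignment and a leave-one-out assignment in base R. The values of n and k and the object names are illustrative assumptions, not package defaults.

```r
# Illustrative only: 'n' and 'k' are made-up values, not package defaults.
set.seed(1)
n <- 20   # number of observations
k <- 5    # number of folds

# Balanced k-fold assignment: fold labels 1..k are recycled to length n
# and then shuffled, so each fold gets n / k observations (up to rounding).
fold_ids <- sample(rep_len(seq_len(k), n))

# Leave-one-out assignment: every observation is its own fold,
# which implies n separate model fits.
loo_ids <- seq_len(n)

table(fold_ids)
```

A vector like fold_ids would then be passed as the IDs argument; the leave-one-out variant shows why one fold per observation is so costly.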

testID

Optional. A number indicating which ID to hold out during cross-validation. If left unspecified, the function cycles through all IDs and reports the average RMSE.

tol

Tolerance parameter for convergence of the IRLS algorithm.

hdfetol

Tolerance parameter for the within-transformation step, passed on to collapse::fhdwithin.

colcheck_x

Logical. If TRUE, this checks collinearity between the independent variables and drops the collinear variables.
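The documentation does not say how the collinearity check is implemented. As a rough illustration of the idea only, one common approach is to use the rank-revealing QR decomposition in base R; drop_collinear below is a hypothetical helper, not a penppml function.

```r
# Hypothetical helper, not part of penppml: keep only the columns that
# base R's pivoted QR decomposition identifies as linearly independent.
drop_collinear <- function(x, tol = 1e-7) {
  qr_x <- qr(x, tol = tol)
  keep <- sort(qr_x$pivot[seq_len(qr_x$rank)])
  x[, keep, drop = FALSE]
}

set.seed(1)
a <- rnorm(10)
b <- rnorm(10)
x_toy <- cbind(a = a, b = b, c = a + b)   # 'c' is an exact linear combination

x_clean <- drop_collinear(x_toy)
colnames(x_clean)
```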

colcheck_x_fes

Logical. If TRUE, this checks whether the independent variables are perfectly explained by the fixed effects and drops those that are.
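To illustrate what "perfectly explained by the fixed effects" means, the sketch below handles the simplest case of a single fixed effect, where the within-transformation reduces to subtracting group means. The data, the 1e-10 threshold and all object names are assumptions for illustration; the package's own check may differ.

```r
# Toy data with a single fixed effect (three groups of four observations).
set.seed(7)
fe <- factor(rep(c("a", "b", "c"), each = 4))
x_toy <- cbind(
  varies   = rnorm(12),                  # varies within groups
  absorbed = c(1, 5, 9)[as.integer(fe)]  # constant within each group
)

# Within-transformation for one fixed effect: subtract group means.
x_within <- apply(x_toy, 2, function(v) v - ave(v, fe))

# A regressor is perfectly explained if nothing is left after demeaning.
is_absorbed <- colSums(x_within^2) < 1e-10
x_kept <- x_toy[, !is_absorbed, drop = FALSE]
colnames(x_kept)
```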

init_mu

Optional: initial values of the conditional mean \mu, to be used as weights in the first iteration of the algorithm.

init_x

Optional: initial values of the independent variables.

init_z

Optional: initial values of the transformed dependent variable, to be used in the first iteration of the algorithm.

verbose

Logical. If TRUE, it prints information to the screen while evaluating.

cluster

Optional: a vector classifying observations into clusters (to use when calculating SEs).

penalty

A string indicating the penalty type. Currently supported: "lasso" and "ridge".

method

The user can set this equal to "plugin" to perform the plugin algorithm with coefficient-specific penalty weights (see details). Otherwise, a single global penalty is used.

standardize

Logical. If TRUE, x variables are standardized before estimation.

penweights

Optional: a vector of coefficient-specific penalties to use in plugin lasso when method == "plugin".

lambda

Penalty parameter, to be passed on to penhdfeppml_int or penhdfeppml_cluster_int.

Details

xvalidate carries out cross-validation with the user-provided IDs by holding out each of them in turn, as in the k-fold procedure (unless testID is specified, in which case only that ID is used for validation). After filtering out the holdout sample, the function calls penhdfeppml_int or penhdfeppml_cluster_int to estimate the coefficients, predicts the conditional means for the held-out observations and, finally, calculates the root mean squared error (RMSE).
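The hold-out, fit, predict and score steps can be sketched in base R as follows. This is not the package's implementation: an unpenalized glm() Poisson fit on simulated data stands in for the penalized estimators, fixed effects are omitted, and holdout_rmse is a hypothetical helper. Averaging the fold RMSEs with a simple mean is also an assumption.

```r
# Stand-in setup: simulated Poisson data and an unpenalized glm() fit.
# xvalidate itself relies on penalized PPML with high-dimensional fixed effects.
set.seed(42)
n <- 200
x_sim <- cbind(x1 = rnorm(n), x2 = rnorm(n))
y_sim <- rpois(n, lambda = exp(0.5 + 0.3 * x_sim[, "x1"]))
dat <- data.frame(y = y_sim, x_sim)
ids <- sample(rep_len(1:5, n))   # fold IDs, playing the role of the IDs argument

# Hypothetical helper mirroring the described steps for one held-out fold.
holdout_rmse <- function(test_id) {
  train <- dat[ids != test_id, ]   # filter out the holdout sample
  test  <- dat[ids == test_id, ]
  fit <- glm(y ~ x1 + x2, family = poisson(), data = train)
  mu  <- predict(fit, newdata = test, type = "response")  # conditional means
  sqrt(mean((test$y - mu)^2))      # RMSE on the held-out observations
}

# Cycle through all folds (the behaviour described when testID is NULL).
fold_rmse <- vapply(sort(unique(ids)), holdout_rmse, numeric(1))
mean(fold_rmse)
```

Calling holdout_rmse(3) on its own corresponds to the case where a single testID is supplied.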

Value

A list with two elements:

  • rmse: root mean squared error (RMSE).

  • mu: conditional means.
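For reference, the RMSE of a held-out fold can be written as below. The notation is introduced here for illustration only: F_k is the set of observations in fold k, n_k its size, and \hat{\mu}_i the predicted conditional mean. The second line assumes that the reported average is a simple mean over the K folds, which the documentation does not state explicitly.

```latex
\mathrm{RMSE}_k = \sqrt{\frac{1}{n_k} \sum_{i \in F_k} \left( y_i - \hat{\mu}_i \right)^2},
\qquad
\overline{\mathrm{RMSE}} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{RMSE}_k
```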

References

Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.

Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", Stata Journal, 20, 90-115.

Gaure, S. (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.

Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.

Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.

Examples

# First, we need to transform the data. Start by filtering the data set to keep only countries in
# the Americas:
americas <- countries$iso[countries$region == "Americas"]
trade <- trade[(trade$imp %in% americas) & (trade$exp %in% americas), ]
# Now generate the needed x, y and fes objects:
y <- trade$export
x <- data.matrix(trade[, -1:-6])
fes <- list(exp_time = interaction(trade$exp, trade$time),
            imp_time = interaction(trade$imp, trade$time),
            pair     = interaction(trade$exp, trade$imp))
# We also need to create the IDs. We split the data set by agreement, not observation:
id <- unique(trade[, 5])
nfolds <- 10
unique_ids <- data.frame(id = id, fold = sample(1:nfolds, size = length(id), replace = TRUE))
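# match() keeps the folds aligned with the rows of trade (merge() may reorder rows):
cross_ids <- data.frame(id = trade[, 5],
                        fold = unique_ids$fold[match(trade[, 5], unique_ids$id)])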
# Finally, we try xvalidate with a lasso penalty (the default) and a single lambda value:
## Not run: 
reg <- xvalidate(y = y, x = x, fes = fes, lambda = 0.001,
                 IDs = cross_ids$fold, verbose = TRUE)
## End(Not run)


penppml documentation built on Sept. 8, 2023, 5:58 p.m.