xvalidate: Implementing Cross Validation
In tomzylkin/penppml: Penalized Poisson Pseudo Maximum Likelihood Regression

Description

This is the internal function called by mlfitppml_int to perform cross-validation when that option is enabled. It is also available on a stand-alone basis, but most users will be better served by the wrapper mlfitppml.

Usage

xvalidate(
  y,
  x,
  fes,
  IDs,
  testID = NULL,
  tol = 1e-08,
  hdfetol = 1e-04,
  colcheck_x = TRUE,
  colcheck_x_fes = TRUE,
  init_mu = NULL,
  init_x = NULL,
  init_z = NULL,
  verbose = FALSE,
  cluster = NULL,
  penalty = "lasso",
  method = "placeholder",
  standardize = TRUE,
  penweights = rep(1, ncol(x_reg)),
  lambda = 0
)

Arguments

`y`	Dependent variable (a vector).
`x`	Regressor matrix.
`fes`	List of fixed effects.
`IDs`	A vector of fold IDs for k-fold cross validation. If left unspecified, each observation is assigned to a different fold (warning: this is likely to be very resource-intensive).
`testID`	Optional. A number indicating which ID to hold out during cross-validation. If left unspecified, the function cycles through all IDs and reports the average RMSE.
`tol`	Tolerance parameter for convergence of the IRLS algorithm.
`hdfetol`	Tolerance parameter for the within-transformation step, passed on to `collapse::fhdwithin`.
`colcheck_x`	Logical. If `TRUE`, this checks collinearity between the independent variables and drops the collinear variables.
`colcheck_x_fes`	Logical. If `TRUE`, this checks whether the independent variables are perfectly explained by the fixed effects and drops those that are.
`init_mu`	Optional: initial values of the conditional mean `\mu`, to be used as weights in the first iteration of the algorithm.
`init_x`	Optional: initial values of the independent variables.
`init_z`	Optional: initial values of the transformed dependent variable, to be used in the first iteration of the algorithm.
`verbose`	Logical. If `TRUE`, it prints information to the screen while evaluating.
`cluster`	Optional: a vector classifying observations into clusters (to use when calculating SEs).
`penalty`	A string indicating the penalty type. Currently supported: "lasso" and "ridge".
`method`	The user can set this equal to "plugin" to perform the plugin algorithm with coefficient-specific penalty weights (see details). Otherwise, a single global penalty is used.
`standardize`	Logical. If `TRUE`, x variables are standardized before estimation.
`penweights`	Optional: a vector of coefficient-specific penalties to use in plugin lasso when `method == "plugin"`.
`lambda`	Penalty parameter, to be passed on to penhdfeppml_int or penhdfeppml_cluster_int.

Details

xvalidate carries out cross-validation with the user-provided IDs by holding out each one of them in turn, as in the k-fold procedure (unless testID is specified, in which case only that ID is held out for validation). After filtering out the holdout sample, the function calls penhdfeppml_int or penhdfeppml_cluster_int to estimate the coefficients, predicts the conditional means for the held-out observations, and finally calculates the root mean squared error (RMSE).
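
The hold-out loop described above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: an unpenalized Poisson `glm()` without fixed effects stands in for penhdfeppml_int, the data are simulated, and `kfold_rmse` is a hypothetical helper name. Averaging the fold-level RMSEs follows the description of testID; the exact aggregation used by xvalidate is an assumption here.

```r
# Sketch only: glm() with a Poisson family is a stand-in for the penalized
# HDFE estimator, and all data below are simulated.
set.seed(1)
n   <- 200
x   <- cbind(x1 = rnorm(n), x2 = rnorm(n))
y   <- rpois(n, lambda = exp(0.5 + 0.3 * x[, "x1"]))
IDs <- sample(rep(1:5, length.out = n))   # one fold ID per observation

# Hypothetical helper mirroring the steps in Details
kfold_rmse <- function(y, x, IDs, testID = NULL) {
  dat   <- data.frame(y = y, x)
  # Cycle through every ID unless a single testID is requested
  folds <- if (is.null(testID)) sort(unique(IDs)) else testID
  mu    <- rep(NA_real_, length(y))       # out-of-sample conditional means
  rmse  <- numeric(length(folds))
  for (k in seq_along(folds)) {
    holdout <- IDs == folds[k]
    # Estimate on all observations outside the held-out fold
    fit <- glm(y ~ ., family = poisson(), data = dat[!holdout, ])
    # Predict conditional means for the held-out observations
    mu[holdout] <- predict(fit, newdata = dat[holdout, ], type = "response")
    # RMSE on the held-out fold
    rmse[k] <- sqrt(mean((y[holdout] - mu[holdout])^2))
  }
  list(rmse = mean(rmse), mu = mu)
}

cv       <- kfold_rmse(y, x, IDs)              # cycle through all folds
one_fold <- kfold_rmse(y, x, IDs, testID = 1)  # hold out fold 1 only
```

With testID set, only the observations in that fold receive a predicted mean in this sketch; the remaining entries of `mu` stay `NA`.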

Value

A list with two elements:

rmse: root mean squared error (RMSE).
mu: conditional means.
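
For reference, the textbook definition of the RMSE over a set of held-out observations is given below. The symbols H (the held-out set) and \hat{\mu}_i (the predicted conditional mean for observation i) are introduced here for illustration; how xvalidate aggregates across folds beyond what the testID description states is not specified on this page.

```latex
\mathrm{RMSE}(H) = \sqrt{\frac{1}{|H|} \sum_{i \in H} \left( y_i - \hat{\mu}_i \right)^2}
```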

References

Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin (2021). "Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements", Policy Research Working Paper; No. 9629. World Bank, Washington, DC.

Correia, S., P. Guimaraes and T. Zylkin (2020). "Fast Poisson estimation with high dimensional fixed effects", Stata Journal, 20, 90-115.

Gaure, S. (2013). "OLS with multiple high dimensional category variables", Computational Statistics & Data Analysis, 66, 8-18.

Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent", Journal of Statistical Software, 33, 1-22.

Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016). "Inference in high dimensional panel models with an application to gun control", Journal of Business & Economic Statistics, 34, 590-605.

Examples

# First, we need to transform the data. Start by filtering the data set to keep only countries in
# the Americas:
americas <- countries$iso[countries$region == "Americas"]
trade <- trade[(trade$imp %in% americas) & (trade$exp %in% americas), ]
# Now generate the needed x, y and fes objects:
y <- trade$export
x <- data.matrix(trade[, -1:-6])
fes <- list(exp_time = interaction(trade$exp, trade$time),
            imp_time = interaction(trade$imp, trade$time),
            pair     = interaction(trade$exp, trade$imp))
# We also need to create the IDs. We split the data set by agreement, not observation:
id <- unique(trade[, 5])
nfolds <- 10
unique_ids <- data.frame(id = id, fold = sample(1:nfolds, size = length(id), replace = TRUE))
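# Look up each row's fold by agreement ID. match() preserves the row order of trade,
# so the folds line up with y and x:
cross_ids <- data.frame(id = trade[, 5], fold = unique_ids$fold[match(trade[, 5], unique_ids$id)])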
# Finally, we try xvalidate with a lasso penalty (the default) and a single lambda value:
## Not run:
reg <- xvalidate(y = y, x = x, fes = fes, lambda = 0.001,
                 IDs = cross_ids$fold, verbose = TRUE)
## End(Not run)

tomzylkin/penppml documentation built on Feb. 13, 2025, 9:10 p.m.