sgdgmf.cv: Model selection via cross-validation for generalized matrix...

sgdgmf.cv    R Documentation

Model selection via cross-validation for generalized matrix factorization models

Description

K-fold cross-validation for generalized matrix factorization (GMF) models.

Usage

sgdgmf.cv(
  Y,
  X = NULL,
  Z = NULL,
  family = gaussian(),
  ncomps = seq(from = 1, to = 10, by = 1),
  weights = NULL,
  offset = NULL,
  method = c("airwls", "newton", "sgd"),
  sampling = c("block", "coord", "rnd-block"),
  penalty = list(),
  control.init = list(),
  control.alg = list(),
  control.cv = list()
)

Arguments

Y

matrix of responses (n × m)

X

matrix of row fixed effects (n × p)

Z

matrix of column fixed effects (q × m)

family

a glm family (see family for more details)

ncomps

ranks of the latent matrix factorization used in cross-validation (default 1 to 10)

weights

an optional matrix of weights (n × m)

offset

an optional matrix of offset values (n × m) specifying a known component to be included in the linear predictor

method

estimation method to minimize the negative penalized log-likelihood

sampling

sub-sampling strategy to use if method = "sgd"

penalty

list of penalty parameters (see set.penalty for more details)

control.init

list of control parameters for the initialization (see set.control.init for more details)

control.alg

list of control parameters for the optimization (see set.control.alg for more details)

control.cv

list of control parameters for the cross-validation (see set.control.cv for more details)

Details

The latent dimension is selected by minimizing the estimated out-of-sample error, which can be measured in terms of averaged deviance, AIC or BIC calculated on fold-specific test sets. Within each fold, the test set is a fixed proportion of the entries of the response matrix, which are held out from the estimation process. To this end, the test set entries are masked with NA values when training the model. The predicted (i.e. imputed) values of these entries are then used to compute the fold-specific out-of-sample error.
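The masking scheme described above can be illustrated with a minimal base R sketch for a single fold. This is not the internal code of sgdgmf.cv: the 20% hold-out proportion, the object names and the stand-in model (a Poisson GLM with additive row and column effects instead of a GMF model) are all assumptions made for illustration.

```r
# Illustrative sketch only: a stand-in model, not the sgdGMF internals
set.seed(1)
n <- 50; m <- 10

# Simulate a Poisson response matrix with additive row and column effects
row_eff <- rnorm(n, sd = 0.3)
col_eff <- rnorm(m, sd = 0.3)
mu <- exp(1 + outer(row_eff, col_eff, "+"))
Y <- matrix(rpois(n * m, lambda = mu), nrow = n, ncol = m)

# Hold out a fixed proportion of entries (assumed 20%) and mask them with NA
prop_test <- 0.2
test_idx <- sample(n * m, size = round(prop_test * n * m))
Y_train <- Y
Y_train[test_idx] <- NA

# Stand-in model: Poisson GLM with row and column factors;
# glm() drops the NA entries, so only the training entries are used
dat <- data.frame(
  y   = as.vector(Y_train),
  row = factor(as.vector(row(Y))),
  col = factor(as.vector(col(Y)))
)
fit <- glm(y ~ row + col, family = poisson(), data = dat)

# Predict (impute) every entry, including the held-out ones
mu_hat <- matrix(predict(fit, newdata = dat, type = "response"), n, m)

# Out-of-sample error: mean Poisson deviance over the held-out entries
dev_res <- poisson()$dev.resids(Y[test_idx], mu_hat[test_idx],
                                wt = rep(1, length(test_idx)))
oos_dev <- mean(dev_res)
```

Repeating these steps over several folds and candidate ranks would give one error value per combination of fold and latent dimension.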

Value

If refit = FALSE (see set.control.cv), the function returns a list containing control.init, control.alg, control.cv and summary.cv. The latter is a matrix collecting the cross-validation results for each combination of fold and latent dimension.

If refit = TRUE (see set.control.cv), the function returns an object of class sgdgmf, obtained by refitting the model on the whole data matrix using the latent dimension selected via cross-validation. The returned object also contains the summary.cv information along with the other standard output of the sgdgmf.fit function.
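As a rough illustration of how a table of fold-wise results can be turned into a selected latent dimension, the sketch below averages a toy error measure over folds and picks the minimizer. The data frame, its column names (ncomp, fold, dev) and its values are made up for illustration and do not reproduce the actual layout of summary.cv.

```r
# Hypothetical cross-validation summary: one row per (ncomp, fold) pair.
# Column names and values are invented for illustration only.
summary_cv <- data.frame(
  ncomp = rep(1:4, each = 3),
  fold  = rep(1:3, times = 4),
  dev   = c(2.10, 2.05, 2.20,   # ncomp = 1
            1.60, 1.72, 1.65,   # ncomp = 2
            1.41, 1.38, 1.47,   # ncomp = 3
            1.50, 1.44, 1.55)   # ncomp = 4
)

# Average the out-of-sample error over folds for each latent dimension
avg_dev <- tapply(summary_cv$dev, summary_cv$ncomp, mean)

# Select the latent dimension with the smallest average error
best_ncomp <- as.integer(names(avg_dev)[which.min(avg_dev)])
```

With these toy numbers the averaged error is smallest at three latent factors, so best_ncomp is 3.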

Examples

# Load the sgdGMF package
library(sgdGMF)

# Set the data dimensions
n = 100; m = 20; d = 5

# Generate data using Poisson, Binomial and Gamma models
data_pois = sim.gmf.data(n = n, m = m, ncomp = d, family = poisson())
data_bin = sim.gmf.data(n = n, m = m, ncomp = d, family = binomial())
data_gam = sim.gmf.data(n = n, m = m, ncomp = d, family = Gamma(link = "log"), dispersion = 0.25)

# Set RUN = TRUE to run the example; it may take some time. To speed up
# the computation, cross-validation can be run in parallel by passing
# control.cv = list(parallel = TRUE, nthreads = <number_of_workers>)
# as an argument of sgdgmf.cv()
RUN = FALSE
if (RUN) {
  # Select the number of latent factors via cross-validation
  # (a single value of ncomps leaves only one candidate rank)
  gmf_pois = sgdgmf.cv(data_pois$Y, ncomps = 1:10, family = poisson())
  gmf_bin = sgdgmf.cv(data_bin$Y, ncomps = 3, family = binomial())
  gmf_gam = sgdgmf.cv(data_gam$Y, ncomps = 3, family = Gamma(link = "log"))

  # Get the fitted values in the response scale
  mu_hat_pois = fitted(gmf_pois, type = "response")
  mu_hat_bin = fitted(gmf_bin, type = "response")
  mu_hat_gam = fitted(gmf_gam, type = "response")

  # Compare the results
  oldpar = par(no.readonly = TRUE)
  par(mfrow = c(3,3), mar = c(1,1,3,1))
  image(data_pois$Y, axes = FALSE, main = expression(Y[Pois]))
  image(data_pois$mu, axes = FALSE, main = expression(mu[Pois]))
  image(mu_hat_pois, axes = FALSE, main = expression(hat(mu)[Pois]))
  image(data_bin$Y, axes = FALSE, main = expression(Y[Bin]))
  image(data_bin$mu, axes = FALSE, main = expression(mu[Bin]))
  image(mu_hat_bin, axes = FALSE, main = expression(hat(mu)[Bin]))
  image(data_gam$Y, axes = FALSE, main = expression(Y[Gam]))
  image(data_gam$mu, axes = FALSE, main = expression(mu[Gam]))
  image(mu_hat_gam, axes = FALSE, main = expression(hat(mu)[Gam]))
  par(oldpar)
}


sgdGMF documentation built on April 3, 2025, 7:37 p.m.