cv_ssnet: Cross Validation for ssnet Models

View source: R/cv_ssnet.R

cv_ssnet    R Documentation

Cross Validation for ssnet Models

Description

Perform k-fold cross validation for spike-and-slab elastic net models.

Usage

cv_ssnet(
  model,
  alpha = c(0.5, 1),
  s0 = seq(0.01, 0.1, 0.01),
  s1 = c(1, 2.5),
  classify = FALSE,
  classify.rule = 0.5,
  nfolds = 10,
  ncv = 1,
  foldid = NULL,
  fold.seed = NULL,
  x,
  y,
  family,
  offset = NULL,
  epsilon = 1e-04,
  maxit = 50,
  init = NULL,
  group = NULL,
  Warning = FALSE,
  verbose = FALSE,
  opt.algorithm = "LBFGS",
  iar.data = NULL,
  iar.prior = FALSE,
  p.bound = c(0.01, 0.99),
  tau.prior = "none",
  stan_manual = NULL,
  lambda.criteria = "lambda.min",
  output_param_est = FALSE,
  type.multinomial = "grouped"
)

Arguments

model

Specify which model to fit. Options include c("glmnet", "ss", "ss_iar").

alpha

A scalar, or a vector of candidate values, between 0 and 1 determining the compromise between the ridge and lasso penalties. When alpha = 1 the penalty reduces to the lasso, and when alpha = 0 it reduces to ridge. The default is c(0.5, 1).

s0, s1

Vectors of user-selected candidate values for the spike scale and slab scale parameters, respectively. The defaults are s0 = seq(0.01, 0.1, 0.01) and s1 = c(1, 2.5). However, the user should select values informed by the practical context of the analysis.
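As a rough illustration of how the tuning grid grows, the sketch below enumerates every combination of the default alpha, s0, and s1 values. It assumes that each combination is evaluated during cross validation; the object name grid is hypothetical and not part of the package.

```r
# Default candidate values taken from the Usage section.
alpha <- c(0.5, 1)
s0    <- seq(0.01, 0.1, 0.01)
s1    <- c(1, 2.5)

# Assumption: every (alpha, s0, s1) combination is a candidate model.
grid <- expand.grid(alpha = alpha, s0 = s0, s1 = s1)

# Number of candidate models per cross-validation run.
nrow(grid)
```

Narrowing s0 or s1 to values motivated by the application reduces the number of fits accordingly.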

classify

Logical. When TRUE and family = "binomial", applies the classification rule given by the argument classify.rule and outputs accuracy, sensitivity, specificity, positive predictive value (ppv), and negative predictive value (npv).

classify.rule

A value between 0 and 1. For a given predicted value from a logistic regression, if the value is above classify.rule, then the predicted class is 1; otherwise the predicted class is 0. The default is 0.5.
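The rule and the measures listed under classify can be sketched in base R as follows. The predicted probabilities and observed classes are made-up values for illustration, and the variable names are hypothetical; this is not the package's internal code.

```r
# Hypothetical predicted probabilities and observed classes (illustration only).
prob <- c(0.10, 0.40, 0.60, 0.90, 0.55, 0.30)
obs  <- c(0, 0, 1, 1, 0, 1)

# Predicted class is 1 when the predicted value is above the rule, else 0.
classify_rule <- 0.5
pred <- ifelse(prob > classify_rule, 1, 0)

# Cells of the 2 x 2 confusion table.
tp <- sum(pred == 1 & obs == 1)
tn <- sum(pred == 0 & obs == 0)
fp <- sum(pred == 1 & obs == 0)
fn <- sum(pred == 0 & obs == 1)

measures <- c(
  accuracy    = (tp + tn) / length(obs),
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  ppv         = tp / (tp + fp),
  npv         = tn / (tn + fn)
)
measures
```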

nfolds

Numeric value indicating the number of folds to create.

ncv

Numeric value indicating the number of times to perform cross validation.

foldid

An (optional) vector of values between 1 and nfolds identifying the fold for each observation. When supplied, nfolds may be omitted. If ncv > 1, then supply a matrix or data frame where each column contains fold identifiers. If foldid is supplied, it supersedes ncv and nfolds.

fold.seed

A scalar seed value for cross validation; ensures the folds are the same upon re-running the function. Alternatively, use foldid to manually specify folds.
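One way to build fold identifiers by hand, for example to reuse the same folds across several calls, is sketched below. The sample size and object names are assumptions for illustration; the result is a matrix with one column of fold identifiers per cross-validation repetition, as described for foldid when ncv > 1.

```r
n      <- 200  # assumed number of observations (rows of x)
nfolds <- 3
ncv    <- 2

# Fix the seed so that the same folds are produced on every run.
set.seed(1234)

# Each column is a random permutation of roughly balanced fold labels.
foldid <- sapply(
  seq_len(ncv),
  function(i) sample(rep(seq_len(nfolds), length.out = n))
)

dim(foldid)
table(foldid[, 1])
```

The resulting matrix could then be passed as the foldid argument, in which case nfolds and ncv are superseded.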

x

Design, or input, matrix of dimension nobs x nvars; each row is an observation vector. It is recommended that x have user-defined column names for ease of identifying variables. If column names are missing, they are internally assigned as x1, x2, and so forth.

y

Scalar response variable. Quantitative for family = "gaussian", or family = "poisson" (non-negative counts). For family = "gaussian", y is always standardized. For family = "binomial", y should be either a factor with two levels, or a two-column matrix of counts or proportions (the second column is treated as the target class; for a factor, the last level in alphabetical order is the target class). For family="cox", y should be a two-column matrix with columns named 'time' and 'status'. The latter is a binary variable, with '1' indicating death, and '0' indicating right censored. The function Surv() in package survival produces such a matrix. When family = "multinomial", y follows the documentation for glmnet, but it is preferred that y is a factor with two or more levels.

family

Response type (see above).

offset

A vector of length nobs that is included in the linear predictor.

epsilon

A positive convergence tolerance; the iterations converge when |dev - dev_old|/(|dev| + 0.1) < epsilon.
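The stopping rule above can be written as a small helper. The function name converged is hypothetical and the deviance values are made up; this is only a sketch of the stated criterion.

```r
# TRUE when the relative change in deviance falls below the tolerance.
converged <- function(dev, dev_old, epsilon = 1e-04) {
  abs(dev - dev_old) / (abs(dev) + 0.1) < epsilon
}

converged(dev = 100.0001, dev_old = 100)  # tiny change in deviance
converged(dev = 100, dev_old = 101)       # larger change in deviance
```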

maxit

An integer giving the maximal number of EM iterations.

init

A vector of initial values for all coefficients (not for the intercept). If not given, it will be internally produced. If family = "multinomial" and the same initializations are desired for each response/outcome category, then init can be a vector. If different initializations are desired, then init should be a list, each element of which contains a vector of initializations. The list should be named according to the response/outcome categories as they appear in y.

group

A numeric vector, or an integer, or a list indicating the groups of predictors. If group = NULL, all the predictors form a single group. If group = K, the predictors are evenly divided into groups each with K predictors. If group is a numeric vector, it defines groups as follows: Group 1: (group[1]+1):group[2], Group 2: (group[2]+1):group[3], Group 3: (group[3]+1):group[4], ... If group is a list of variable names, group[[k]] includes variables in the k-th group. The mixture double-exponential prior is only used for grouped predictors. For ungrouped predictors, the prior is double-exponential with scale ss[2] and mean 0. Note that grouping predictors when family = "multinomial" is still experimental, so use with caution.
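To make the numeric-vector form concrete, the sketch below expands a set of boundaries into column indices using the rule stated above. The boundary values and the object name groups are hypothetical.

```r
# Hypothetical boundaries: Group 1 is columns 1-3, Group 2 is 4-5, Group 3 is 6-9.
group <- c(0, 3, 5, 9)

# Group k covers (group[k] + 1):group[k + 1].
groups <- lapply(
  seq_len(length(group) - 1),
  function(k) (group[k] + 1):group[k + 1]
)
groups
```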

Warning

Logical. If TRUE, shows messages about non-convergence and identifiability problems.

verbose

Logical. If TRUE, prints out the number of iterations and computational time.

opt.algorithm

One of c("LBFGS", "BFGS", "Newton"). This argument determines which algorithm is used to optimize the term in the EM algorithm that estimates the probabilities of inclusion for each parameter. Optimization is performed by the optimizing() function from rstan.

iar.data

A list of output from mungeCARdata4stan that contains the necessary inputs for the IAR prior. When unspecified, this is built internally assuming that neighbors are those variables directly above, below, left, and right of a given variable location. im.res must be specified when allowing this argument to be built internally. It is not recommended to use this argument directly, even when specifying a more complicated neighborhood structure; this can be specified with the adjmat argument, and then internally converted to the correct format.

iar.prior

Logical. When TRUE, imposes an intrinsic autoregressive (IAR) prior on the logit of the probabilities of inclusion. When FALSE, treats the probabilities of inclusion as unstructured.

p.bound

A vector defining the lower and upper boundaries for the probabilities of inclusion in the model, respectively. Defaults to c(0.01, 0.99).

tau.prior

One of c("none", "manual", "cauchy"). This argument determines the precision parameter in the Conditional Autoregressive model for the (logit of) prior inclusion probabilities. When "none", the precision is set to 1; when "manual", the precision is manually entered by the user; when "cauchy", the inverse precision is assumed to follow a Cauchy distribution with location 0 and scale 2.5. Note that at this stage of development, only the "none" option has been extensively tested, so the other options should be used with caution.

stan_manual

A stan_model that is manually specified. Especially when fitting multiple models in succession, specifying the stan model outside this "loop" may avoid errors.

lambda.criteria

Determines the model selection criterion. When "lambda.min", the final model is selected based on the penalty that minimizes the measure given in type.measure. When "lambda.1se", the final model is selected based on the largest value of lambda whose measure is within one standard error of the minimal measure given in type.measure.
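The two criteria can be illustrated with a small sketch, assuming glmnet's convention, under which lambda.1se is the largest lambda whose cross-validated error lies within one standard error of the minimum. The lambda path, errors, and standard errors below are made-up numbers, and the object names are hypothetical.

```r
# Hypothetical cross-validation results along a decreasing lambda path.
lambda <- c(1.0, 0.5, 0.25, 0.1, 0.05)
cvm    <- c(2.0, 1.5, 1.2, 1.0, 1.1)    # mean cross-validated error
cvsd   <- c(0.3, 0.3, 0.25, 0.25, 0.3)  # standard error of cvm

# lambda.min: the penalty with the smallest mean error.
i_min      <- which.min(cvm)
lambda_min <- lambda[i_min]

# lambda.1se: the largest penalty whose error is within one
# standard error of the minimum (glmnet's convention, assumed here).
lambda_1se <- max(lambda[cvm <= cvm[i_min] + cvsd[i_min]])

c(lambda.min = lambda_min, lambda.1se = lambda_1se)
```

Choosing "lambda.1se" therefore favors a more heavily penalized, sparser model than "lambda.min".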

output_param_est

Logical. When TRUE, adds an element to the output list that includes parameter estimates for the fitted model. The default is FALSE.

type.multinomial

If "grouped", then a grouped lasso penalty is used on the multinomial coefficients for a variable. This ensures they are all in or out together. The default is "grouped".

Value

Either a data frame of model fitness measures or a list whose elements are data frames of model fitness measures and parameter estimates, respectively, depending on the value of output_param_est.

Examples

xtr <- matrix(rnorm(100*5), nrow = 100, ncol = 5)
xte <- matrix(rnorm(100*5), nrow = 100, ncol = 5)
b <- rnorm(5)

## continuous outcome
ytr <- xtr %*% b + rnorm(100)
yte <- xte %*% b + rnorm(100)

## binary outcome
ybtr <- ifelse(ytr > 0, 1, 0)
ybte <- ifelse(yte > 0, 1, 0)

## multinomial outcome
ymtr <- dplyr::case_when(
  ytr > 1 ~ "a",
  ytr <= 1 & ytr > -1 ~ "b",
  ytr <= -1 ~ "c"
)
ymte <- dplyr::case_when(
  yte > 1 ~ "a",
  yte <= 1 & yte > -1 ~ "b",
  yte <= -1 ~ "c"
)

cv_ssnet(
  model = "ss", family = "gaussian",
  x = rbind(xtr, xte), y = c(ytr, yte),
  s0 = c(0.01, 0.05, 0.10), s1 = c(1, 2.5),
  nfolds = 3, ncv = 2
)

## Not run: 
cv_ssnet(
  model = "ss", family = "binomial",
  x = rbind(xtr, xte), y = c(ybtr, ybte),
  s0 = c(0.01, 0.05, 0.10), s1 = c(1, 2.5),
  nfolds = 3, ncv = 2, classify = TRUE,
  output_param_est = TRUE
)

cv_ssnet(
  model = "ss", family = "multinomial",
  x = rbind(xtr, xte), y = c(ymtr, ymte),
  s0 = c(0.01, 0.05, 0.10), s1 = c(1, 2.5),
  nfolds = 3, ncv = 2, classify = FALSE,
  output_param_est = TRUE
)

## End(Not run)


jmleach-bst/ssnet documentation built on March 4, 2024, 5:04 p.m.