compare_ssnet: Fit Several Models and Compare

View source: R/compare_ssnet.R

compare_ssnetR Documentation

Fit Several Models and Compare

Description

Fit glmnet and/or ssnet models and output measures of model fit for each. Allows multiple scale values for spike-and-slab models.

Usage

compare_ssnet(
  models = c("glmnet", "ss", "ss_iar"),
  alpha = c(0, 0.5, 1),
  model_fit = "all",
  variable_selection = FALSE,
  classify = FALSE,
  classify.rule = 0.5,
  type_error = "kcv",
  nfolds = 10,
  ncv = 1,
  foldid = NULL,
  fold.seed = NULL,
  s0 = seq(0.01, 0.1, 0.01),
  s1 = 1,
  B = NULL,
  x,
  y,
  family = "gaussian",
  offset = NULL,
  epsilon = 1e-04,
  maxit = 50,
  init = NULL,
  group = NULL,
  Warning = FALSE,
  verbose = FALSE,
  opt.algorithm = "LBFGS",
  iar.data = NULL,
  iar.prior = FALSE,
  p.bound = c(0.01, 0.99),
  tau.prior = "none",
  stan_manual = NULL,
  plot.pj = FALSE,
  im.res = NULL,
  nlambda = 100,
  type.measure = c("default", "mse", "deviance", "class", "auc", "mae", "C"),
  lambda.criteria = "lambda.min",
  output_param_est = FALSE,
  output_cv = FALSE
)

Arguments

models

A vector that determines which models to fit. Options include c("glmnet", "ss", "ss_iar"). The default is to fit all three models.

alpha

A scalar value between 0 and 1 determining the compromise between the Ridge and Lasso models. When alpha = 1 reduces to the Lasso, and when alpha = 0 reduces to Ridge.

model_fit

A vector containing measures of model fit to output. Options include c("deviance", "mse", "mae") for all models, and when family = "binomial", also c("auc", "misclassification"). When model_fit = "all", then all appropriate measures of model fit are output.

variable_selection

Logical. When TRUE, outputs the false discovery proportion (FDP), family-wise error (FWE), and power for the model. Requires that parameter vector B be specified. Default is FALSE, and is only appropriate for simulated data, when the true and false positives can be known.

classify

Logical. When TRUE and family = "binomial" applies a classification rule given by the argument classify.rule, and outputs accuracy, sensitivity, specificity, positive predictive value (ppv), and negative predictive value (npv).

classify.rule

A value between 0 and 1. For a given predicted value from a logistic regression, if the value is above classify.rule, then the predicted class is 1; otherwise the predicted class is 0. The default is 0.5.

type_error

Determines whether models are selected based on training error ("training") or k-fold cross validated estimates of prediction error ("kcv"). Defaults to "kcv", which is recommended because training error tends to underestimate the generalization error. See, e.g., Ch. 7 in \insertCiteHastie:2009ssnet.

nfolds

number of folds - default is 10. Although nfolds can be as large as the sample size (leave-one-out CV), it is not recommended for large datasets. Smallest value allowable is nfolds=3

ncv

repeated number of cross-validation.

foldid

an optional vector of values between 1 and nfolds identifying what fold each observation is in. If supplied, nfolds can be missing.

fold.seed

A scalar seed value for cross validation; ensures the folds are the same upon re-running the function. Alternatively, use foldid to manually specify folds.

s0, s1

A vector of user-selected possible values for the spike scale and slab scale parameter, respectively. The default is s0 = seq(0.01, 0.1, 0.01) and s1 = 1. However, the user should select values informed by the practical context of the analysis.

B

When variable_selection is TRUE, a vector of "true" parameter values must be input in order to calculate the false discovery proportion (FDR), family-wise error (FWER), and power for the data. This vector should NOT contain a value for the intercept.

x

Design, or input, matrix, of dimension nobs x nvars; each row is an observation vector. It is recommended that x have user-defined column names for ease of identifying variables. If missing, then colnames are internally assigned x1, x2, ... and so forth.

y

Scalar response variable. Quantitative for family = "gaussian", or family = "poisson" (non-negative counts). For family = "gaussian", y is always standardized. For family = "binomial", y should be either a factor with two levels, or a two-column matrix of counts or proportions (the second column is treated as the target class; for a factor, the last level in alphabetical order is the target class). For family="cox", y should be a two-column matrix with columns named 'time' and 'status'. The latter is a binary variable, with '1' indicating death, and '0' indicating right censored. The function Surv() in package survival produces such a matrix. When family = "multinomial", y follows the documentation for glmnet, but it is preferred that y is a factor with two or more levels.

family

Response type (see above).

offset

A vector of length nobs that is included in the linear predictor.

epsilon

A positive convergence tolerance; the iterations converge when |dev - dev_old|/(|dev| + 0.1) < e.

maxit

An integer giving the maximal number of EM iterations.

init

A vector of initial values for all coefficients (not for intercept). If not given, it will be internally produced. If family = "multinomial" and the same initializations are desired for each response/outcome category then init can be a vector. If different initializations are desired, then init should be a list, each element of which contains a vector of initializations. The list should be named according the response/outcome category as they appear in y.

group

A numeric vector, or an integer, or a list indicating the groups of predictors. If group = NULL, all the predictors form a single group. If group = K, the predictors are evenly divided into groups each with K predictors. If group is a numberic vector, it defines groups as follows: Group 1: (group[1]+1):group[2], Group 2: (group[2]+1):group[3], Group 3: (group[3]+1):group[4], ... If group is a list of variable names, group[[k]] includes variables in the k-th group. The mixture double-exponential prior is only used for grouped predictors. For ungrouped predictors, the prior is double-exponential with scale ss[2] and mean 0. Note that grouped predictors when family = "multinomial" is still experimental, so use with caution.

Warning

Logical. If TRUE, shows the error messages of not convergence and identifiability.

verbose

Logical. If TRUE, prints out the number of iterations and computational time.

opt.algorithm

One of c("LBFGS", "BFGS", "Newton"). This argument determines which argument is used to optimize the term in the EM algorithm that estimates the probabilities of inclusion for each parameter. Optimization is performed by optimizing.

iar.data

A list of output from mungeCARdata4stan that contains the necessary inputs for the IAR prior. When unspecified, this is built internally assuming that neighbors are those variables directly above, below, left, and right of a given variable location. im.res must be specified when allowing this argument to be built internally. It is not recommended to use this argument directly, even when specifying a more complicated neighborhood stucture; this can be specified with the adjmat argument, and then internally converted to the correct format.

iar.prior

Logical. When TRUE, imposes intrinsic autoregressive prior on logit of the probabilities of inclusion. When FALSE, treats probabilities of inclusion as unstructured.

p.bound

A vector defining the lower and upper boundaries for the probabilities of inclusion in the model, respectively. Defaults to c(0.01, 0.99).

tau.prior

One of c("none", "manual", "cauchy"). This argument determines the precision parameter in the Conditional Autoregressive model for the (logit of) prior inclusion probabilities. When "none", the precision is set to 1; when "manual", the precision is manually entered by the user; when "cauchy", the inverse precision is assumed to follow a Cauchy distribution with mean 0 and scale 2.5. Note that at this stage of development, only the "none" option has been extensively tested, so the other options should be used with caution.

stan_manual

A stan_model that is manually specified. Especially when fitting multiple models in succession, specifying the stan model outside this "loop" may avoid errors.

plot.pj

When TRUE, prints a series of 2D graphs of the prior probabilities of inclusion at each step of the algorithm. This should NOT be used for 3D data.

im.res

A 2-element vector where the first argument is the number of "rows" and the second argument is the number of "columns" in each subject's "image". Default is NULL.

nlambda

The number of lambda values - default is 100.

type.measure

loss to use for cross-validation. Currently five options, not all available for all models. The default is type.measure="deviance", which uses squared-error for gaussian models (a.k.a type.measure="mse" there), deviance for logistic and poisson regression, and partial-likelihood for the Cox model. type.measure="class" applies to binomial and multinomial logistic regression only, and gives misclassification error. type.measure="auc" is for two-class logistic regression only, and gives area under the ROC curve. type.measure="mse" or type.measure="mae" (mean absolute error) can be used by all models except the "cox"; they measure the deviation from the fitted mean to the response. type.measure="C" is Harrel's concordance measure, only available for cox models.

lambda.criteria

Determines the model selection criteria. When "lambda.min" the final model is selected based on the penalty that minimizes the measure given in type.measure. When "lambda.1se" the final model is selected based on the smallest value of lambda that is within one standard error of the minimal measure given in type.measure.

output_param_est

Logical. When TRUE adds an element to the output list that includes parameter estimates for each model fit. Defaults to FALSE.

output_cv

Logical. When TRUE, generates a list with an element for all output generated by cv.bh3. Default is FALSE. The master list has an element for every combination of model, alpha, and s0. Each element is itself a list, the first element being the model fitness measures, and the second element the observed outcomes, fitted outcomes, linear predictor, and fold identifier. When ncv > 1, the latter three will consist of multiple columns, one for each run of k-fold CV; e.g., the column containing the fitted outcomes for the second CV run is y.fitted_2.

criteria

Specifies the criteria for model selection. Options are "deviance", "mse", "mae" for deviance, mean-square error, and mean absolute error, respectively. When family = "binomial", additional options are "auc" and "misclassification", for area under the ROC curve and the percentage of cases where the difference between the observed and predicted values is greater than 1/2.

Value

When output_param_est = FALSE returns a data frame of model fitness summaries. Otherwise, returns a list whose first element is a dataframe whose rows contain parameter estimates for each model fit, and whose second element is a dataframe of model fitness summaries.

Note

Models fit with 'glmnet' never select the penalty/tuning parameter using the training error; however, when type_error = "training", the measure used to compare 'glmnet' with the other models is based on prediction error estimates from training error. That is, model selection within 'glmnet' is still based on k-fold cross validation, even if comparisons with other models is not.

Examples

library(sim2Dpredictr)
## generate data (no intercept)
set.seed(4799623)

## sample size
n <- 30
## image dims
nr <- 4
nc <- 4

## generate data
cn <- paste0("x", seq_len(nr * nc))
tb <- rbinom(nr * nc, 1, 0.05)
tx <- matrix(rnorm(n * nr * nc), nrow = n, ncol = nr * nc,
             dimnames = list(seq_len(n), cn))
ty <- tx %*% tb + rnorm(n)

## build adjacency matrix
adjmat <- proximity_builder(im.res = c(nr, nc), type = "sparse")

## fit multiple models and compare
compare_ssnet(x = tx, y = ty, family = "gaussian",
              alpha = c(0.5, 1), s0 = c(0.01, 0.05),
              type_error = "kcv", nfolds = 3, im.res = c(4, 4),
              model_fit = "all", variable_selection = TRUE,
              B = tb)


jmleach-bst/ssnet documentation built on March 4, 2024, 5:04 p.m.