cssPredict: Wrapper function to generate predictions from cluster...

cssPredict R Documentation

Wrapper function to generate predictions from cluster stability selected model in one step

Description

Select clusters using cluster stability selection, form cluster representatives, fit a linear model, and generate predictions from a matrix of unlabeled data. This is a wrapper function for css and getCssPreds. Using cssPredict is simpler, but it has fewer options, and it executes the full (computationally expensive) subsampling procedure every time it is called. In contrast, css can be called just once, and then getCssPreds can quickly return results for different matrices of new data or for different values of cutoff, max_num_clusts, etc. by reusing the calculations done in that one call to css.

Usage

cssPredict(
  X_train,
  y_train,
  X_predict,
  clusters = list(),
  lambda = NA,
  cutoff = NA,
  max_num_clusts = NA,
  train_inds = NA
)

Arguments

X_train

An n x p numeric matrix (preferably) or a data.frame (which will be coerced internally to a matrix by the function model.matrix) containing the p >= 2 features/predictors. The data from X_train and y_train will be split into two parts; half of the data will be used for feature selection by cluster stability selection, and half will be used for estimating a linear model on the selected cluster representatives.

y_train

A length-n numeric vector containing the responses; y_train[i] is the response corresponding to observation X_train[i, ].

X_predict

A numeric matrix (preferably) or a data.frame (which will be coerced internally to a matrix by the function model.matrix) containing the data that will be used to generate predictions. Must contain the same features (in the same number of columns) as X_train, and if the columns of X_predict are named, they must match the names of X_train.
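The compatibility requirements on X_predict can be checked before calling cssPredict. The following is a minimal base R sketch, not part of cssr; the matrices and column names are made up for illustration.

```r
# Hypothetical training and prediction matrices with matching, named columns.
set.seed(1)
X_train <- matrix(rnorm(30), nrow = 10,
                  dimnames = list(NULL, c("a", "b", "c")))
X_predict <- matrix(rnorm(12), nrow = 4,
                    dimnames = list(NULL, c("a", "b", "c")))

# Same number of feature columns as the training data.
same_width <- ncol(X_predict) == ncol(X_train)

# If X_predict has column names, they must match those of X_train.
names_match <- is.null(colnames(X_predict)) ||
  identical(colnames(X_predict), colnames(X_train))

stopifnot(same_width, names_match)
```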

clusters

Optional; either an integer vector or a list of integer vectors; each vector should contain the indices of a cluster of features (a subset of 1:p). (If there is only one cluster, clusters can either be a list of length 1 or an integer vector.) All of the provided clusters must be non-overlapping. Every feature not appearing in any cluster will be assumed to be unclustered (that is, it will be treated as if it is in a "cluster" containing only itself). If clusters is a list of length 0 (or a list containing only clusters of length 1), then css() returns the same results as stability selection (so feat_sel_mat will be identical to clus_sel_mat). Names for the clusters will be needed later; any clusters that are not given names in the list clusters will be given names automatically by css. Default is list() (so no clusters are specified, and every feature is assumed to be in a "cluster" containing only itself).
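As an illustration of the format described above, the sketch below builds a named list of clusters for a hypothetical p = 8 features and checks the stated constraints in base R. The cluster names and indices are assumptions chosen for the example.

```r
p <- 8

# Two named, non-overlapping clusters; features 4, 7 and 8 are left unclustered.
clusters <- list(group_a = c(1L, 2L, 3L), group_b = c(5L, 6L))

all_idx <- unlist(clusters, use.names = FALSE)

# Every index must be a valid feature index in 1:p.
in_range <- all(all_idx >= 1 & all_idx <= p)

# No feature may appear in more than one cluster.
non_overlapping <- anyDuplicated(all_idx) == 0

# Features not listed are treated as singleton "clusters".
unclustered <- setdiff(seq_len(p), all_idx)

stopifnot(in_range, non_overlapping)
```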

lambda

Optional; the tuning parameter to be used by the lasso for feature selection in each subsample. If lambda is not provided, cssPredict will choose one automatically by cross-validation. Default is NA.

cutoff

Numeric; getCssPreds will use only those clusters whose selection proportions are at least cutoff. Must be between 0 and 1. Default is 0 (in which case all clusters are used, or at most max_num_clusts clusters if max_num_clusts is specified).

max_num_clusts

Integer or numeric; the maximum number of clusters to use regardless of cutoff. (That is, if the chosen cutoff returns more than max_num_clusts clusters, the cutoff will be increased until at most max_num_clusts clusters are selected.) Default is NA (in which case max_num_clusts is ignored).
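The interaction between cutoff and max_num_clusts can be sketched in base R. This is only an illustration of the rule as read here (keep clusters at or above the threshold, and tighten the threshold when the cap is exceeded), not the package's implementation; pick_clusters and the selection proportions are hypothetical.

```r
# Hypothetical selection proportions for five clusters.
sel_props <- c(c1 = 0.92, c2 = 0.75, c3 = 0.60, c4 = 0.31, c5 = 0.05)

# Hypothetical helper: returns the names of the clusters that would be kept.
pick_clusters <- function(sel_props, cutoff = 0, max_num_clusts = NA) {
  if (!is.na(max_num_clusts)) {
    # Candidate thresholds at or above the requested cutoff, ascending.
    thresholds <- sort(unique(c(cutoff, sel_props[sel_props > cutoff])))
    ok <- vapply(thresholds,
                 function(t) sum(sel_props >= t) <= max_num_clusts,
                 logical(1))
    # Smallest threshold that respects the cap; if ties at the top
    # exceed the cap, this sketch keeps nothing.
    cutoff <- if (any(ok)) thresholds[which(ok)[1]] else Inf
  }
  names(sel_props)[sel_props >= cutoff]
}

kept_by_cutoff <- pick_clusters(sel_props, cutoff = 0.5)
kept_with_cap <- pick_clusters(sel_props, cutoff = 0.5, max_num_clusts = 2)
kept_default <- pick_clusters(sel_props)
```

With cutoff = 0.5 alone, three clusters pass; adding max_num_clusts = 2 tightens the threshold so that only the two most frequently selected clusters remain.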

train_inds

Optional; an integer or numeric vector containing the indices of observations in X_train and y_train to set aside for model training after feature selection. If train_inds is not provided (the default, NA), half of the data, chosen at random, will be used for feature selection and the other half for model estimation.
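To make the split reproducible, train_inds can be fixed by the caller. The sketch below draws a random half of the observation indices in base R; the sample size and seed are arbitrary choices for the example.

```r
set.seed(42)
n <- 100

# Indices reserved for fitting the linear model after feature selection.
train_inds <- sort(sample(seq_len(n), size = floor(n / 2)))

# The remaining observations are the ones left for cluster stability selection.
selection_inds <- setdiff(seq_len(n), train_inds)
```

Passing the same train_inds on every call keeps the model-estimation set constant while other arguments are varied.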

Value

A numeric vector of length nrow(X_predict) of predictions corresponding to the observations from X_predict.
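A minimal sketch of a call on simulated data follows, using only the arguments listed under Usage. The data-generating setup (three noisy copies of one latent signal forming a cluster named "signal") is an assumption made for illustration, and the call is skipped if cssr is not installed.

```r
set.seed(1)
n <- 200
p <- 10

# Training data: features 1 to 3 are noisy copies of a latent signal z.
z <- rnorm(n)
X_train <- matrix(rnorm(n * p), nrow = n, ncol = p)
X_train[, 1:3] <- z + 0.1 * matrix(rnorm(n * 3), nrow = n)
y_train <- z + rnorm(n, sd = 0.5)

# Unlabeled data with the same column layout.
n_new <- 20
z_new <- rnorm(n_new)
X_predict <- matrix(rnorm(n_new * p), nrow = n_new, ncol = p)
X_predict[, 1:3] <- z_new + 0.1 * matrix(rnorm(n_new * 3), nrow = n_new)

preds <- NULL
if (requireNamespace("cssr", quietly = TRUE)) {
  # lambda is left at its default, so it is chosen by cross-validation.
  preds <- cssr::cssPredict(X_train, y_train, X_predict,
                            clusters = list(signal = 1:3),
                            max_num_clusts = 3)
  # One prediction per row of X_predict, as described under Value.
  stopifnot(length(preds) == nrow(X_predict))
}
```

Because each call repeats the subsampling, this form suits one-off predictions; for repeated predictions with different settings, the description above recommends calling css once and reusing its results.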

Author(s)

Gregory Faletto, Jacob Bien


gregfaletto/cssr documentation built on March 3, 2023, 1 p.m.