rfcca: Random Forest with Canonical Correlation Analysis

rfcca R Documentation

Random Forest with Canonical Correlation Analysis

Description

Estimates the canonical correlations between two sets of variables as a function of subject-related covariates.

Usage

rfcca(
  X,
  Y,
  Z,
  ntree = 200,
  mtry = NULL,
  nodesize = NULL,
  nodedepth = NULL,
  nsplit = 10,
  importance = FALSE,
  finalcca = c("cca", "scca", "rcca"),
  bootstrap = TRUE,
  samptype = c("swor", "swr"),
  sampsize = if (samptype == "swor") function(x) {
    x * 0.632
  } else function(x) {
    x
  },
  forest = TRUE,
  membership = FALSE,
  bop = TRUE,
  Xcenter = TRUE,
  Ycenter = TRUE,
  ...
)

Arguments

X

The first multivariate data set which has n observations and px variables. A data.frame of numeric values.

Y

The second multivariate data set which has n observations and py variables. A data.frame of numeric values.

Z

The set of subject-related covariates which has n observations and pz variables. Used in random forest growing. A data.frame with numeric values and factors.

ntree

Number of trees.

mtry

Number of z-variables randomly selected as candidates for splitting a node. The default is pz/3 where pz is the number of z variables. Values are always rounded up.

nodesize

Forest average number of unique data points in a terminal node. The default is 3 * (px + py), where px and py are the number of x- and y-variables, respectively.

nodedepth

Maximum depth to which a tree should be grown. By default, this parameter is ignored.

nsplit

Non-negative integer value for the number of random splits to consider for each candidate splitting variable. When zero or NULL, all possible splits are considered.

importance

Should variable importance of z-variables be assessed? The default is FALSE.

finalcca

Which CCA should be used for final canonical correlation estimation? Choices are cca, scca and rcca, see below for details. The default is cca.

bootstrap

Should the data be bootstrapped? The default is TRUE, which bootstraps the data by sampling without replacement. If FALSE, the data are not bootstrapped, and OOB predictions and variable importance measures cannot be returned.

samptype

Type of bootstrap. Choices are swor (sampling without replacement/sub-sampling) and swr (sampling with replacement/bootstrapping). The default (as in randomForestSRC) is sampling without replacement.

sampsize

Size of sample to draw. For sampling without replacement, the default is 0.632 times the sample size. For sampling with replacement, the default is the sample size.

forest

Should the forest object be returned? It is used for prediction on new data. The default is TRUE.

membership

Should terminal node membership and inbag information be returned? The default is FALSE.

bop

Should the Bag of Observations for Prediction (BOP) for training observations be returned? The default is TRUE.

Xcenter

Should the columns of X be centered? The default is TRUE.

Ycenter

Should the columns of Y be centered? The default is TRUE.

...

Optional arguments to be passed to other methods.
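
The documented defaults for mtry, nodesize and sampsize can be worked out by hand. The following is a minimal sketch in base R and is not the package's internal code: the helper name default_tuning and the dimensions passed to it are hypothetical, chosen only for illustration. This page does not say how a fractional sampsize is rounded, so the raw value is returned.

```r
# Hypothetical helper: reproduces the defaults as documented above.
# Illustration only; this function is not part of RFCCA.
default_tuning <- function(n, px, py, pz, samptype = c("swor", "swr")) {
  samptype <- match.arg(samptype)
  list(
    mtry = ceiling(pz / 3),      # pz / 3, rounded up
    nodesize = 3 * (px + py),    # 3 * (px + py)
    sampsize = if (samptype == "swor") n * 0.632 else n
  )
}

# Assumed dimensions: 200 observations, 2 x-, 2 y- and 7 z-variables
d <- default_tuning(n = 200, px = 2, py = 2, pz = 7)
d$mtry      # 3
d$nodesize  # 12
d$sampsize  # 126.4
```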

Value

An object of class (rfcca,grow) which is a list with the following components:

call

The original call to rfcca.

n

Sample size of the data (NA's are omitted).

ntree

Number of trees grown.

mtry

Number of variables randomly selected for splitting at each node.

nodesize

Minimum forest average number of unique data points in a terminal node.

nodedepth

Maximum depth to which a tree is allowed to be grown.

nsplit

Number of randomly selected split points.

xvar

Data frame of x-variables.

xvar.names

A character vector of the x-variable names.

yvar

Data frame of y-variables.

yvar.names

A character vector of the y-variable names.

zvar

Data frame of z-variables.

zvar.names

A character vector of the z-variable names.

leaf.count

Number of terminal nodes for each tree in the forest. Vector of length ntree.

bootstrap

Was the data bootstrapped?

forest

If forest=TRUE, the rfcca forest object is returned. This object is used for prediction with new data.

membership

A matrix recording terminal node membership where each cell represents the node number that an observation falls in for that tree.

importance

Variable importance measures (VIMP) for each z-variable.

inbag

A matrix recording inbag membership where each cell represents whether the observation is in the bootstrap sample in the corresponding tree.

predicted.oob

OOB predicted canonical correlations for training observations based on the selected final canonical correlation estimation method.

predicted.coef

Predicted canonical weight vectors for x- and y-variables.

bop

If bop=TRUE, a list containing BOP for each training observation is returned.

finalcca

The selected CCA used for final canonical correlation estimations.

rfsrc.grow

An object of class (rfsrc,grow) is returned. This object is used for prediction with training or new data.

Details

Final canonical correlation estimation:

Final canonical correlation can be computed with CCA (Hotelling, 1936), Sparse CCA (Witten et al., 2009) or Regularized CCA (Vinod, 1976; Leurgans et al., 1993). If Regularized CCA is used, \lambda_1 and \lambda_2 should be specified.
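
To make the quantities involved concrete, the following is a minimal sketch of classical CCA on simulated data. It uses base R's stats::cancor as a stand-in; whether rfcca calls this function internally is not stated here, and the simulated X and Y are assumptions made for illustration. The xcenter and ycenter arguments of cancor center the columns, which is presumably analogous to Xcenter and Ycenter above.

```r
set.seed(1)
n <- 200

# Simulated data (assumption): a shared latent variable links X and Y
latent <- rnorm(n)
X <- cbind(x1 = latent + rnorm(n), x2 = rnorm(n))
Y <- cbind(y1 = latent + rnorm(n), y2 = rnorm(n))

# Classical CCA with column centering
fit <- cancor(X, Y, xcenter = TRUE, ycenter = TRUE)

fit$cor    # canonical correlations, in decreasing order
fit$xcoef  # canonical weight vectors for the x-variables
fit$ycoef  # canonical weight vectors for the y-variables

# The first canonical correlation is the correlation between
# the first pair of canonical variates
u <- scale(X, center = TRUE, scale = FALSE) %*% fit$xcoef[, 1]
v <- scale(Y, center = TRUE, scale = FALSE) %*% fit$ycoef[, 1]
abs(as.numeric(cor(u, v)))
```

The last line should agree with fit$cor[1] up to numerical precision.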

References

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4), 321–377.

Leurgans, S. E., Moyeed, R. A., & Silverman, B. W. (1993). Canonical correlation analysis when the data are curves. Journal of the Royal Statistical Society: Series B (Methodological), 55(3), 725–740.

Vinod, H. D. (1976). Canonical ridge and econometrics of joint production. Journal of Econometrics, 4(2), 147–166.

Witten, D. M., Tibshirani, R., & Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3), 515–534.

See Also

predict.rfcca, global.significance, vimp.rfcca, print.rfcca

Examples


## load generated example data
data(data, package = "RFCCA")
set.seed(2345)

## define train/test split
smp <- sample(1:nrow(data$X), size = round(nrow(data$X) * 0.7),
  replace = FALSE)
train.data <- lapply(data, function(x) {x[smp, ]})
test.Z <- data$Z[-smp, ]

## train rfcca
rfcca.obj <- rfcca(X = train.data$X, Y = train.data$Y, Z = train.data$Z,
  ntree = 100, importance = TRUE)

## print the grow object
print(rfcca.obj)

## get the OOB predictions
pred.oob <- rfcca.obj$predicted.oob

## predict with new test data
pred.obj <- predict(rfcca.obj, newdata = test.Z)
pred <- pred.obj$predicted

## get the variable importance measures
z.vimp <- rfcca.obj$importance

## train rfcca and estimate the final canonical correlations with "scca"
rfcca.obj2 <- rfcca(X = train.data$X, Y = train.data$Y, Z = train.data$Z,
  ntree = 100, finalcca = "scca")



RFCCA documentation built on Sept. 19, 2023, 9:06 a.m.