# rfcca: Random Forest with Canonical Correlation Analysis

## Description

Estimates the canonical correlations between two sets of variables, X and Y, as a function of the subject-related covariates Z.

## Usage

```r
rfcca(
  X,
  Y,
  Z,
  ntree = 200,
  mtry = NULL,
  nodesize = NULL,
  nodedepth = NULL,
  nsplit = 10,
  importance = FALSE,
  finalcca = c("cca", "scca", "rcca"),
  bootstrap = TRUE,
  samptype = c("swor", "swr"),
  sampsize = if (samptype == "swor") function(x) { x * 0.632 } else function(x) { x },
  forest = TRUE,
  membership = FALSE,
  bop = TRUE,
  ...
)
```

## Arguments

| Argument | Description |
|---|---|
| `X` | The first multivariate data set, with n observations and px variables. A data.frame of numeric values. |
| `Y` | The second multivariate data set, with n observations and py variables. A data.frame of numeric values. |
| `Z` | The set of subject-related covariates, with n observations and pz variables. Used in random forest growing. A data.frame with numeric values and factors. |
| `ntree` | Number of trees. |
| `mtry` | Number of z-variables randomly selected as candidates for splitting a node. The default is pz/3, where pz is the number of z-variables. Values are always rounded up. |
| `nodesize` | Forest average number of unique data points in a terminal node. The default is 3 * (px + py), where px and py are the number of x- and y-variables, respectively. |
| `nodedepth` | Maximum depth to which a tree should be grown. By default, this parameter is ignored. |
| `nsplit` | Non-negative integer value for the number of random splits to consider for each candidate splitting variable. When zero or `NULL`, all possible splits are considered. |
| `importance` | Should variable importance of z-variables be assessed? The default is `FALSE`. |
| `finalcca` | Which CCA should be used for the final canonical correlation estimation? Choices are `cca`, `scca` and `rcca`; see Details. The default is `cca`. |
| `bootstrap` | Should the data be bootstrapped? The default is `TRUE`, which bootstraps the data by sampling without replacement. If `FALSE`, the data is not bootstrapped, and OOB predictions and variable importance measures cannot be returned. |
| `samptype` | Type of bootstrap. Choices are `swor` (sampling without replacement, i.e. sub-sampling) and `swr` (sampling with replacement, i.e. bootstrapping). The default (as in `randomForestSRC`) is sampling without replacement. |
| `sampsize` | Size of the sample to draw. For sampling without replacement, the default is 0.632 times the sample size. For sampling with replacement, it is the sample size. |
| `forest` | Should the forest object be returned? It is used for prediction on new data. The default is `TRUE`. |
| `membership` | Should terminal node membership and inbag information be returned? The default is `FALSE`. |
| `bop` | Should the Bag of Observations for Prediction (BOP) for training observations be returned? The default is `TRUE`. |
| `...` | Optional arguments to be passed to other methods. |
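The documented defaults for `mtry`, `nodesize` and `sampsize` can be worked out by hand from the dimensions of the data. The following is a minimal base-R sketch of that arithmetic, not code taken from the package: `rfcca_defaults` is a hypothetical helper, and the use of `floor()` to turn the sub-sample size into an integer is an assumption made only for this illustration.

```r
# Hypothetical helper (not part of RFCCA): reproduces the defaults
# described in the Arguments table from the dimensions of X, Y and Z.
rfcca_defaults <- function(n, px, py, pz, samptype = c("swor", "swr")) {
  samptype <- match.arg(samptype)
  list(
    mtry     = ceiling(pz / 3),                          # pz/3, rounded up
    nodesize = 3 * (px + py),                            # 3 * (px + py)
    sampsize = if (samptype == "swor") 0.632 * n else n  # 0.632 * n or n
  )
}

n <- 300
d <- rfcca_defaults(n = n, px = 2, py = 2, pz = 7)

# Draw one in-bag sample for each sampling type.
# Assumption: the sub-sample size is truncated to an integer with floor().
set.seed(1)
inbag_swor <- sample(n, size = floor(d$sampsize), replace = FALSE)
inbag_swr  <- sample(n, size = n, replace = TRUE)

# Observations not drawn are out-of-bag (OOB) for that tree.
oob_swor <- setdiff(seq_len(n), inbag_swor)
oob_swr  <- setdiff(seq_len(n), inbag_swr)
```

With `bootstrap = FALSE` there is no such split into in-bag and OOB observations, which is why OOB predictions and variable importance are unavailable in that case.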

## Value

An object of class `(rfcca,grow)` which is a list with the following components:

| Component | Description |
|---|---|
| `call` | The original call to `rfcca`. |
| `n` | Sample size of the data (`NA`'s are omitted). |
| `ntree` | Number of trees grown. |
| `mtry` | Number of variables randomly selected for splitting at each node. |
| `nodesize` | Minimum forest average number of unique data points in a terminal node. |
| `nodedepth` | Maximum depth to which a tree is allowed to be grown. |
| `nsplit` | Number of randomly selected split points. |
| `xvar` | Data frame of x-variables. |
| `xvar.names` | A character vector of the x-variable names. |
| `yvar` | Data frame of y-variables. |
| `yvar.names` | A character vector of the y-variable names. |
| `zvar` | Data frame of z-variables. |
| `zvar.names` | A character vector of the z-variable names. |
| `leaf.count` | Number of terminal nodes for each tree in the forest. Vector of length `ntree`. |
| `bootstrap` | Was the data bootstrapped? |
| `forest` | If `forest=TRUE`, the `rfcca` forest object is returned. This object is used for prediction with new data. |
| `membership` | A matrix recording terminal node membership, where each cell is the number of the node that an observation falls in for that tree. |
| `importance` | Variable importance measures (VIMP) for each z-variable. |
| `inbag` | A matrix recording inbag membership, where each cell indicates whether the observation is in the bootstrap sample of the corresponding tree. |
| `predicted.oob` | OOB predicted canonical correlations for training observations, based on the selected final canonical correlation estimation method. |
| `predicted.coef` | Predicted canonical weight vectors for x- and y-variables. |
| `bop` | If `bop=TRUE`, a list containing the BOP for each training observation is returned. |
| `finalcca` | The selected CCA used for final canonical correlation estimations. |
| `rfsrc.grow` | An object of class `(rfsrc,grow)` is returned. This object is used for prediction with training or new data. |

## Details

Final canonical correlation estimation:

The final canonical correlation can be computed with CCA (Hotelling, 1936), Sparse CCA (Witten et al., 2009) or Regularized CCA (Vinod, 1976; Leurgans et al., 1993). If Regularized CCA is used, the regularization parameters λ_1 and λ_2 should be specified.
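To make the role of λ_1 and λ_2 concrete, the sketch below compares classical CCA with a ridge-regularized variant on simulated data, using base R only. It is an illustration under stated assumptions rather than the package's implementation: `ridge_cca` is a hypothetical helper that adds λ_1 and λ_2 to the diagonals of the sample covariance matrices of X and Y (the canonical ridge idea), and the data are generated here, not taken from RFCCA.

```r
set.seed(42)
n <- 200
X <- matrix(rnorm(n * 3), n, 3)
# The first column of Y is correlated with the first column of X.
Y <- cbind(X[, 1] + rnorm(n), matrix(rnorm(n * 2), n, 2))

# Hypothetical helper (not part of RFCCA): canonical correlations with
# ridge penalties lambda1 and lambda2 added to cov(X) and cov(Y).
ridge_cca <- function(X, Y, lambda1 = 0, lambda2 = 0) {
  Sxx <- cov(X) + lambda1 * diag(ncol(X))
  Syy <- cov(Y) + lambda2 * diag(ncol(Y))
  Sxy <- cov(X, Y)
  Rx <- chol(Sxx)  # upper triangular, t(Rx) %*% Rx == Sxx
  Ry <- chol(Syy)
  # K = Rx^{-T} Sxy Ry^{-1}; its singular values are the correlations.
  K <- backsolve(Rx, Sxy, transpose = TRUE) %*% backsolve(Ry, diag(ncol(Y)))
  svd(K)$d
}

rho_cca    <- cancor(X, Y)$cor           # classical CCA from stats
rho_ridge0 <- ridge_cca(X, Y)            # no penalty: matches cancor
rho_ridge  <- ridge_cca(X, Y, 0.5, 0.5)  # penalized: shrunken correlations
```

With both penalties at zero the helper agrees with `stats::cancor`, and positive penalties shrink the leading canonical correlation, which is the stabilizing effect regularization is meant to provide when the covariance matrices are poorly conditioned.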

## References

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4), 321–377.

Leurgans, S. E., Moyeed, R. A., & Silverman, B. W. (1993). Canonical correlation analysis when the data are curves. Journal of the Royal Statistical Society: Series B (Methodological), 55(3), 725–740.

Vinod, H. D. (1976). Canonical ridge and econometrics of joint production. Journal of Econometrics, 4(2), 147–166.

Witten, D. M., Tibshirani, R., & Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3), 515–534.

## See Also

`predict.rfcca`, `global.significance`, `vimp.rfcca`, `print.rfcca`

## Examples

```r
## load generated example data
data(data, package = "RFCCA")
set.seed(2345)

## define train/test split
smp <- sample(1:nrow(data$X), size = round(nrow(data$X) * 0.7),
  replace = FALSE)
train.data <- lapply(data, function(x) {x[smp, ]})
test.Z <- data$Z[-smp, ]

## train rfcca
rfcca.obj <- rfcca(X = train.data$X, Y = train.data$Y, Z = train.data$Z,
  ntree = 100, importance = TRUE)

## print the grow object
print(rfcca.obj)

## get the OOB predictions
pred.oob <- rfcca.obj$predicted.oob

## predict with new test data
pred.obj <- predict(rfcca.obj, newdata = test.Z)
pred <- pred.obj$predicted

## get the variable importance measures
z.vimp <- rfcca.obj$importance

## train rfcca and estimate the final canonical correlations with "scca"
rfcca.obj2 <- rfcca(X = train.data$X, Y = train.data$Y, Z = train.data$Z,
  ntree = 100, finalcca = "scca")
```