phenoRegressor.RFR: Random Forest Regression using package randomForest
In GROAN: Genomic Regression Workbench

R Documentation

Description

This function is a wrapper around randomForest and related functions, so it will not work unless the randomForest package is installed. Random forest makes no distinction between regular covariates (genotypes) and extra covariates (fixed effects): if extra covariates are passed, they are placed side by side with the genotypes. The same happens with the covariance matrix. This can lead to the scientifically questionable but technically correct situation of regressing on one big matrix made of SNP genotypes, covariances and other covariates, all collated side by side. The function makes no distinction, and it is up to the user to understand what is correct in each specific experiment.
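The side-by-side collation described above can be pictured with a minimal base R sketch. It assumes the predictors are joined with a plain `cbind()`, which may differ from what the function does internally. The toy data and the name `predictors` are hypothetical and are not part of GROAN.

```r
# Hypothetical toy data: 4 samples, 3 SNP markers, 1 extra covariate.
set.seed(1)
n <- 4
genotypes <- matrix(sample(0:2, n * 3, replace = TRUE), nrow = n,
                    dimnames = list(NULL, paste0("snp", 1:3)))
covariances <- diag(n)                        # placeholder n x n covariance matrix
colnames(covariances) <- paste0("cov", 1:n)
extraCovariates <- matrix(c(0, 1, 0, 1), nrow = n,
                          dimnames = list(NULL, "field"))

# Assumed collation: plain column binding, one row per sample.
predictors <- cbind(genotypes, covariances, extraCovariates)
dim(predictors)            # 4 rows, 3 + 4 + 1 = 8 columns

# cbind() ignores NULL arguments, so an absent input simply contributes no columns.
predictors_no_cov <- cbind(genotypes, NULL, extraCovariates)
dim(predictors_no_cov)     # 4 rows, 3 + 1 = 4 columns
```

In a matrix built this way every column is just another candidate split variable, which is why the forest cannot tell a marker from a fixed effect.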

WARNING: this function can be *very* slow, especially when called on thousands of SNPs.

Usage

phenoRegressor.RFR(
  phenotypes,
  genotypes,
  covariances,
  extraCovariates,
  ntree = ceiling(length(phenotypes)/5),
  ...
)

Arguments

`phenotypes`	phenotypes, a numeric array (n x 1); missing values are predicted
`genotypes`	SNP genotypes, one row per phenotype (n), one column per marker (m), values in 0/1/2 for diploids or 0/1/2/...ploidy for polyploids. Can be NULL if `covariances` is present.
`covariances`	square matrix (n x n) of covariances. Can be NULL if `genotypes` is present.
`extraCovariates`	extra covariates set, one row per phenotype (n), one column per covariate (w). If NULL no extra covariates are considered.
`ntree`	number of trees to grow; defaults to a fifth of the number of samples (rounded up). As per the `randomForest` documentation, it should not be set to too small a number, to ensure that every input row gets predicted at least a few times.
`...`	any extra parameter is passed to `randomForest::randomForest()`
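
To make the `ntree` default concrete, the sketch below evaluates the expression from the Usage signature on a hypothetical phenotype vector. The helper name `default_ntree` is an illustration only. Since the default is computed on `length(phenotypes)`, samples with missing phenotypes count toward it. The resulting value is typically far smaller than the 500 trees that `randomForest::randomForest()` grows by default.

```r
# Hypothetical helper mirroring the default in the Usage signature.
default_ntree <- function(phenotypes) ceiling(length(phenotypes) / 5)

# 103 samples, 10 of which are missing and would be predicted.
phenos <- c(rep(1.5, 93), rep(NA_real_, 10))

default_ntree(phenos)                   # 21, since 103 / 5 = 20.6 rounds up
default_ntree(phenos[!is.na(phenos)])   # 19, shown only for contrast (93 observed samples)
```

For small datasets it may be worth passing a larger `ntree` explicitly, at the cost of longer run times.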

Value

The function returns a list with the following fields:

predictions : an array of (k) predicted phenotypes
hyperparams : named vector with the following keys: ntree (number of grown trees) and mtry (number of variables randomly sampled as candidates at each split)
extradata : the object returned by randomForest::randomForest(), containing the full trained forest and the used parameters
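
The following sketch shows how a list with these three fields could be assembled directly from `randomForest`. It is an illustration under stated assumptions, not the GROAN implementation: it assumes the model is trained on samples with observed phenotypes and then used to predict every row, and all data and variable names are hypothetical. The model-fitting part is skipped if `randomForest` is not installed.

```r
# Hypothetical toy data; names are illustrative, not taken from GROAN.
set.seed(42)
n <- 30
m <- 8
x <- matrix(sample(0:2, n * m, replace = TRUE), nrow = n,
            dimnames = list(NULL, paste0("snp", seq_len(m))))
y <- as.numeric(x %*% rnorm(m) + rnorm(n))
y[1:5] <- NA                       # samples whose phenotype should be predicted

test_rows <- is.na(y)

if (requireNamespace("randomForest", quietly = TRUE)) {
  # Assumption: train on observed phenotypes only.
  rf <- randomForest::randomForest(x = x[!test_rows, ], y = y[!test_rows],
                                   ntree = ceiling(length(y) / 5))
  results <- list(
    predictions = as.numeric(predict(rf, x)),           # assumption: one value per input row
    hyperparams = c(ntree = rf$ntree, mtry = rf$mtry),  # read back from the fitted object
    extradata   = rf                                    # full randomForest object
  )
  print(results$hyperparams)
  print(results$predictions[test_rows])                 # predictions for the missing samples
}
```

Reading `ntree` and `mtry` back from the fitted object, rather than from the call, records the values actually used even when `mtry` was left at its default.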

Examples

## Not run: 
#using the GROAN.KI dataset, we regress on the dataset and predict the first ten phenotypes
phenos = GROAN.KI$yield
phenos[1:10]  = NA

#calling the regressor with random forest
results = phenoRegressor.RFR(
  phenotypes = phenos,
  genotypes = GROAN.KI$SNPs,
  covariances = NULL,
  extraCovariates = NULL,
  ntree = 20,
  mtry = 200 #randomForest-specific parameters
)

#examining the predictions
plot(GROAN.KI$yield, results$predictions,
     main = 'Train set (black) and test set (red) regressions',
     xlab = 'Original phenotypes', ylab = 'Predicted phenotypes')
points(GROAN.KI$yield[1:10], results$predictions[1:10], pch=16, col='red')

#printing correlations
test.set.correlation  = cor(GROAN.KI$yield[1:10], results$predictions[1:10])
train.set.correlation = cor(GROAN.KI$yield[-(1:10)], results$predictions[-(1:10)])
writeLines(paste(
  'test-set correlation :', test.set.correlation,
  '\ntrain-set correlation:', train.set.correlation
))

## End(Not run)

GROAN documentation built on Nov. 28, 2022, 5:07 p.m.