phenoRegressor.RFR: Random Forest Regression using package randomForest

phenoRegressor.RFR    R Documentation

Random Forest Regression using package randomForest

Description

This is a wrapper around randomForest and related functions. As such, this function will not work if the randomForest package is not installed. Random forest makes no distinction between regular covariates (genotypes) and extra covariates (fixed effects). If extra covariates are passed, they are placed side by side with the genotypes. The same happens with the covariance matrix. This can lead to the scientifically questionable but technically correct situation of regressing on one big matrix made of SNP genotypes, covariances and other covariates, all collated side by side. The function makes no distinction, and it is up to the user to understand what is correct in each specific experiment.

WARNING: this function can be *very* slow, especially when called on thousands of SNPs.
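
The side-by-side layout described above can be pictured with a minimal base R sketch. The toy inputs and the name `predictors` are hypothetical, and the snippet only illustrates the idea of column-binding the inputs; it is not taken from the GROAN source.

```r
# Hypothetical toy inputs: 4 samples, 3 SNP markers, a 4 x 4 covariance
# matrix and one extra covariate. None of these come from GROAN.
set.seed(1)
n <- 4
genotypes <- matrix(sample(0:2, n * 3, replace = TRUE), nrow = n,
                    dimnames = list(NULL, paste0("snp", 1:3)))
covariances <- diag(n)
colnames(covariances) <- paste0("cov", 1:n)
extraCovariates <- matrix(c(10, 20, 30, 40), nrow = n,
                          dimnames = list(NULL, "plot"))

# All inputs collated side by side: one row per sample,
# 3 + 4 + 1 = 8 predictor columns.
predictors <- cbind(genotypes, covariances, extraCovariates)
dim(predictors)

# cbind() ignores NULL arguments, so an omitted input simply drops out.
genotypes_only <- cbind(genotypes, NULL, NULL)
dim(genotypes_only)
```

In such a matrix a tree can split on a SNP, a covariance column or an extra covariate alike, which is why the choice of inputs is left to the user.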

Usage

phenoRegressor.RFR(
  phenotypes,
  genotypes,
  covariances,
  extraCovariates,
  ntree = ceiling(length(phenotypes)/5),
  ...
)

Arguments

phenotypes

phenotypes, a numeric array (n x 1); missing values (NA) are predicted

genotypes

SNP genotypes, one row per phenotype (n), one column per marker (m), values in 0/1/2 for diploids or 0/1/2/...ploidy for polyploids. Can be NULL if covariances is present.

covariances

square matrix (n x n) of covariances. Can be NULL if genotypes is present.

extraCovariates

extra covariates set, one row per phenotype (n), one column per covariate (w). If NULL no extra covariates are considered.

ntree

number of trees to grow; defaults to a fifth of the number of samples, rounded up. As per the randomForest documentation, it should not be set to too small a number, to ensure that every input row gets predicted at least a few times.

...

any extra parameter is passed to randomForest::randomForest()
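
As a small worked example of the ntree default and of the role of missing phenotypes, here is a base R sketch. The phenotype vector and the names `ntree_default`, `train_rows` and `test_rows` are hypothetical; the split into training and prediction rows is an assumption about how a wrapper of this kind would treat NA values, not a description of the GROAN internals.

```r
# Hypothetical phenotype vector: 23 samples, the first 3 to be predicted.
phenotypes <- c(rep(NA, 3), seq(1, 20))

# Default from the Usage section: a fifth of the samples, rounded up.
# length() counts the NA entries too, so the default is based on all 23.
ntree_default <- ceiling(length(phenotypes) / 5)

# Assumed split: rows with a phenotype are available for training,
# rows with NA are the ones to be predicted.
train_rows <- which(!is.na(phenotypes))
test_rows  <- which(is.na(phenotypes))

ntree_default
length(train_rows)
test_rows
```

With only a handful of trees, some rows may rarely be out-of-bag, which is the concern behind the advice not to set ntree too low.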

Value

The function returns a list with the following fields:

  • predictions : an array of (k) predicted phenotypes; in the Examples section it is indexed with the same positions as the input phenotypes

  • hyperparams : named vector with the following keys: ntree (number of grown trees) and mtry (number of variables randomly sampled as candidates at each split)

  • extradata : the object returned by randomForest::randomForest(), containing the full trained forest and the used parameters
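
To make the shape of the returned list concrete, the following sketch builds a list with the same three fields directly from randomForest. It is a rough illustration under stated assumptions, not the GROAN implementation: the simulated data, the variable names and the choice to train on non-missing rows and predict all rows are hypothetical. The block is skipped when randomForest is not installed.

```r
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)

  # Hypothetical data: 30 samples, 5 SNP markers coded 0/1/2.
  x <- matrix(sample(0:2, 30 * 5, replace = TRUE), nrow = 30,
              dimnames = list(NULL, paste0("snp", 1:5)))
  y <- rowSums(x) + rnorm(30)
  y[1:5] <- NA  # phenotypes to be predicted

  # Assumption: train on rows with a phenotype, then predict every row.
  train <- !is.na(y)
  rf <- randomForest::randomForest(x = x[train, ], y = y[train],
                                   ntree = 20, mtry = 2)

  out <- list(
    predictions = as.numeric(predict(rf, newdata = x)),
    hyperparams = c(ntree = rf$ntree, mtry = rf$mtry),
    extradata   = rf
  )
  str(out$hyperparams)
}
```

Keeping the fitted object in extradata means that anything randomForest exposes, such as variable importance, remains available after the call.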

See Also

randomForest

Other phenoRegressors: phenoRegressor.BGLR(), phenoRegressor.SVR(), phenoRegressor.dummy(), phenoRegressor.rrBLUP(), phenoregressor.BGLR.multikinships()

Examples

## Not run: 
#using the GROAN.KI dataset: the first ten phenotypes are set to NA so that they get predicted
phenos = GROAN.KI$yield
phenos[1:10]  = NA

#calling the regressor with random forest
results = phenoRegressor.RFR(
  phenotypes = phenos,
  genotypes = GROAN.KI$SNPs,
  covariances = NULL,
  extraCovariates = NULL,
  ntree = 20,
  mtry = 200 #randomForest-specific parameter, passed on via ...
)

#examining the predictions
plot(GROAN.KI$yield, results$predictions,
     main = 'Train set (black) and test set (red) regressions',
     xlab = 'Original phenotypes', ylab = 'Predicted phenotypes')
points(GROAN.KI$yield[1:10], results$predictions[1:10], pch=16, col='red')

#printing correlations
test.set.correlation  = cor(GROAN.KI$yield[1:10], results$predictions[1:10])
train.set.correlation = cor(GROAN.KI$yield[-(1:10)], results$predictions[-(1:10)])
writeLines(paste(
  'test-set correlation :', test.set.correlation,
  '\ntrain-set correlation:', train.set.correlation
))

## End(Not run)

GROAN documentation built on Nov. 28, 2022, 5:07 p.m.