In leekgroup/phenopredict: Predicting phenotypes from RNA-seq expression data

library(phenopredict)
## Track time spent on making the vignette
startTime <- Sys.time()

Load libraries

## load libraries
library('devtools')
install_github("leekgroup/phenopredict")
# document("/users/sellis/phenopredict")
library('phenopredict')

Load data

data(sysdata, package='phenopredict')   
#loads cm, cm_new, regiondata, pheno to run example

Filter regions

The first step in building the predictor is filtering the regions that should be used for prediction. This step will require an expression data set (expression) in which your samples are in columns and your expression data are in rows. A corresponding GRanges object (regiondata) is required. The rows of this object should correspond to those in your expression data set. Finally, you will have to provide a phenotype file (phenodata). The rows of this phenotype file will contain the samples included in your expression data set and the columns can include any phenotype information. phenotype specifies the phenotype upon which you want to build the predictor. covariates specifies those covariates you would like to include in the predictor.

Note: the speed of this step depends on both the number of regions in your expression set AND (even moreso on) the number of levels in your phenotype being predicted.

# number of regions in expression data 
nrow(cm)

# number of samples included in example
ncol(cm)

inputdata<-filter_regions(expression=cm, regiondata=regiondata ,phenodata=pheno, phenotype="Sex",
    covariates=NULL,type="factor", numRegions=100)

# taking a look at output of filter_regions()
dim(inputdata$covmat)

inputdata$regiondata

head(inputdata$regioninfo)

Build predictor

After selecting the regions for your phenotype of choice, you'll have to calculate the coefficient estimates for these regions. This will output the coefficient regions for your selected regions (rows) for each of the categories of your predictor (columns)

predictor<-build_predictor(inputdata=inputdata ,phenodata=pheno, phenotype="Sex", covariates=NULL,type="factor", numRegions=10)

#number of probes used for prediction
length(predictor$trainingProbes)

#this contains the coefficient estimates used for prediction. 
# the number of rows corresponds to the number of sites used for prediction
# while the columns corresponds to the number of categories of your phenotype.
dim(predictor$coefEsts)

Test predictor

If you want to test the accuracy of your build predictor on data in which the phenotype is known, use test_predictions().

predictions_test <-test_predictor(inputdata=inputdata ,phenodata=pheno, phenotype="Sex", 
    covariates=NULL,type="factor",predictordata=predictor)

# get summary of how prediction is doing
predictions_test$summarized

Extract data

To calculate the predictions for your chosen phenotype in a new data set, you will have to first extract the coverage matrix for the regions from filter_regions() for your new data set. extract_data() supplies this:

# looking at the input data for extract_data
dim(cm_new)

test_data<-extract_data(newexpression=cm_new, newregiondata=regiondata, predictordata=predictor)

Predict phenotype

Finally, with coefficient estimates calculated and the appropriate regions selected from the test data set, you can now make your predictions!

predictions<-predict_pheno(inputdata_test= test_data, phenodata=pheno, phenotype="Sex", covariates=NULL,type="factor", predictordata = predictor)

#looking at the output
table(predictions)

Vignette information

## Time spent creating this report:
diff(c(startTime, Sys.time()))

## Date this report was generated
message(Sys.time())

## Reproducibility info
options(width = 120)
devtools::session_info()

Code for creating the vignette

## Create the vignette
library('rmarkdown')
system.time(render('/users/sellis/phenopredict/vignettes/phenopredict-quickstart.Rmd', BiocStyle::html_document()))

## Extract the R code
library('knitr')
knit('/users/sellis/phenopredict/vignettes/phenopredict-quickstart.Rmd', tangle = TRUE)