library(mofaCLL)
library(glmnet)
library(DESeq2)

Introduction

This document illustrates how to use the \code{CLLPDestimate} function in the mofaCLL package to estimate CLL-PD values for the samples in a CLL cohort from a gene expression dataset.

Data input

The only input \code{CLLPDestimate} needs is a gene expression matrix from either mRNA sequencing or microarray profiling. The row names must be gene identifiers and the column names sample identifiers. Currently, either Ensembl Gene IDs or HGNC Gene Symbols are supported as gene identifiers.
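A quick sanity check on the matrix layout can catch identifier problems before estimation. The sketch below is not part of mofaCLL: the helper name checkExprMatrix and the toy matrix are hypothetical, and the pattern assumes human Ensembl Gene IDs of the form "ENSG" followed by 11 digits.

```r
# Hypothetical helper (not part of mofaCLL): basic layout checks on an expression matrix.
checkExprMatrix <- function(exprMat) {
  ids <- rownames(exprMat)
  stopifnot(!is.null(ids), !is.null(colnames(exprMat)))
  # Assumption: human Ensembl Gene IDs look like "ENSG" followed by 11 digits.
  isEnsembl <- grepl("^ENSG[0-9]{11}$", ids)
  list(
    fractionEnsembl = mean(isEnsembl),          # share of rows that look like Ensembl IDs
    duplicatedIds   = ids[duplicated(ids)],     # identifiers that occur more than once
    allNumeric      = all(vapply(as.data.frame(exprMat), is.numeric, logical(1)))
  )
}

# Toy matrix for illustration only
toyMat <- matrix(rnorm(6), nrow = 3,
                 dimnames = list(c("ENSG00000141510", "ENSG00000136997", "ENSG00000171862"),
                                 c("sample1", "sample2")))
check <- checkExprMatrix(toyMat)
check$fractionEnsembl
```

A low fractionEnsembl on a matrix that is supposed to use Ensembl Gene IDs suggests that the identifiers sit in a regular column rather than in the row names, or that gene symbols were used instead.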

For RNA sequencing data, we recommend either the variance stabilized counts generated by \code{varianceStabilizingTransformation} from the DESeq2 package or the transformed counts produced by \code{voom} from the limma package. For microarray data, we recommend log2 transformed intensities.
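As a minimal sketch of the DESeq2 route, the code below applies the variance stabilizing transformation to a simulated count dataset. The simulated data come from the DESeq2 helper makeExampleDESeqDataSet and only stand in for a real cohort; their row names are placeholders such as "gene1", not Ensembl Gene IDs.

```r
library(DESeq2)

# Simulated count data for illustration only; replace with your own DESeqDataSet.
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)

# Variance stabilizing transformation; blind = TRUE ignores the design formula.
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)

# Extract the transformed values as a genes-by-samples numeric matrix.
exprVst <- assay(vsd)
dim(exprVst)
```

The resulting matrix has genes in rows and samples in columns, which is the layout described above.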

Here is an example input dataset:

filePath <- system.file("externalData/testCohort.tsv", package = "mofaCLL")
exprMat <- read.table(filePath, header = TRUE)

head(exprMat)

The expression matrix uses Ensembl Gene IDs as row names, and the values are variance stabilized counts generated with DESeq2.

Estimate CLL-PD

The CLL-PD values for the samples in the expression matrix can be estimated with the CLLPDestimate function:

estimateResult <- CLLPDestimate(exprMatrix = exprMat, identifier = "ensembl_gene_id",
                                topVariant = 5000, normalize = TRUE, repeats = 20)

The parameter identifier can be "ensembl_gene_id" or "gene_symbol", depending on the type of gene identifier used in the expression matrix.

Sometimes, filtering out genes with very low variance can improve model performance. Users can either filter out low-variance genes themselves or pass a number to the topVariant parameter. If a number n is specified, the rows of the expression matrix are first ordered by decreasing variance and only the top n rows are used.
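For users who prefer to filter by themselves, the ordering described above can be sketched in base R as follows. The helper name selectTopVariant and the toy matrix are hypothetical and only illustrate the idea; this is not the code used inside CLLPDestimate.

```r
# Hypothetical helper: keep the n rows (genes) with the highest variance.
selectTopVariant <- function(exprMat, n) {
  rowVar <- apply(exprMat, 1, var)
  # Order rows by decreasing variance and keep at most n of them.
  keep <- order(rowVar, decreasing = TRUE)[seq_len(min(n, nrow(exprMat)))]
  exprMat[keep, , drop = FALSE]
}

# Toy matrix for illustration only
set.seed(1)
toyMat <- matrix(rnorm(40), nrow = 8,
                 dimnames = list(paste0("gene", 1:8), paste0("sample", 1:5)))
toyMat["gene3", ] <- c(-10, -5, 0, 5, 10)   # make one gene clearly more variable

topMat <- selectTopVariant(toyMat, n = 3)
rownames(topMat)
```

The most variable gene comes first in the filtered matrix, and drop = FALSE keeps the result a matrix even when n is 1.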

The parameter normalize specifies whether row-wise z-scores of the input matrix should be used. Normalization is generally recommended, unless the input matrix is already a z-score matrix.
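To make the term concrete, a row-wise z-score centers each gene to mean 0 and scales it to standard deviation 1 across samples. The base R sketch below shows one common way to compute it; the helper name rowZscore is hypothetical, and the exact implementation inside CLLPDestimate may differ.

```r
# Row-wise z-score: center each gene to mean 0 and scale to standard deviation 1.
# scale() works column-wise, so transpose before and after.
rowZscore <- function(exprMat) {
  t(scale(t(exprMat)))
}

# Toy matrix for illustration only
toyMat <- matrix(c(1, 2, 3,
                   10, 20, 30), nrow = 2, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB"), paste0("sample", 1:3)))
zMat <- rowZscore(toyMat)
rowMeans(zMat)
```

Both toy genes end up with the same z-scores even though their raw scales differ by a factor of ten. Genes with zero variance produce NaN values under this computation, which is another reason to remove them beforehand.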

repeats specifies the number of repetitions of the cross-validation. A higher number generally leads to more stable results but takes longer to run. Normally, a number between 20 and 100 should be adequate.

How does CLLPDestimate work?

The CLLPDestimate function uses the same process described in the methods section of the paper "Multi-omics data integration identifies mTOR-MYC-OXPHOS as a driver of aggressive chronic lymphocytic leukemia" by Lu et al.
Briefly, CLLPDestimate first subsets the built-in expression matrix to the genes/probes present in both the built-in and the user-specified matrix. Then, repeated cross-validation of LASSO linear regression models is performed to select the best linear model for predicting CLL-PD in the built-in cohort from the reduced expression matrix. Finally, the selected model is applied to the user-specified expression matrix to predict CLL-PD for the samples in the user-specified data.
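The three steps can be sketched with glmnet on simulated data. This is a simplified illustration, not the actual implementation of CLLPDestimate: all matrices and the response y are simulated, the number of repeats and folds is arbitrary, and choosing the fit with the lowest cross-validation error is an assumption about how the "best" model is selected.

```r
library(glmnet)

set.seed(1)
# Simulated stand-ins (assumptions): a "built-in" training cohort with a known
# response, and a "user" cohort measured on a partly different gene set.
nTrain <- 60
nUser <- 10
trainMat <- matrix(rnorm(200 * nTrain), nrow = 200,
                   dimnames = list(paste0("gene", 1:200), paste0("train", 1:nTrain)))
userMat <- matrix(rnorm(150 * nUser), nrow = 150,
                  dimnames = list(paste0("gene", 51:200), paste0("user", 1:nUser)))
y <- 2 * trainMat["gene60", ] - 1.5 * trainMat["gene120", ] + rnorm(nTrain, sd = 0.5)

# Step 1: restrict both matrices to the shared genes.
shared <- intersect(rownames(trainMat), rownames(userMat))
xTrain <- t(trainMat[shared, ])   # glmnet expects samples in rows
xUser  <- t(userMat[shared, ])

# Step 2: repeated cross-validated LASSO (alpha = 1).
# Assumption: keep the repeat with the lowest cross-validation error.
fits <- lapply(seq_len(5), function(i) cv.glmnet(xTrain, y, alpha = 1, nfolds = 5))
cvErr <- vapply(fits, function(f) min(f$cvm), numeric(1))
best <- fits[[which.min(cvErr)]]

# Step 3: apply the selected model to the user cohort.
predicted <- predict(best, newx = xUser, s = "lambda.min")[, 1]
head(predicted)
```

Because the model is trained only on the shared genes, the fewer informative genes the user matrix contains, the less of the original signal the model can recover.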

Result interpretation

The output of CLLPDestimate is a list object containing three elements.

estimated_CLLPD is a numeric vector that contains the estimated CLL-PD values for the samples in the user-specified expression matrix.

estimateResult$estimated_CLLPD

featureCoefficient is a table that contains the features with non-zero coefficients used by the selected model to estimate CLL-PD.

head(estimateResult$featureCoefficient)

trainingR2 is a numeric vector that contains the variance explained values (R2 values) of CLL-PD from the repeated cross-validations in the built-in cohort. Because expression data can be generated on different platforms that measure different numbers of genes, this value serves as a quality check of whether the genes provided in the user-specified matrix are sufficient to recapture CLL-PD in the original built-in cohort.

estimateResult$trainingR2
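For readers unfamiliar with the term, the sketch below shows the conventional definition of variance explained, R2 = 1 - SS_residual / SS_total. The helper name varianceExplained is hypothetical, and we assume this conventional definition only for illustration; the package may compute the value differently.

```r
# Conventional variance explained (assumed definition, for illustration only).
varianceExplained <- function(observed, predicted) {
  ssResidual <- sum((observed - predicted)^2)
  ssTotal <- sum((observed - mean(observed))^2)
  1 - ssResidual / ssTotal
}

observed <- c(1, 2, 3, 4, 5)
r2Perfect <- varianceExplained(observed, observed)                # perfect prediction
r2Mean <- varianceExplained(observed, rep(mean(observed), 5))     # always predicting the mean
c(perfect = r2Perfect, meanOnly = r2Mean)
```

Under this definition, values close to 1 mean that the model recovers most of the variation in CLL-PD, while values near 0 mean that it does no better than predicting the cohort mean.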

Further information to interpret results

As mentioned in our paper, we caution that our current operationalization of CLL-PD, whether from the MOFA analysis or from gene expression data via CLLPDestimate, is unlikely to be optimal, because the platforms users employ can differ substantially from the RNA-seq platform we used for the training cohort. Furthermore, the estimated CLL-PD values in the user-specified cohort are relative numbers that indicate the relative aggressiveness of the CLL samples within that cohort. Therefore, we consider the current state of this work a proof of concept that will allow further refinement into a robustly measurable biomarker.



lujunyan1118/mofaCLL documentation built on Dec. 21, 2021, 12:42 p.m.