Peter W. Eide^1,2,3^, Jarle Bruun^1,2^, Ragnhild A. Lothe^1,2,3^, Anita Sveen^1,2^

^1^ Department of Molecular Oncology, Institute for Cancer Research and ^2^ K.G.Jebsen Colorectal Cancer Research Centre, Oslo University Hospital, Oslo, NO-0424, Norway^3^ Institute for Clinical Medicine, University of Oslo, Oslo, N-0318, Norway

library(Biobase)
library(BiocStyle)
knitr::opts_chunk$set(fig.width=6, fig.height=3, 
        dev.args=list(pointsize=8), dpi=150,
        collapse=TRUE, message=TRUE, echo=TRUE, warnings=FALSE)
options(scipen=-1, digits=2)

Introduction

Colorectal cancers (CRCs) can be divided into four gene expression-based biologically distinct consensus molecular subtypes (CMS) [@guinney_consensus_2015]. This classification provides prognostic stratification of the patients and presents a potential basis for stratified treatment. The original CMS classifier is dependent on gene expression signals from the immune and stromal compartments and often fails to identify the poor-prognostic CMS4 mesenchymal group in immortalized cell lines and patient-derived organoids. CMScaller uses cancer cell-intrinsic, subtype-specific gene expression markers as features for Nearest Template Prediction [@hoshida_nearest_2010].

Input data

CMScaller provides robust cross platform and sample-type performance given a balanced, homogeneous dataset of >40 unique samples.[@eide_cmscaller:_2017; @sveen_colorectal_2018] For less than \~40 samples, sampling variance (by-chance subtype depletion/enrichment) is a concern. Similarly, selection, e.g. excluding microsatellite instable (MSI) samples or including only aggressive cancers, would break an underlying assumption and bias the resulting predictions [@zhao_molecular_2015].

Quick start

Installation and dependencies

The following packages are required in order to run examples in this vignette.

In addition, r Biocpkg("edgeR") is needed for specific RNA-sequencing normalization methods and r CRANpkg(c("parallel", "snow")) for ntp parallelization.

# dependencies: run if not already installed
source("https://bioconductor.org/biocLite.R")
biocLite(c("Biobase", "limma"))
# proper repository to be fixed for publication
install.packages("pathToPackageFile/CMScaller_0.99.2.tar.gz", repos=NULL)

CMS classification

CMScaller function requires an expression matrix or ExpressionSet as input (emat). Gene names in rownames(emat) must be NCBI Entrez, Ensembl or HGNC symbol identifiers. For gene symbols or Ensembl identifiers, parameter rowNames must be set to symbol or ensg, respectively. The code chunk below demonstrates how to perform classification using TCGA primary colorectal cancer example data [@tcga_comprehensive_2012].

[^log]: Hoshida[@hoshida_nearest_2010] does not explicitly state whether input should be log~2~transformed or not and examples in paper include both. Such transformation reduces the weight of genes with large deviations and will affect results at the margins.

library(Biobase) # if input is ExpressionSet
library(CMScaller)
# get RNA-seq counts from TCGA example data
counts <- exprs(crcTCGAsubset)
head(counts[,1:2])
# prediction and gene set analysis
par(mfrow=c(1,2))
res <- CMScaller(emat=counts, RNAseq=TRUE, FDR=0.05)
cam <- CMSgsa(emat=counts, class=res$prediction,RNAseq=TRUE)
# comparison with true class
table(pred=res$prediction, true=crcTCGAsubset$CMS)
head(res, n=3)

Package details

CMScaller is basically a wrapper function for ntp. Similarly, CMSgsa just provides some presets for subCamera.

Preparing custom templates

Templates consists of sets of subtype-specific marker genes. subDEG performs r Biocpkg("limma") differential expression analysis for identification of such markers. Below, is an example on how to prepare custom templates based on a training set with known class labels. doVoom=TRUE enables voom transformation - required for proper limma modeling of RNA-sequencing counts [@law_voom:_2014].

emat <- crcTCGAsubset
cms <- emat$CMS.Syn
train <- sample(seq_along(cms), size=length(cms)/(2))
deg <- subDEG(emat[,train], class=cms[train], doVoom=TRUE)
templates <- ntpMakeTemplates(deg, resDEG=TRUE, topN=50)
templates$symbol <- fromTo(templates$probe)
tail(templates,n=3)

Gene Set Analysis

subCamera provides gene set analysis and visualization and is a wrapper functions for camera in the r Biocpkg("limma") package. camera controls for intra-set gene-wise correlations in order to reduce false-positive rate while retaining statistical power [@wu_camera:_2012; @ritchie_limma_2015]. CMSgsa provides preset gene sets to subCamera.

# increase left margins to accommodate gene set names
par.old <- par()
par(mfrow=c(1,1), mar=par.old$mar+c(0,4,0,0))
subCamera(counts, cms, geneList=geneSets.CRC, doVoom=TRUE)
# restore margins
par(mar=par.old$mar)

PCA Analysis

subPCA and plotPC provide convenient wrapper functions for prcomp. subPCA perform PCA on the input data and plot the resulting low-dimensional projection with samples colored according to either a continuous covariate (e.g. expression of gene of interest) or group (such as CMS). plotPC visualizes the most important variables.

# increase left margins to accommodate gene set names
par(mfrow=c(1,2))
p <- subPCA(emat = crcTCGAsubset, class = crcTCGAsubset$CMS.Syn, 
            normMethod = "quantile", pch=16, frame=FALSE)
plotPC(p, n=6, entrez=TRUE)

Nearest Template Prediction

ntp matches templates$probe against rownames(emat). Missing features and features with NA/NaN's are ignored in the prediction. emat should be row-wise centered and scaled.

# loads included emat, scales and centers
emat <- crcTCGAsubset
emat_sc <- ematAdjust(emat, normMethod="quantile")
head(emat_sc[,1:2])

ntp function requires an expression matrix and templates. Since prediction confidence is estimated from permutations, strict $p$-value reproducibility requires set.seed.

# test set prediction
res <- ntp(emat_sc[,-train], templates, nPerm=1000)
res <- subSetNA(res, pValue=.1)
table(pred=res$prediction, true=cms[-train])
head(res)

ntp output is a data.frame with $3+K$ columns where $K$ is the number of classes. Rows represent columns in input emat.

subSetNA function resets predictions with $p$-value or FDR above some arbitrary threshold to NA.

Nearest Template Prediction

Nearest template prediction (NTP) was proposed as a classification algorithm by Yujin Hoshida and published in PLoS ONE in 2010 [@hoshida_nearest_2010]. It aims to provide robust single-sample class prediction for high-dimensional, noisy gene expression data. In brief, first, for each subclass, a template, a list of genes coherently upregulated is determined. Then, for each sample, the distance to each template is calculated and class is assigned based on the smallest distance. Finally, prediction confidence is assessed based on the distance of the null-distribution, estimated from permutation tests (feature permutation). The default distance metric selected by Hoshida was a cosine similarity-derived distance (see below). When applied to a reasonably well-balanced homogeneous dataset, row-wise centering and scaling is performed (gene means and standard deviations $\mu=0$, $\sigma=1$). In case of single-sample prediction, feature-wise means and standard deviations from a previous sample set are used to perform sample-wise scaling and centering. The key advantages of the NTP algorithm are conceptual simplicity, biological plausibility, ease of implementation and robustness.

Formally, $N$ samples with expression values for $P$ genes divided into $K$ different classes.

For the sample and template vectors ${x}$ and ${y}$, a proper distance metric, $d_{x,y}$ for the similarity function $f(x,y)$ is given by $d=\sqrt{\frac{1}{2}(1-f(x,y))}$ [@van_dongen_metric_2012]. Here $f$ is either cosine, Kendall, Pearson or Spearman correlation. Cosine similarity, the angle between two Euclidean vectors is given by $$f(x,y)=\cos{(\theta)}=\frac{\sum{xy}}{\sqrt{\sum{x^2}}{\sqrt{\sum{y^2}}}}$$

The following code chunks demonstrate NTP in code.

# random centered/scaled expression matrix and templates
set.seed(42)
N <- 5000;P <- 5000;K <- 4;nPerm <- 1000;n <- 1
X <- matrix(rnorm(P*N, mean=0, sd=1), ncol=N)
Y <- matrix(rbinom(P*K, size=1, prob=.01), ncol=K)
# sample-template correlations (implemented in corCosine)
cos.sim <- crossprod(X,Y) / outer(
                sqrt(apply(X, 2, crossprod)), 
                sqrt(apply(Y, 2, crossprod)))
# sample-template distances (vectorized)
simToDist <- function(cos.sim) sqrt(1/2 * (1-cos.sim))
cos.dist <- simToDist(cos.sim)
hist(cos.dist, xlab="cosine correlation distance")

For centered, scaled and uncorrelated data, the cosine correlation distance is $$\sqrt{0.5\times(1-0)}\approx0.707$$

Resulting distances are ranked among distances of permutated samples and used to estimate prediction confidence. The lowest possible $p$-value estimate is therefore $1/permutations$

# estimate prediction confidence
pred.class <- apply(cos.dist, 1, which.min)
pred.dists <- apply(cos.dist, 1, min)
null.dist <- replicate(nPerm, min(simToDist(corCosine(sample(X[,n]), Y))))
p <- rank(c(pred.dists[n], null.dist))[1]/(length(null.dist))

Code and plot below illustrate the uniform $p$-value distribution for centered and scaled uncorrelated input[^pval].

[^pval]: present NTP implementation provides more conservative $p$-value estimates than Hoshida[@hoshida_nearest_2010].

# rearrange matrix and templates for ntp input
rownames(X) <- make.names(seq_len(P))
templates <- lapply(seq_len(K), function(k) rownames(X)[Y[,k]==1])
names(templates) <- paste("k", seq_len(K))
templates <- ntpMakeTemplates(templates, resDEG=FALSE)
# permutations set to 100 to reduce processing for vignette
par(mfrow=c(1,2))
res <- ntp(X, templates, nCores=1L, nPerm=100, doPlot=TRUE)
# expect uniform distribution
hist(res$p.value, main ="", xlab="prediction p-values")

Notes

\clearpage

Session

toLatex(sessionInfo())

References



peterawe/CMScaller documentation built on June 13, 2020, 4:49 a.m.