# date: "`r doc_date()`"
# "`r pkg_ver('BiocStyle')`"
# <style>
#     pre {
#     white-space: pre !important;
#     overflow-y: scroll !important;
#     height: 50vh !important;
#     }
# </style>

require(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
BiocStyle::markdown()

Introduction

The majority of statistical strategies used in differential expression analysis of high-dimensional biological data (e.g., gene expression, CyTOF) are based on linear models or (analysis of variance). Linear models are favored in many biological problems mainly due to their ease of use and interpretability. However, some biological problems (e.g., gene expression, CyTOF) often require nonlinear models as linear models might be insufficient to capture the relationship among the co-expression features (e.g., relationships among signaling markers within (and across) cell subpopulations in CyTOF data). Kernel-based approaches extend the class of linear models to nonlinear models in a computationally tractable manner. Additionally, kernel-based approaches assume a more general relationship based on a Hilbert space of functions spanned by a certain kernel function instead of assuming a linear functional relationship.

In this vignette, we introduce a differential expression analysis of high-dimensional biological data using kernel based score test, referred to as kernel-based differential analysis (cytoKernel), which can identify differentially expressed features. The cytoKernel R-package contains the CytoK() function, which computes the adjusted p value for each feature based on two groups. Further, the package calculates the shrunken effective size and its corresponding effective size standard deviation (sd) using the Adaptive Shrinkage (ash) procedure (@stephens2017false). We demonstrate with an example data set that the Differential expression analysis using Kernel-based Score test (cytoKernel) procedure can be adapted to flow and mass cytometry experiments along with other biological experiments (e.g., gene expression, RNASeq).

Consider a matrix of order $p \times n$ with elements $x_{ij}$, $i=1,\dots,p$ and $j=1,\dots,n$, where $n$ is the number of samples (Group1+Group2 combined) and $p$ is the number of features.

We observe the data ${x_{ij},y_j}$, where $x_{ij}$ is the median intensity for sample $j$ and feature $i$. $y_j$ (binary response) is the group (or condition) label for sample $j$ ($y_j=0$ (Group1) and $y_j=1$ (Group2).

For feature $i$, $i=1,\dots,p$, we define a simple logistic (semi-parametric) model,

$$ logit[P(y_j=1)] = \beta_0+ f(x_{ij}), $$

where $f(.)$ is a centered smooth function in a Reproducible Kernel Hilbert Space (RKHS) spanned by $k$.

If $H_0:~f(.)=0$, then feature expression value $x_{ij}$ is not related to the group labels $y_j$ for feature $i$ i.e., feature $i$ is differentially expressed.

Let $K$ be a $n \times n$ Gram matrix with $K_{st}=K_{\rho}(x_{sj},~x_{tj})$. Here, $k_{\rho}(.,.)$ is the reproducing kernel of RKHS which contains $f(.)$ and $\rho$ is an unknown kernel parameter.

Let $\mathbf{y}$ be a $n \times 1$ vector of $0$ and $1$. The score test statistic under null hypothesis ($H_0:~f(.)=0$) in the logistic model defined above is, $$ S(\rho) = \frac{Q(\rho)- \mu_{Q}}{\sigma_{Q}},
$$

where $Q(\rho)=(\mathbf{y}-\mathbf{\hat{\mu_0}})^{\mathbf{T}}\mathbf{K}(\mathbf{y}-\mathbf{\hat{\mu_0}})$, $\mathbf{\hat{\mu_0}}={logit}^{-1}\hat{\beta_0}$ and $\hat{\beta_0}$ is the estimate of $\beta_0$ under null model.

More details about the estimation of $\mu_{Q}$, $\sigma_{Q}$ and choices of $\rho$ in (@zhan2015kernel, @liu2008estimation, @davies1987hypothesis.

$Q(\rho)$ can be rewritten as, $$ Q(\rho) = \sum_s {\sum_t {{k(x_{is},x_{it})}(y_{s}-\hat{\mu_0})(y_{t}-\hat{\mu_0})}}, $$ which is the component-wise product of matrices $\mathbf{K}$ and $(\mathbf{y}-\mathbf{\hat{\mu_0}})(\mathbf{y}-\mathbf{\hat{\mu_0}})^{\mathbf{T}}$.

We use a Gaussian distance based kernel: $$ k(x_{is},x_{it})= exp\left{-\frac{(x_{is}-x_{it})^2}{\rho}\right}. $$

$S(\rho)$ has a Normal distribution for each value of $\rho$ (@davies1987hypothesis).

The data structure is shown in Figure 1 below.

knitr::include_graphics("cytoSchematic.png")

Getting Started

Load the package in R

library(cytoKernel) 

cytoHD Data

The cytoKernel package contains a pre-processed median marker expressions data SummarizedExperiment assay object of 126 cluster-marker combinations (features) measured in 8 subjects (4 measured before and 4 upon BCR/FcR-XL stimulation (BCRXL) with B cell receptor/Fc receptor crosslinking for 30’, resulting in a total of 8 samples) from (@bodenmiller2012multiplexed) that was also used in the CITRUS paper (@bruggner2014automated) and CATALYST (@crowell2020r). In this vignette, we only used a subset of the original raw cytometry data downloaded from the Bioconductor data package HDCytoData (@weber2019hdcytodata) using the command (Bodenmiller_BCR_XL_flowSet()).

cytoHDBMW data pre-processing

The cytoHDBMW data in the cytoKernel package was pre-processed using the CATALYST Bioconductor package (@crowell2020r). The data pre-processing include $4$ steps and they are as follows: 1. Creating a SingleCellExperiment Object: the flowSet data object along with the metadata are converted into a SingleCellExperiment object using the CATALYST R/Bioconductor package. 2. Clustering: We apply Louvain algorithm using the R package igraph (@csardi2006igraph) to cluster the expression values by the state markers (surface markers) (@traag2019louvain). 3. Median: Medians are calculated within a cluster for every signaling marker and subject using the scuttle Bioconductor package (@mccarthy2017scater). 4. Aggregating and converting the data: We convert the aggregated data into a SummarizedExperiment.

data("cytoHDBMW")
cytoHDBMW

Using the CytoK() function

Input for CytoK()

The CytoK() function must have two object as input: 1. object: a data frame or a matrix or a SummarizedExperiment object with abundance measurements of metabolites (features) on the rows and samples (samples) as the columns. CytoK() accepts objects which are a data frame or matrix with observations (e.g. cluster-marker combinations) on the rows and samples as the columns. 2. group_factor: a binary categorical response variable that represents the group condition for each sample. For example if the samples represent two different groups or conditions (e.g., before stimulation and after stimulation), provide CytoK() with a phenotype representing which columns in the object are different groups. 3. lowerRho: optional a positive value that represents the lower bound of the kernel parameter. Default is 2. 4. upperRho: optional a positive value that represents the upper bound of the kernel parameter. Default is 12. 5. gridRho: optional a positive value that represents the number of grid points in the interval of upper and bound of the kernel parameter. Default is 4. 6. alpha: optional level of significance to control the False Discovery rate (FDR). Default is $0.05$ (i.e., $\alpha=0.05$). 7. featureVars: optional Vector of the columns which identify features. If a SummarizedExperiment is used for data, row variables will be used. Default is NULL.

Running CytoK()

cytoHDBMW SummarizedExperiment example - Identifying differentially expressed features

We apply the CytoK procedure to identify the differentially expressed cluster-marker combinations in the cytoHDBMW data. To run the CytoK() function, we only input the data object and group factor. We obtain 3 outputs after running the CytoK() function. They are shown below:

library(cytoKernel)
CytoK_output<- CytoK(cytoHDBMW,group_factor = rep(c(0, 1),
               c(4, 4)),lowerRho=2,upperRho=12,gridRho=4,
               alpha = 0.05,featureVars = NULL)
CytoK_output
## Head of the data.frame containing shrunken effect sizes, shrunken ##effect size sd's, p values and adjusted p values
head(CytoKFeatures(CytoK_output))
## Head of the data.frame containing shrunken effect sizes, shrunken ##effect size sd's, p values and adjusted p values ordered by ##adjusted p values from low to high
head(CytoKFeaturesOrdered(CytoK_output))
## Percent of differentially expressed features
CytoKDEfeatures(CytoK_output)

Filtering the data by differentially expressed features

## Filtering the data by reproducible features
CytoKDEData_HD<- CytoKDEData(CytoK_output, by = "features")
CytoKDEData_HD

Heatmap of the expression matrix

This heatmap illustrates the expression profiles with differentially expressed features (cluster-marker combinations) on the rows and patients on the columns from the kernel-based score test implemented using the CytoK function. The differentially expressed data (matrix) can be extracted using the (CytoKDEData) function (see above). The generic heatmap of the differentially expressed expression matrix can be plotted using the plotCytoK() function. Any specific meta information can also be added using the ComplexHeatmap package. An illustration is shown below where the cluster ids are separately added onto the heatmap generated by the plotCytoK() function.

heatmap1<- plotCytoK(CytoK_output,
group_factor = rep(c(0, 1), c(4, 4)),topK=10,
featureVars = NULL)
featureOrderedExtracted<- CytoKFeaturesOrdered(CytoK_output)
rowmeta_cluster<- featureOrderedExtracted$cluster
topK<- 10
rowmeta_clusterTopK<- rowmeta_cluster[seq_len(topK)]
library(ComplexHeatmap)
heatmap2<- Heatmap(rowmeta_clusterTopK,
             name = "cluster")
heatmap2+heatmap1

Session Info

sessionInfo() 

References



Ghoshlab/cytoKernel documentation built on Dec. 17, 2021, 9:32 p.m.