knitr::opts_chunk$set(echo = TRUE)

Overview

Pseudotime analysis based on single-cell RNA-seq (scRNA-seq) data has been widely used to study dynamic gene regulatory programs along continuous biological processes such as cell differentiation, immune responses, and disease development. Existing pseudotime analysis methods primarily address the issue of reconstructing cellular pseudotemporal trajectories and inferring gene expression changes along the reconstructed trajectory in one biological sample. As scRNA-seq studies are increasingly performed on multiple patient samples, comparing gene expression dynamics across samples has been emerging as a new demand for which a systematic analytical solution is lacking.

We develop a systematic computational and statistical framework, Lamian, for multi-sample pseudotime analysis. Given scRNA-seq data from multiple biological samples with covariates (e.g., age, sex, sample type, disease status, etc.), this framework allows one to (1) construct cellular pseudotemporal trajectory, evaluate the uncertainty of the trajectory branching structure, (2) evaluate changes in branching structure associated with sample covariates, (3) identify changes in cell abundance and gene expression along the pseudotime, and (4) characterize how sample covariates modifies the pseudotemporal dynamics of cell abundance and gene expression. Importantly, when identifying cell abundance or gene expression changes associated with pseudotime and sample covariates, Lamian accounts for variability across biological samples which other existing pseudotime methods do not consider. As a result, Lamian is able to more appropriately control the false discovery rate (FDR) when analyzing multi-sample data, a property not offered by any other methods.

For more details, see our paper describing the Lamian package:

Download

Please follow the details here https://github.com/Winnie09/Lamian to download Lamian package and then load it using the following commands.

# load in Lamian
options(warn=-1)
suppressMessages(library(Lamian))

Module 1: tree variability

The module 1 of Lamian is designed for detecting the stability of branches in a pseudotime tree structure. We automatically enumerate all branches from the pseudotime tree structure, and then test for their detection rate through 10,000 bootstraps. After each cell-bootstrapping, we reconstruct the tree structure and re-identify all branches. We apply both Jaccard statistics and overlap coefficient as the statistics for evaluating whether any branch in a bootstrap setting matches with one of the original branches. A branch's detection rate is defined as the percentage of bootstrap settings that a branch finds it match. Module 1 also reports and tests for samples' proportion in each branch.

Data preparation

We will need a gene by cell expression matrix, the low-dimension representation of the cells, and the cell annotation (which sample each cell belongs to). Here, we use example data hca_bm_saver, hca_bm_pca, hca_bm_cellanno to demonstrate their data structures. Users can read in their own data of interest in this step.

data(man_tree_data)

man_tree_data is a list containing gene by cell expression matrix, low-dimension representation (PCA) of the cells, and the cell annotation where the first column are the cell names, the second column are the sample names, and the row names are the cell names.

str(man_tree_data)

Infer tree structure

set.seed(12345)
res = infer_tree_structure(pca = man_tree_data[['pca']], 
                           expression = man_tree_data[['expression']], 
                           cellanno = man_tree_data[['cellanno']], 
                           origin.marker = c('CD34'), 
                           number.cluster = 5,
                           xlab='Principal component 1', 
                           ylab = 'Principal component 2')

As we can see from the above inputs, there are five clusters, and the tree structure inferred based on these clusters consists of three branches: 2 --> 3 --> 1, 2 --> 4, and 2 --> 5. The result object res is a list containing information about the tree structure, branches, cell clusters, pseudotime ordering the cells, etc..

names(res)

Plot the tree structure

Lamian::plotmclust(res, cell_point_size = 0.5, 
                   x.lab = 'Principal component 1', 
                   y.lab = 'Principal component 2')

Evaluate the uncertainty of tree branches

We can call the function evaluate_uncertainty() to evaluate the tree topology uncertainty. We suggested that users set n.permute = 100 or more to ensure enough randomness to construct the null distribution, but here we set n.permute = 5 in order to provide a simplified example.

result <- evaluate_uncertainty(res, n.permute=5)
names(result)

Since for the simplicity of the example we set n.permute = 5 only, the branch proportions might not really make sense. When users set n.permute = 100 or more in their real applications, results will make much more sense. The result is a list of three elements. The first selement is the detection rate of each branch.

result[[1]]

The second element is the sample proportion mean information.

result[[2]]

The thrid element is the sample proportion sd (standard deviation) information.

result[[3]]

Module 2: Evaluate differential topoloty

Module 2 of Lamian first identifies variation in tree topology across samples and then assesses if there are differential topological changes associated with sample covariates. For each sample, Lamian calculates the proportion of cells in each tree branch, referred to as branch cell proportion. Because a zero or low proportion can reflect absence or depletion of a branch, changes in tree topology can be described using branch cell proportion changes. With multiple samples, Lamian characterizes the cross-sample variation of each branch by estimating the variance of the branch cell proportion across samples. Furthermore, regression models can be fit to test whether the branch cell proportion is associated with sample covariates. This allows one to identify tree topology changes between different conditions, for example in a case-control cohort.

data = result[[2]]
rownames(data) <-
  c('HSC->lymphocyte', 'HSC->myeloid', 'HSC->erythroid')
design = data.frame(sample = paste0('BM', 1, 8), sex = c(0, rep(1, 4), rep(0, 3)))
branchPropTest(data = data, design = design)

Module 3: Trajectory differential tests about gene expression

XDE test

Data preparation

Gene expression in a matrix form

In the following, we will use an example dataset expdata of 100 genes and 1000 cells to demonstration the workflow. In practice, we can input any dataset of interest.

data(expdata)

The inputs should contain: (a) expr: a gene by cell expression matrix. The entires are library-size-normalized log-transformed gene expression values. They can be either imputed or non-imputed. Zero-expression genes should have been filtered out.

str(expdata$expr)

(b) cellanno: a dataframe where the first column are cell names and second column are sample names.

head(expdata$cellanno)

(c) pseudotime: a numeric vector of pseudotime, and the names of this vector are the corresponding cell names.

str(expdata$pseudotime)

(d) design: a matrix or a data frame. The number of rows equals to the number of unique samples. Rownames are sample names. The number of columns depends on the number of sample covariates. The first column is the intercept (all 1s). The second column is the realization values of the first sample-level covariate (e.g. sample groups). Other columns are the realization values for other sample-level covariates.

print(expdata$design)

The function lamian.test() is designed to perform multiple tests. To perform the Sample Covariate Test, we need to set test.type = 'variable'.

Gene expression in hdf5 format (for ultra-large datasets)

For ultra-large datasets, for example when there are hundreds of samples each of which consists of hundreds to thousands of cells, we recommend using the hdf5 format to store the gene expression information instead of a matrix. This step significantly reduces the memory usage of each round, and therefore in a parallel mode the program can run more rounds and be completed at a faster speed.

To save the gene expression in hdf5 file format, we apply the following codes which creates a new file data/multi.h5 (file multi.h5 in the folder data/ )

saveh5(expr = expdata$expr, pseudotime = expdata$pseudotime, cellanno = expdata$cellanno, path = 'data/multi.h5')

Perform XDE test

We can perform XDE test using four inputs expr, cellanno, pseudotime, and design.

By default, test.method = 'permutation'. The default permutation is 100, i.e. permuiter = 100, but in this small example we set it as permuiter = 5 only to save the running time. We recommend setting permuiter = 100 or higher in real applications. We also provide test.method = 'chisq'.

Here, we set testvar = 2 to test the second column of the design matrix as the sample covariate while adjusting for other columns (except the intercept).

If the gene expression is in a regular matrix format,

Res <- lamian_test(
  expr = expdata$expr,
  cellanno = expdata$cellanno,
  pseudotime = expdata$pseudotime,
  design = expdata$design,
  test.type = 'variable',
  testvar = 2,
  permuiter = 5,
  ## this is for permutation test only. Alternatively, we can use test.method = 'chisq' to swich to the chi-square test.
  ncores = 1
)

If the gene expression is in an hdf5 file format,

Res <- lamian_test_h5(
  expr = 'data/multi.h5',
  cellanno = expdata$cellanno,
  pseudotime = expdata$pseudotime,
  design = expdata$design,
  test.type = 'variable',
  testvar = 2,
  permuiter = 5,
  ## this is for permutation test only. Alternatively, we can use test.method = 'chisq' to swich to the chi-square test.
  ncores = 1
)

Downstream analysis 1: visualize and cluster XDE genes based on their multiple-sample temporal patterns across sample covariates

We will know which are XDE from the statistics data frame in the result object Res.

## get differential dynamic genes statistics
stat <- Res$statistics
stat <- stat[order(stat[, 1],-stat[, 3]),]
## identify XDE genes with FDR.overall < 0.05 cutoff
diffgene <-
  rownames(stat[stat[, grep('^fdr.*overall$', colnames(stat))] < 0.05, ])

We will get the population-level estimates for all the XDE genes by applying function getPopulationFit().

## population fit
Res$populationFit <-
  getPopulationFit(Res, gene = diffgene, type = 'variable')

We can apply getCovariateGroupDiff() to calculate the group difference regarding this sample covariate, and then cluster the XDE genes based on the group difference. By setting k = 4, we will get two clusters for meanSig XDE genes, and the other two clusters for XDE genes of other types.

## clustering
Res$covariateGroupDiff <-
  getCovariateGroupDiff(testobj = Res, gene = diffgene)
Res$cluster <-
  clusterGene(Res, gene = diffgene, type = 'variable', k = 5)

We can also plot original expression values, model fitted expression values, and model-fitted group difference in three seperate heatmaps.

colnames(Res$populationFit[[1]]) <-
  colnames(Res$populationFit[[2]]) <- colnames(Res$expr)
plotXDEHm(
  Res,
  cellWidthTotal = 180,
  cellHeightTotal = 350,
  subsampleCell = FALSE,
  sep = ':.*'
)

We can plot the cluster mean and difference.

## plotClusterMeanAndDiff
plotClusterMeanAndDiff(Res)

We can also plot the cluster difference seperately by calling plotClusterDiff(testobj=Res, gene = diffgene).

Downstream analysis 2: Gene ontology (GO) enrichment analysis

We can further check out whether the XDE genes in each cluster have any enriched genome functions by applying Gene Ontology (GO) analysis library(topGO); goRes <- GOEnrich(testobj = Res, type = 'variable'). Please note that there are no enriched GO terms in any of the clusters in this simplified example since we only subset 1,000 genes. In practice, if there are significant GO terms in the clusters, we can generate a heatmap of the top GO terms using the command plotGOEnrich(goRes = goRes).

TDE test

Data preparation

The inputs for Constant Time Test are the same as those for Sample Covariate Test (see above section), except that the design matrix can have only one intercept column. If there are more than one columns in design, only the first column will be considered.

Perform TDE test

Similar to the XDE test, we can perform the TDE test using four inputs expr, cellanno, pseudotime, and design. The only difference is that design can be a one-column matrix or dataframe whose values are 1s. If there are multiple columns, only the first column will be actually used. The default permutation is 100, i.e. permuiter = 100, but in this small example we set it as permuiter = 5 again to save the running time. We recommend setting permuiter = 100 or higher in real applications.

If the gene expression is in a matrix format,

Res <- lamian_test(
  expr = expdata$expr,
  cellanno = expdata$cellanno,
  pseudotime = expdata$pseudotime,
  design = expdata$design,
  test.type = 'time',
  permuiter = 5
)

If the gene expression is in an hdf5 file format,

Res <- lamian_test_h5(
  expr = 'data/multi.h5',
  cellanno = expdata$cellanno,
  pseudotime = expdata$pseudotime,
  design = expdata$design,
  test.type = 'time',
  permuiter = 5
) ## this is for permutation test only. Alternatively, we can use test.method = 'chisq' to swich to the chi-square test. 

Downstream analysis 1: visualize and cluster TDE genes based on their multiple-sample temporal patterns

The result object is a list containing multiple elements. The first element is a dataframe of statistics.

head(Res$statistics)

We can further determine the TDE genes as the genes with fdr.overall $< 0.05$.

diffgene <- rownames(Res$statistics)[Res$statistics[, 1] < 0.05]

We can estimate the population-level model-fitted patterns for all TDE genes.

Res$populationFit <-
  getPopulationFit(Res, gene = diffgene, type = 'time')

Then we can further cluster these genes.

Res$cluster <-
  clusterGene(Res, gene = diffgene, type = 'time', k = 3)

We can visualize the temporal patterns and compare original and fitted expression

plotTDEHm(
  Res,
  subsampleCell  = FALSE,
  showCluster = TRUE,
  type = 'time',
  cellWidthTotal = 200,
  cellHeightTotal = 200
)

To plot cluster mean patterns.

plotClusterMean(testobj = Res,
                cluster = Res$cluster,
                type = 'time')

Downstream analysis 2: GO analysis for TDE genes

Similar to XDE test, we can further proceed to identify the enriched GO terms for each gene cluster using commands library(topGO); goRes <- GOEnrich(testobj = Res, type = 'time', sep = ':.*'). Please note that there are no enriched GO terms in any of the clusters in this simplified example since we only subset 1,000 genes. In practice, if there are enriched GO terms, we can plot them using commands plotGOEnrich(goRes = goRes).

Module 4: trajectory differential test based on cell composition

TCD test

Data preparation

The inputs are the same as those for TDE test (see above), except that we don not need gene expression matrix.

Perform cell proportion test

Res <-
  cellPropTest(
    cellanno = expdata$cellanno,
    pseudotime = expdata$pseudotime,
    design = expdata$design[, 1, drop = F],
    ncores = 4,
    test.type = 'Time'
  )

The dataframe statistics in the Res object contains the test result: a p-value pval.overall and a effect size z-score z.overall.

head(Res$statistics)

XCD test

The inputs are the same for XDE test (see above), except that we do not need gene expression matrix.

Res <-
  cellPropTest(
    cellanno = expdata$cellanno,
    pseudotime = expdata$pseudotime,
    design = expdata$design,
    ncores = 4,
    test.type = 'Variable',
    testvar = 2
  )

The dataframe statistics in the Res object contains the test result: a p-value pval.overall and a effect size z-score z.overall.

Session Info

sessionInfo()

Citation

If the Lamian package is useful in your work, please cite the following paper:

Maintaince or issue reports

Should you encounter any bugs or have any suggestions, please feel free to contact Wenpin Hou whou10@jhu.edu, or open an issue on the Github page https://github.com/Winnie09/Lamian/issues.



Winnie09/Lamian documentation built on July 1, 2024, 8:04 p.m.